Summary: Information Retrieval, tf-idf, Elasticsearch, Text Matching

What is TF-IDF?

TF-IDF stands for "Term Frequency - Inverse Document Frequency". It is a statistical technique that quantifies the importance of a word in a document, based on how often the word appears in that document and in a given collection of documents (corpus). The intuition behind this measure is: if a word occurs frequently in a document, it should be more important and relevant than words that appear fewer times, and we should give it a high score (TF). But if a word appears many times in a document and also in too many other documents, it is probably not a relevant and meaningful word, so we should assign it a lower score (IDF). The relevancy of a word is proportional to the amount of information it gives about its context (a sentence, a document or a full dataset). The more relevant words help us better understand the entire document without reading it completely. The most relevant words are not necessarily the most frequent ones, since stopwords like "the", "of" or "a" tend to occur very often in many documents but do not give much information. The TF-IDF method is widely used in Information Retrieval and Text Mining. The TF-IDF score of term $t$ in document $d$ with respect to corpus $D$ is:

$$tfidf(t,d,D)=tf(t,d)\times idf(t,D)$$

Term Frequency (TF) Score

First, we calculate $tf(t,d)$, which is based on the number of times word $t$ appears in document $d$. While calculating $tf(t,d)$, we usually remove words like "a", "as", "the". These words are called stopwords and do not provide much information. Additionally, there can be high-frequency non-stopwords that do not provide much information in a given context (e.g., "Disney" in a collection of documents about "Disney World"); we can filter them out too. Also, we normalize the term frequency to make sure there is no bias towards longer or shorter documents. Thus, we have:

$$tf(t,d)=\frac{f_{t,d}}{\sum_{t'} f_{t',d}}$$

where $f_{t,d}$ is the number of occurrences of $t$ in $d$.

Inverse Document Frequency (IDF)

It measures how rare a term $t$ is across the corpus $D$, i.e., how much information the term provides about a document it appears in. If the total number of documents in the corpus is $N=|D|$ and $n_t$ is the number of documents containing $t$, then we have:

$$idf(t,D)=\log(\frac{N}{n_t})$$

The reason we take the $\log$ in IDF is that, for a large corpus, the raw ratio $\frac{N}{n_t}$ can become very large; taking the $\log$ dampens this effect.

TF-IDF Example

In order to fully understand how TF-IDF works, I will give you a concrete example. Let's assume that we have a collection of four documents as follows:

  • $d_1$: "The sky is blue."

  • $d_2$: "The sun is bright today."

  • $d_3$: "The sun in the sky is bright."

  • $d_4$: "We can see the shining sun, the bright sun."

Task: Determine the tf-idf scores for each term in each document.

  • Step 1: Filter out the stopwords. After removing the stopwords, we have:

    • $d_1$: "sky blue"

    • $d_2$: "sun bright today"

    • $d_3$: "sun sky bright"

    • $d_4$: "can see shining sun bright sun"

  • Step 2: Compute TF: build the document-word count matrix, then normalize each row so that it sums to 1.

TF score computation.

  • Step 3: Compute IDF: find the number of documents in which each word occurs, then apply the IDF formula.

IDF score computation.

  • Step 4: Compute TF-IDF: multiply the TF and IDF scores. A short code sketch of all four steps follows below.

TF-IDF score computation.
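To make the four steps concrete, here is a minimal sketch in plain Python that reproduces them for the toy corpus above, using only the formulas from the previous sections (no sklearn):

import math
from collections import Counter

# The four documents after stopword removal (Step 1)
docs = {
    'd1': 'sky blue'.split(),
    'd2': 'sun bright today'.split(),
    'd3': 'sun sky bright'.split(),
    'd4': 'can see shining sun bright sun'.split(),
}

N = len(docs)
vocab = sorted({t for words in docs.values() for t in words})

# Step 2: normalized term frequency tf(t, d) = f_{t,d} / sum_t' f_{t',d}
tf = {}
for d, words in docs.items():
    counts = Counter(words)
    tf[d] = {t: counts[t] / len(words) for t in vocab}

# Step 3: idf(t, D) = log(N / n_t), where n_t = number of documents containing t
idf = {t: math.log(N / sum(1 for words in docs.values() if t in words)) for t in vocab}

# Step 4: tf-idf(t, d, D) = tf(t, d) * idf(t, D)
tfidf = {d: {t: tf[d][t] * idf[t] for t in vocab} for d in docs}

# 'today' appears only in d2, so it gets a relatively high score there
print(tfidf['d2']['today'])  # (1/3) * log(4/1) ~= 0.462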

Application of tf-idf for Searching Text

In order to show how to use tf-idf, I am going to apply this technique in a text searching application. I will use a dataset of Python questions and answers from Stackoverflow. The dataset contains all the questions (around 700,000) asked between August 2, 2008 and October 19, 2016. Please see the link for all the details about this dataset. For this application, I only use the Python questions; however, it would be an interesting exercise to build a question-answering application on top of it as well. Each question in this file has a title and a body, among other attributes, but I will use only these two fields.

Implementation

import os
import re
import time
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity 

preprocess[source]

preprocess(title, body=None)

Preprocess the input, i.e., lowercase it and remove html tags, special characters and digits.
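The actual implementation is behind the [source] link above; as a rough idea, a minimal sketch of preprocess consistent with this description might look like the following (the exact cleaning rules in the original may differ):

import re

def preprocess(title, body=None):
    """Lowercase the text and strip html tags, special characters and digits."""
    text = title if body is None else title + ' ' + body
    text = text.lower()
    text = re.sub(r'<[^>]+>', ' ', text)   # drop html tags
    text = re.sub(r'[^a-z]+', ' ', text)   # keep letters only (removes digits and special characters)
    return re.sub(r'\s+', ' ', text).strip()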

create_tfidf_features[source]

create_tfidf_features(corpus, max_features=5000, max_df=0.95, min_df=2)

Creates a tf-idf matrix for the corpus using sklearn.
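Again, only the signature and docstring are shown above; conceptually, create_tfidf_features is a thin wrapper around sklearn's TfidfVectorizer, roughly along these lines (the extra vectorizer settings here are assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer

def create_tfidf_features(corpus, max_features=5000, max_df=0.95, min_df=2):
    """Fit a TfidfVectorizer on the corpus and return the tf-idf matrix plus the fitted vectorizer."""
    tfidf_vectorizor = TfidfVectorizer(analyzer='word', stop_words='english',
                                       ngram_range=(1, 1), max_features=max_features,
                                       max_df=max_df, min_df=min_df)
    X = tfidf_vectorizor.fit_transform(corpus)
    print('tfidf matrix successfully created.')
    return X, tfidf_vectorizor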

calculate_similarity[source]

calculate_similarity(X, vectorizor, query, top_k=5)

Vectorizes the query via vectorizor, calculates the cosine similarity between the query and X (all the documents), and returns the top_k most similar documents.
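A minimal sketch of calculate_similarity, assuming the query is passed as a list containing a single preprocessed string (as in the usage further down):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(X, vectorizor, query, top_k=5):
    """Vectorize the query and return the indices and scores of the top_k most similar documents."""
    query_vec = vectorizor.transform(query)                          # shape: (1, n_features)
    cosine_similarities = cosine_similarity(X, query_vec).flatten()  # one score per document
    most_similar_doc_indices = np.argsort(cosine_similarities)[::-1][:top_k]
    return most_similar_doc_indices, cosine_similarities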

show_similar_documents[source]

show_similar_documents(df, cosine_similarities, similar_doc_indices)

Prints the most similar documents using indices in the similar_doc_indices vector.
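And a minimal sketch of show_similar_documents; the print format is inferred from the output shown later in this section:

def show_similar_documents(df, cosine_similarities, similar_doc_indices):
    """Print the most similar documents together with their similarity scores."""
    for rank, index in enumerate(similar_doc_indices, start=1):
        print('Top-{}, Similarity = {}'.format(rank, cosine_similarities[index]))
        print('body: {}, \n'.format(df[index]))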

The first step is to read and load the Questions.csv file. Please note that it may take several seconds to fully load the file due to the large number of records (607,282 questions) in it.

# Reading the csv file of python Questions
FILE_PATH = os.path.join('data','stackoverflow','Questions.csv')
print('Reading the Questions file...')
df = pd.read_csv(FILE_PATH, delimiter=',', encoding='ISO-8859-1')
print('done')
Reading the Questions file...
done
df.head()
Id OwnerUserId CreationDate Score Title Body
0 469 147.0 2008-08-02T15:11:16Z 21 How can I find the full path to a font from it... <p>I am using the Photoshop's javascript API t...
1 502 147.0 2008-08-02T17:01:58Z 27 Get a preview JPEG of a PDF on Windows? <p>I have a cross-platform (Python) applicatio...
2 535 154.0 2008-08-02T18:43:54Z 40 Continuous Integration System for a Python Cod... <p>I'm starting work on a hobby project with a...
3 594 116.0 2008-08-03T01:15:08Z 25 cx_Oracle: How do I iterate over a result set? <p>There are several ways to iterate over a re...
4 683 199.0 2008-08-03T13:19:16Z 28 Using 'in' to match an attribute of Python obj... <p>I don't remember whether I was dreaming or ...

Let's take a look at a randomly selected question.

sample_index = np.random.randint(len(df))
sample = df.loc[sample_index,['Title', 'Body']]
print('title: {}, \nbody: {}'.format(sample['Title'],sample['Body']))
title: Python: print() always writes to .ipynb making file too large, 
body: <p>To cut to the chase, I'd like to know how to make print statements display in ipython notebooks while simultaneously preventing those statements from saving to the .ipynb file. The purpose being to make a progress bar which doesn't make the file size ridiculously large.</p>

<p>The background to this is that I've been writing a bit of python code which makes a bunch of png files so that I can eventually compile them into a GIF. While I was doing this I thought I'd be clever and print the progress of the task as it went using <code>print()</code> from <code>__future__</code> with carriage returns. Unfortunately though I've had two problem with my code, both of which I imagine are related to my implementation of this progress message.</p>

<p>The first problem is with github's limit on file size:</p>

<p>When I first tried to upload my file to github it prevented me from doing so because it exceeded their 100 MB limit. After investigating my .ipynb file I found that there was an obscene number of print statements which were being saved there. Initially I'd thought that including <code>'\r'</code> to do carriage returns would prevent this, but clearly that's not the case.</p>

<p>The second problem is probably related to this:</p>

<p>Typically I don't have a problem creating the first few GIFs especially if I don't include that many frames, however beyond that my python notebook crashes. If this were a typical memory problem I'd imagine that it would just throw an error at me, but it doesn't, and instead promptly dies on me.</p>

<p>Here's a sample of the sort of stuff that's bloating the .ipynb file:</p>

<pre><code>{
   "output_type": "stream",
   "stream": "stdout",
   "text": [
    "\r",
    "frame 1 -- 0.480% complete -- U_avg = -7.200000\r",
    "frame 1 -- 0.500% complete -- U_avg = -7.200000\r",
    "frame 1 -- 0.520% complete -- U_avg = -7.200000\r",
    "frame 1 -- 0.540% complete -- U_avg = -7.200000\r",
    "frame 1 -- 0.560% complete -- U_avg = -7.200000\r",
    "frame 1 -- 0.580% complete -- U_avg = -7.200000\r",
    "frame 1 -- 0.600% complete -- U_avg = -7.200000\r",
    "frame 1 -- 0.620% complete -- U_avg = -7.200000\r",
    "frame 1 -- 0.640% complete -- U_avg = -7.200000\r",
    "frame 1 -- 0.660% complete -- U_avg = -7.200000\r",
    "frame 1 -- 0.680% complete -- U_avg = -7.200000\r",
    "frame 1 -- 0.700% complete -- U_avg = -7.200000\r",
    "frame 1 -- 0.720% complete -- U_avg = -7.200000\r", ......
</code></pre>

<p>I've looked into how other people have implemented progress bars, however they don't appear to do anything special which would actually prevent the problem I'm having. If it comes down to it, I wouldn't mind importing something which would solve this problem in a black box, but at the same time if I run into this issue in a different context it would be useful to know how to reduce .ipynb file sizes by cutting out saved print statements.</p>

<p>Thanks!</p>

Preprocessing is one of the major steps when we are dealing with any kind of text model. As you can see above, the body of a question (and this is true for all the questions) contains plenty of html tags and special characters, so we need to get rid of them as much as we can. The preprocess() function cleans the questions by removing html tags, special characters and digits. As usual, there is always room for improvement by adding more cleaning rules such as stemming, lemmatization, stop word removal, etc.

# Preprocess the corpus
data = [preprocess(title, body) for title, body in zip(df['Title'], df['Body'])]

After we load and clean the data, it's time to create the term-document matrix. We could write simple functions for computing tf (term frequency) and idf (inverse document frequency) ourselves; however, I leave that as an interesting exercise. Instead, I'll use sklearn's TfidfVectorizer to compute the word counts, idf and tf-idf values all at once. You can find all the details about TfidfVectorizer here. I would like to mention that in the create_tfidf_features() function, I restrict the size of the vocabulary (i.e. the number of features) to 5000 to make the computations cheaper.

print('creating tfidf matrix...')
# Learn vocabulary and idf, return term-document matrix
X,v = create_tfidf_features(data)
features = v.get_feature_names()
len(features)
creating tfidf matrix...
tfidf matrix successfully created.
5000

Now it's time to test and see how our application works. We can ask an arbitrary question and see whether the system can find the top-k questions in the dataset that are most similar to it.

user_question = ['how to loop over files in a directory']
search_start = time.time()
sim_vecs, cosine_similarities = calculate_similarity(X, v, user_question)
search_time = time.time() - search_start
print("search time: {:.2f} ms".format(search_time * 1000))
print()
show_similar_documents(data, cosine_similarities, sim_vecs)
search time: 380.51 ms

Top-1, Similarity = 0.6605864924688081
body: loop through fixed number of files within a directory how can i loop through a fixed number of files within a directory with glob glob if there s more than x files within that directory i only want to loop through x and then exit the loop how do i do this, 

Top-2, Similarity = 0.6436681516641347
body: how to list all files of a directory in python how can i list all files of a directory in python and add them to a list, 

Top-3, Similarity = 0.5968769006998226
body: extract all zipped files to same directory using python i have a large amount of zipped files in a single directory that i would like to decompress and save them to the same directory and with the same name as the zipped file, 

Top-4, Similarity = 0.5377592889154766
body: deleting all files in a directory with python i want to delete all files with the extension bak in a directory how can i do that in python, 

Top-5, Similarity = 0.5202741533554684
body: move files from one directory to another based on a specific line i have some text files in a directory i would like to read all the files in this directory and if the file has the following line sample an integer of type decimal can be assembled by move that file with its all contents to another directory how can i do this, 

How to Use Elasticsearch for Indexing and Retrieving Text

What is Elasticsearch?

Elasticsearch is an open source, distributed, RESTful search and analytics engine. Elasticsearch enables us to index, search, and analyze data at large scale. It provides real-time search and analytics for various types of data, including structured or unstructured text, numerical data, and geospatial data, and it can efficiently store and index them in a way that supports fast searches. In order to learn Elasticsearch, please see the documentation; a full treatment is out of the scope of this tutorial, so I leave understanding how Elasticsearch works as an exercise.

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
import json
import time

create_index[source]

create_index(es_client)

Creates an Elasticsearch index.
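As with the tf-idf helpers, only the signature is shown above; a minimal sketch of create_index against the elasticsearch-py client (assuming Elasticsearch 7.x; the mapping below is an assumption and the original index settings may differ):

INDEX_NAME = 'python_questions'

def create_index(es_client):
    """(Re)create the index that will hold the questions."""
    print('Creating `Question` index...')
    es_client.indices.delete(index=INDEX_NAME, ignore=[404])   # drop a stale index if present
    es_client.indices.create(
        index=INDEX_NAME,
        body={'mappings': {'properties': {'body': {'type': 'text'}}}},
    )
    print('index `Question` created successfully.')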

index_data[source]

index_data(es_client, data, BATCH_SIZE=100000)

Indexes all the rows in data (the Python questions).

index_batch[source]

index_batch(docs)

Indexes a batch of documents.
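Sketches of index_data and index_batch, using the bulk helper imported above. The bulk action format and the field name body are assumptions based on the search results shown at the end of this section:

from elasticsearch.helpers import bulk

def index_batch(docs):
    """Send one batch of documents to Elasticsearch with a single bulk request."""
    # uses the module-level es_client and INDEX_NAME defined alongside this function
    requests = [{'_op_type': 'index', '_index': INDEX_NAME, 'body': doc} for doc in docs]
    bulk(es_client, requests)

def index_data(es_client, data, BATCH_SIZE=100000):
    """Index all the preprocessed questions in batches of BATCH_SIZE."""
    docs = []
    count = 0
    for line in data:
        docs.append(line)
        count += 1
        if count % BATCH_SIZE == 0:
            index_batch(docs)
            docs = []
            print('Indexed {} documents.'.format(count))
    if docs:
        index_batch(docs)
        print('Indexed {} documents.'.format(count))
    print('Done indexing.')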

run_query_loop[source]

run_query_loop()

Asks the user to enter a query to search for.

handle_query[source]

handle_query()

Searches for the user query and prints the best matches found by Elasticsearch.
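Finally, a minimal sketch of run_query_loop and handle_query, assuming an Elasticsearch 7.x response format (hits.total.value); the query body matches the one printed in the output below, while the input prompt and loop-exit behaviour are assumptions:

import time

SEARCH_SIZE = 3

def run_query_loop():
    """Keep asking the user for queries until interrupted (Ctrl-C)."""
    while True:
        try:
            handle_query()
        except KeyboardInterrupt:
            return

def handle_query():
    """Read a query from the user and print the best matches from Elasticsearch."""
    query = input('Enter query: ')
    search = {'size': SEARCH_SIZE, 'query': {'match': {'body': query}}}
    print(search)

    search_start = time.time()
    response = es_client.search(index=INDEX_NAME, body=search)
    search_time = time.time() - search_start

    print()
    print('{} total hits.'.format(response['hits']['total']['value']))
    print('search time: {:.2f} ms'.format(search_time * 1000))
    for hit in response['hits']['hits']:
        print('id: {}, score: {}'.format(hit['_id'], hit['_score']))
        print(hit['_source'])
        print()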

INDEX_NAME = 'python_questions'
es_client = Elasticsearch()
create_index(es_client)
index_data(es_client, data)
Creating `Question` index...
index `Question` created successfully.
Indexed 100000 documents.
Indexed 200000 documents.
Indexed 300000 documents.
Indexed 400000 documents.
Indexed 500000 documents.
Indexed 600000 documents.
Indexed 607282 documents.
Done indexing.
SEARCH_SIZE = 3
run_query_loop()
{'size': 3, 'query': {'match': {'body': 'how to loop over files in a directory'}}}

10000 total hits.
search time: 17.38 ms
id: AnG2e3EBroreQxGKfg6x, score: 19.826128
{'body': 'looping over filenames in python i have a zillion files in a directory i want a script to run on they all have a filename like prefix_foo_ _asdf_asdfasdf csv i know how to loop over files in a directory using a variable in the filename in shell but not python is there a corresponding way to do something like i for i lt process py prefix_foo_ i_ i endloop'}

id: 8nW3e3EBroreQxGKGSnb, score: 19.142948
{'body': 'iterate and delete the files in a directory i am trying to iterate over a few files from a directory then i am copying them in a group based on their initial name to a particular location and then deleting them from the current direcory but since i delete them after grouping them together i get a file not found exception when the loop move over to the next file which is deleted how can i resolve it here is my code import os import csv import glob fnmatch shutil time ftp_directory c cirp velocidata test ifind_location c cirp velocidata test getting the list of all files for root dirs files in os walk ftp_directory filtering for group names that are inprogress groups_beingworked for name in files getting the exception in the below line in the second iteration group name name lower find infile for loop in files if loop len group lower group groups_beingworked append loop for loop in groups_beingworked shutil copy os path join root loop ifind_location print deleting the file loop file removed from the ftp location through a ftp conn created elsewhere ftpconn delete os path join root loop list of sample files i am trying to go through is file new infile type file new infile type file old infile type file new infile type file old infile type file rec infile type'}

id: fna3e3EBroreQxGKU39k, score: 18.82203
{'body': 'python opening images in the fabio module by looping over a directory i m attempting to deal with cbf crystallographic binary format see below for link files in python i need a way of looping over all the files in the current directory example reading in first file in fabio dat raw_input please input required filename define the required filename as a string example input file cbf import fabio import fabio module for python img_ fabio open dat open image from defined filename this section of the code designed to open and display a file works perfectly fabio has a method of opening the next file available which is in this case of the format example img_ img_ next as i have already defined img_ in example this code would work how would i loop over all the files in the current directory without needing to execute the command in example for every file if there were files would it be something of the form example for i in range img_ i img_ i next how can i do this loop whilst also accounting for the leading zeros any help would be greatly appreciated thanks relevant information cbf files http www esrf eu computing forum imgcif cbf_definition html fabio module http pythonhosted org fabio getting_started html'}