guni867

Reputation: 1

Using Python to extract column data from Excel and tokenize it to perform distance-measurement text mining

I'm doing a project that needs a result as a similarity percentage, and I have decided to try 'distance measurement in text mining', following this guide: https://towardsdatascience.com/3-basic-distance-measurement-in-text-mining-5852becff1d7. I'm trying to extract all the data in a column into a variable in Python (using Anaconda) so that I can tokenize it into values. [image: columns of mock data] The image shows columns of mock data, but in the real data there will be thousands of rows and around 80-100 columns. There will also be a GUI where the user types inputs that will be compared with the data in the Excel file. In the end, the result should look like this mocked-up picture: [image: desired similarity result]

The problem is: data in the NUMBERS columns cannot be tokenized when I try. Excel has formula functions to compare values, but it doesn't calculate and return a percentage of similarity using the formulas from the guide I mentioned earlier. Letter data, on the other hand, can be tokenized, since the library supports it (I'm using NLTK to tokenize and openpyxl for Excel data extraction).
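For reference, here is roughly what I'm doing (a minimal sketch; the file name data.xlsx and column B are placeholders, not my real data):

import openpyxl
from nltk.tokenize import word_tokenize  # needs nltk.download('punkt') once

# Hypothetical workbook for illustration
wb = openpyxl.load_workbook('data.xlsx')
ws = wb.active

# Extract every value in column B, skipping the header row
column_values = [cell.value for cell in ws['B'][1:] if cell.value is not None]

# Text cells tokenize fine...
text_tokens = [word_tokenize(v) for v in column_values if isinstance(v, str)]

# ...but numeric cells come back as ints/floats, and word_tokenize() expects
# a string, so they fail unless cast with str() first
numbers = [v for v in column_values if isinstance(v, (int, float))]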

Also, is there any better method to get the same result? I'm open to all suggestions. Thank you.

Upvotes: 0

Views: 315

Answers (1)

Anupam Misra

Reputation: 11

  1. You can directly compare numbers

  2. To compare sentences/words, you have to either tokenize them or create word embeddings. In the TDS article, the author tokenized the text. It is better to create document embeddings using a pre-trained model and compare the cosine similarity of those embeddings, as in the short sketch below.
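A minimal illustration of that idea (a sketch; the two sentences are made up, and any sentence-transformers model would do in place of the one used here):

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('bert-large-nli-stsb-mean-tokens')

# Encode two example sentences into fixed-length embeddings
emb = model.encode(['the cat sat on the mat', 'a cat was sitting on a rug'])

# Cosine similarity near 1 means the sentences are semantically similar
print(cosine_similarity([emb[0]], [emb[1]])[0][0])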

I did the work for you below:

import pandas as pd
import numpy as np

string_available1 = ['alpha','beta','gamma','zeta','lambda','delta']
string_available2 = ['a','b','c','d','e','f']

number1_available = [43,44,45,56,57,58]
number2_available = [3,4,5,6,7,8]

"""
The available dataframe is the data that you have having two numeric and two string columns

"""

available = pd.DataFrame({'data1':string_available1,'data2':string_available2, 'numbers1':number1_available, 'numbers2':number2_available})
string_input1 = 'alpha'
string_input2 = 'c'
number_input1 = 43
number_input2 = 5


"""
inputs is your input of the *same no. of features* as the available data.

"""

inputs = pd.DataFrame({'data1':[string_input1], 'data2':[string_input2], 'numbers1':[number_input1], 'numbers2':[number_input2]})


 
"""
1. Installing package for converting text to word embeddings. I am using Hugging Face BERT model here

2. Importing cosine distances for comparing word embedding similarity

3. StandardScalar because the numeric data will be in different ranges

bert is the pre-trained model I am importing. You may read more [here][1]

4. Bringing diff. numeric scales to the same range. You may comment out this part if you want to maintain the ranges

5. Making the list of available words in the text columns

"""

from sentence_transformers import SentenceTransformer  
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler
bert = SentenceTransformer('bert-large-nli-stsb-mean-tokens')

numeric_columns = ['numbers1', 'numbers2']
text_columns = ['data1','data2']

s = StandardScaler()
s = s.fit(available[numeric_columns])
available[numeric_columns]=s.transform(available[numeric_columns])

available_text = [available[col].tolist() for col in text_columns]

def compute(inputs):
    """
    1. Converting the input text fields of the incoming data into list of words

    2. Calculating Euclidean distance between the input and the available numeric data pairwise. Then doing a MinMax scaling to obtain % of similarity.

    3. Converting the text into word embeddings using BERT and calculating the cosine distance in a pairwise manner. Then doing a MinMax scaling to obtain the % of similarity

    4. Doing a weighted sum of the text column similarities and numeric column similarities to obtain the overall similarity.

    5. Return a matrix of similarity of all the rows with the input data
    """
    input_text = [inputs[col].tolist() for col in text_columns]

    num_dists = ((available[numeric_columns].values - inputs[numeric_columns].values)**2).sum(axis=1)
    num_dists = 1 - (num_dists - min(num_dists))/max(num_dists)

    

    text_sims = np.array([cosine_similarity(bert.encode(input_text[i]), bert.encode(available_text[i]))
                          for i in range(len(text_columns))]).sum(axis=0)
    text_sims = (text_sims[0] - min(text_sims[0]))/max(text_sims[0])


    overall_sim = (num_dists*len(numeric_columns) + text_sims*len(text_columns))/(len(text_columns) + len(numeric_columns))
    return overall_sim


print(compute(inputs))
Output of the above code: [0.74293772, 0.56249175, 0.70312133, 0.49113234, 0.52342746, 0.51076918]
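As a quick sanity check: the first row scores highest (about 0.74), since it matches the input exactly on data1 ('alpha') and numbers1 (43), and the third row ('c', numbers2 = 5) comes next.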

Upvotes: 1
