Compare each element of CSV file to every element of a different CSV file, and find the most similar elements

Question

I have two CSV files which I need to compare. The first one is called SAP.csv, and the second is SAPH.csv.

SAP.csv has these cells:

Notification    Description
5000000001      Detailed Inspection of Masts (2100mm) (3
5000000002      Ceremonial Awnings-Survey and Load Test
5000000003      HPA-Carry out 4000 hour service routine
5000000004      UxE 8 in Number Temperature Probs for C
5000000005      Overhaul valves

...while, SAPH.csv has these cells:

Notification   Description
4000000015     Detailed Inspection of Masts (2100mm) (3
4000000016     Ceremonial Awnings-Survey and Load Test
4000000017     HPA-Carry out 8000 hour service routine
4000000018     UxE 8 in Number Temperature Probs for C
4000000019     Represerve valves
4000000020     STW System

They are similar, but some lines, like the fourth, (HPA-Carry out 4000 hour service routine vs. HPA-Carry out 8000 hour service routine), are slightly different.

I want to compare each value of SAP.csv against every value of SAPH.csv, and, using cosine similarity, find the most similar lines, so that the output would look something like this (the similarity percentages here are just examples, not what they would actually be):

Description
Detailed Inspection of Masts (2100mm) (3 - 100%
Ceremonial Awnings-Survey and Load Test  - 100%
HPA-Carry out 4000 hour service routine  - 85%
UxE 8 in Number Temperature Probs for C  - 90%
Overhaul valves                          - 0%

Post answer edit

runfile('C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py', wdir='C:/Users/andrew.stillwell2/.spyder-py3')

Traceback (most recent call last):

File "", line 1, in

runfile('C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py', wdir='C:/Users/andrew.stillwell2/.spyder-py3')

File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile

execfile(filename, namespace)

File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile

exec(compile(f.read(), filename, 'exec'), namespace)

File "C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py", line 31, in

similarity_score = similar(job, description) # Get their similarity

File "C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py", line 14, in similar

similarity = 1-textdistance.Cosine(qval=2).distance(a, b)

File "C:\ProgramData\Anaconda3\lib\site-packages extdistance\algorithms\base.py", line 173, in distance

return self.maximum(*sequences) - self.similarity(*sequences)

File "C:\ProgramData\Anaconda3\lib\site-packages extdistance\algorithms\base.py", line 176, in similarity

return self(*sequences)

File "C:\ProgramData\Anaconda3\lib\site-packages extdistance\algorithms oken_based.py", line 175, in call

return intersection / pow(prod, 1.0 / len(sequences))

ZeroDivisionError: float division by zero

2nd Edit because of solution to the above

So the original request had just two outputs - Description and Similairty score.

Description comes from SAP Similarity comes from the textdistance calc

Can the solution be ammended to the following

Notifcation (this is a 10 digit number which is in the SAP file) Description (as it currently is) Similarity (as it currently is) Notification (this number comes from the SAPH file and would be the one which provides the similarity score)

So an example row output would like this

80000115360 Additional Materials FWD Rope Guard 86.24% 7123456789

This would be along columns A, B, C, D

A, B comes from SAP C is calculated D comes from SAPH

Edit 3

File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile

execfile(filename, namespace)

File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile

exec(compile(f.read(), filename, 'exec'), namespace)

File "C:/Users/andrew.stillwell2/.spyder-py3/Est Test 2.py", line 16, in

SAP = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP.csv', dtype={'Notification':'string'})

File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f

return _read(filepath_or_buffer, kwds)

File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 429, in _read

parser = TextFileReader(filepath_or_buffer, **kwds)

File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 895, in init

self._make_engine(self.engine)

File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1122, in _make_engine

self._engine = CParserWrapper(self.f, **self.options)

File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1853, in init

self._reader = parsers.TextReader(src, **kwds)

File "pandas/_libs/parsers.pyx", line 490, in pandas._libs.parsers.TextReader.cinit

File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\common.py", line 2017, in pandas_dtype

dtype))

TypeError: data type 'string' not understood

Post edit 4 - 25/10/20

Hi, so getting the same error as before I think

This email may contain proprietary information of BAE Systems and/or third parties.

File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile

execfile(filename, namespace)

File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile

exec(compile(f.read(), filename, 'exec'), namespace)

File "C:/Users/andrew.stillwell2/.spyder-py3/Est Test 2.py", line 16, in

SAP = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP.csv', dtype={'Notification':'string'}, delimiter=",", engine="python")

File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f

return _read(filepath_or_buffer, kwds)

File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 435, in _read

data = parser.read(nrows)

File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1139, in read

ret = self._engine.read(nrows)

File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 2421, in read

data = self._convert_data(data)

File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 2487, in _convert_data

clean_conv, clean_dtypes)

File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1705, in _convert_to_ndarrays

cvals = self._cast_types(cvals, cast_type, c)

File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1808, in _cast_types

copy=True, skipna=True)

File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 623, in astype_nansafe

dtype = pandas_dtype(dtype)

File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\common.py", line 2017, in pandas_dtype

dtype))

TypeError: data type 'string' not understood

I picked up on your bit about the delimiter so I uploaded a csv file to repl.it and it looks as though "," is the delimiter.

Therefore have altered the code to suit. When I did that on repl.it it worked.

This is the code I am using

import textdistance

import pandas as pd

def similar(a, b): # adapted from here: https://stackoverflow.com/a/63838615/8402369

similarity = 1-textdistance.Cosine(qval=2).distance(a, b)

return similarity * 100

Read the CSVs

SAP = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP.csv', dtype={'Notification':'string'}, delimiter=",", engine="python")

SAPH = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP_History.csv', dtype={'Notification':'string'}, delimiter=",", engine="python")

Create a pandas dataframe to store the output. The column 'Description' is populated with the values of SAP['Description']

scores = pd.DataFrame(SAP['Description'], columns = ['Notification (SAP)','Description', 'Similarity', 'Notification (SAPH)'])

Temporary variable to store the highest similarity score

highest_score = 0

desc = 0

Iterate though SAP['Description']

for job in SAP['Description']:

highest_score = 0 # Reset highest_score in each iteration

for description in SAPH['Description']: # Iterate through SAPH['Description']

similarity_score = similar(job, description) # Get their similarity



if(similarity_score > highest_score): # Check if the similarity is higher than the already saved similarity. If so, update highest_score with the new values

  highest_score = similarity_score

  desc = str(description)

if(similarity_score == 100): # If it's a perfect match, don't bother continuing to search.

  break

Update the dataframe 'scores' with highest_score and other values

print(SAPH['Description'][SAPH['Description'] == desc])

scores['Notification (SAP)'][scores['Description'] == job] = SAP['Notification'][SAP['Description'] == job]

scores['Similarity'][scores['Description'] == job] = f'{highest_score}%'

scores['Notification (SAPH)'][scores['Description'] == job] = SAPH['Notification'][SAPH['Description'] == desc]

print(scores)

Output it to Scores.csv without the index column

with open('./Scores.csv', 'w') as file:

file.write(scores.__repr__())

Which is being run on Spyder (Python 3.7)

marsnebulasoup · Accepted Answer

@George_Pipas's answer to this question demonstrates an example using the library textdistance (I'm paraphrasing part of his answer here):

A solution is to work with the textdistance library. I will provide an example of Cosine Similarity
import textdistance
1-textdistance.Cosine(qval=2).distance('Apple', 'Appel')
and we get:
0.5

So, we can create a similarity finding function:

def similar(a, b):
    similarity = 1-textdistance.Cosine(qval=2).distance(a, b)     
    return similarity

Depending on the similarity, this'll output a number closer to 1, if a and b are more similar, and it'll output a number closer to 0 if they aren't. So if a === b, the output will be 1, but if a !== b, the output will be less than 1.

To get percentages, you just need to multiply the output by 100. Like this:

def similar(a, b): # adapted from here: https://stackoverflow.com/a/63838615/8402369
    similarity = 1-textdistance.Cosine(qval=2).distance(a, b) 
    return similarity * 100

CSV files can be read pretty easily with pandas:

# Read the CSVs
SAP = pd.read_csv('SAP.csv') 
SAPH = pd.read_csv('SAPH.csv')

We create another pandas dataframe to store the results we'll compute in:

# Create a pandas dataframe to store the output. The column 'SAP' is populated with the values of SAP['Description']
scores = pd.DataFrame({'SAP': SAP['Description']}, columns = ['SAP', 'SAPH', 'Similarity'])

Now, we iterate through SAP['Description'] and SAPH['Description'], compare each element against each other element, compute their similarity, and save the highest to scores.

# Temporary variable to store both the highest similarity score, and the 'SAPH' value the score was computed with
highest_score = {"score": 0, "description": ""}

# Iterate though SAP['Description']
for job in SAP['Description']:
  highest_score = {"score": 0, "description": ""} # Reset highest_score at each iteration
  for description in SAPH['Description']: # Iterate through SAPH['Description']
    similarity_score = similar(job, description) # Get their similarity

    if(similarity_score > highest_score['score']): # Check if the similarity is higher than the already saved similarity. If so, update highest_score with the new values
      highest_score['score'] = similarity_score
      highest_score['description'] = description
    if(similarity_score == 100): # If it's a perfect match, don't bother continuing to search.
      break
  # Update the dataframe 'scores' with highest_score
  scores['SAPH'][scores['SAP'] == job] = highest_score['description'] 
  scores['Similarity'][scores['SAP'] == job] = highest_score['score']

Here's a breakdown:

A temporary variable, highest_score is created to store, well, the highest computed scores.
Now we iterate thorough SAP['Description'], and within, iterate though SAPH['Description']. This allows us to compare each value of SAP['Description'] (job) to every value of SAPH['Description'] (description).
While iterating though SAPH['Description'], we:
1. Compute the similarity score of both job and description
2. If it's higher than the saved score in highest_score, we update highest_score accordingly; otherwise we continue
3. If similarity_score is equal to 100, we know that it's a perfect match, and don't have to keep looking. We break the loop in this case.
Outside of the SAPH['Description'] loop, now that we've compared job to each element of SAPH['Description'], (or found a perfect match), we save the values to scores.

This repeats for every element of SAP['Description'].

Here's what scores looks like when it's finished:

                                        SAP                                      SAPH Similarity
0  Detailed Inspection of Masts (2100mm) (3  Detailed Inspection of Masts (2100mm) (3        100
1   Ceremonial Awnings-Survey and Load Test   Ceremonial Awnings-Survey and Load Test        100
2   HPA-Carry out 4000 hour service routine   HPA-Carry out 8000 hour service routine    94.7368
3   UxE 8 in Number Temperature Probs for C   UxE 8 in Number Temperature Probs for C        100
4                           Overhaul valves                         Represerve valves    53.4522

And after outputting it to a CSV file with this:

# Output it to Scores.csv without the index column (0, 1, 2, 3... far left in scores above). Remove index=False if you want to keep the index column.
scores.to_csv('Scores.csv', index=False)

...Scores.csv looks like this:

SAP,SAPH,Similarity
Detailed Inspection of Masts (2100mm) (3,Detailed Inspection of Masts (2100mm) (3,100
Ceremonial Awnings-Survey and Load Test,Ceremonial Awnings-Survey and Load Test,100
HPA-Carry out 4000 hour service routine,HPA-Carry out 8000 hour service routine,94.73684210526315
UxE 8 in Number Temperature Probs for C,UxE 8 in Number Temperature Probs for C,100
Overhaul valves,Represerve valves,53.45224838248488

View the full code, and run and edit it online

Note that textdistance and pandas are required libraries for this. Install them, if you don't have them already, with:

pip install textdistance pandas

Notes:

You can round the percent by replacing f'{highest_score}%' with this: f'{round(highest_score, NUMBER_OF_PLACES_TO_ROUND_TO)}%'
Here's a formatted version, and here's the code

EDIT: (for the problems encountered that are mentioned in the comments)

Here is an error-catching version of the similarity function:

def similar(a, b): # adapted from here: https://stackoverflow.com/a/63838615/8402369
  try: 
    similarity = 1-textdistance.Cosine(qval=2).distance(a, b) 
    return similarity * 100
  except ZeroDivisionError:
    print('There was an error. Here are the values of a and b that were passed')
    print(f'a: {repr(a)}')
    print(f'b: {repr(b)}')
    exit()

Compare each element of CSV file to every element of a different CSV file, and find the most similar elements

Read the CSVs

Create a pandas dataframe to store the output. The column 'Description' is populated with the values of SAP['Description']

Temporary variable to store the highest similarity score

Iterate though SAP['Description']

Update the dataframe 'scores' with highest_score and other values

print(SAPH['Description'][SAPH['Description'] == desc])

Output it to Scores.csv without the index column

Answers (1)

Related Questions