Reputation: 13
I have two CSV files which I need to compare. The first one is called SAP.csv, and the second is SAPH.csv.
SAP.csv has these cells:
Notification Description
5000000001 Detailed Inspection of Masts (2100mm) (3
5000000002 Ceremonial Awnings-Survey and Load Test
5000000003 HPA-Carry out 4000 hour service routine
5000000004 UxE 8 in Number Temperature Probs for C
5000000005 Overhaul valves
...while, SAPH.csv has these cells:
Notification Description
4000000015 Detailed Inspection of Masts (2100mm) (3
4000000016 Ceremonial Awnings-Survey and Load Test
4000000017 HPA-Carry out 8000 hour service routine
4000000018 UxE 8 in Number Temperature Probs for C
4000000019 Represerve valves
4000000020 STW System
They are similar, but some lines, like the fourth, (HPA-Carry out 4000 hour service routine vs. HPA-Carry out 8000 hour service routine), are slightly different.
I want to compare each value of SAP.csv against every value of SAPH.csv, and, using cosine similarity, find the most similar lines, so that the output would look something like this (the similarity percentages here are just examples, not what they would actually be):
Description
Detailed Inspection of Masts (2100mm) (3 - 100%
Ceremonial Awnings-Survey and Load Test - 100%
HPA-Carry out 4000 hour service routine - 85%
UxE 8 in Number Temperature Probs for C - 90%
Overhaul valves - 0%
Post answer edit
runfile('C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py', wdir='C:/Users/andrew.stillwell2/.spyder-py3')
Traceback (most recent call last):
File "", line 1, in
runfile('C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py', wdir='C:/Users/andrew.stillwell2/.spyder-py3')
File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
execfile(filename, namespace)
File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py", line 31, in
similarity_score = similar(job, description) # Get their similarity
File "C:/Users/andrew.stillwell2/.spyder-py3/Estimating Test.py", line 14, in similar
similarity = 1-textdistance.Cosine(qval=2).distance(a, b)
File "C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\base.py", line 173, in distance
return self.maximum(*sequences) - self.similarity(*sequences)
File "C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\base.py", line 176, in similarity
return self(*sequences)
File "C:\ProgramData\Anaconda3\lib\site-packages\textdistance\algorithms\token_based.py", line 175, in call
return intersection / pow(prod, 1.0 / len(sequences))
ZeroDivisionError: float division by zero
2nd Edit because of solution to the above
So the original request had just two outputs - Description and Similairty score.
Description comes from SAP Similarity comes from the textdistance calc
Can the solution be ammended to the following
Notifcation (this is a 10 digit number which is in the SAP file) Description (as it currently is) Similarity (as it currently is) Notification (this number comes from the SAPH file and would be the one which provides the similarity score)
So an example row output would like this
80000115360 Additional Materials FWD Rope Guard 86.24% 7123456789
This would be along columns A, B, C, D
A, B comes from SAP C is calculated D comes from SAPH
Edit 3
File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
execfile(filename, namespace)
File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/andrew.stillwell2/.spyder-py3/Est Test 2.py", line 16, in
SAP = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP.csv', dtype={'Notification':'string'})
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 429, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 895, in init
self._make_engine(self.engine)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1122, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1853, in init
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 490, in pandas._libs.parsers.TextReader.cinit
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\common.py", line 2017, in pandas_dtype
dtype))
TypeError: data type 'string' not understood
Post edit 4 - 25/10/20
Hi, so getting the same error as before I think
This email may contain proprietary information of BAE Systems and/or third parties.
File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
execfile(filename, namespace)
File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/andrew.stillwell2/.spyder-py3/Est Test 2.py", line 16, in
SAP = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP.csv', dtype={'Notification':'string'}, delimiter=",", engine="python")
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 435, in _read
data = parser.read(nrows)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1139, in read
ret = self._engine.read(nrows)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 2421, in read
data = self._convert_data(data)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 2487, in _convert_data
clean_conv, clean_dtypes)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1705, in _convert_to_ndarrays
cvals = self._cast_types(cvals, cast_type, c)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1808, in _cast_types
copy=True, skipna=True)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 623, in astype_nansafe
dtype = pandas_dtype(dtype)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\dtypes\common.py", line 2017, in pandas_dtype
dtype))
TypeError: data type 'string' not understood
I picked up on your bit about the delimiter so I uploaded a csv file to repl.it and it looks as though "," is the delimiter.
Therefore have altered the code to suit. When I did that on repl.it it worked.
This is the code I am using
import textdistance
import pandas as pd
def similar(a, b): # adapted from here: https://stackoverflow.com/a/63838615/8402369
similarity = 1-textdistance.Cosine(qval=2).distance(a, b)
return similarity * 100
SAP = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP.csv', dtype={'Notification':'string'}, delimiter=",", engine="python")
SAPH = pd.read_csv('H:\Documents/Python/Import into Python/SAP/SAP_History.csv', dtype={'Notification':'string'}, delimiter=",", engine="python")
scores = pd.DataFrame(SAP['Description'], columns = ['Notification (SAP)','Description', 'Similarity', 'Notification (SAPH)'])
highest_score = 0
desc = 0
for job in SAP['Description']:
highest_score = 0 # Reset highest_score in each iteration
for description in SAPH['Description']: # Iterate through SAPH['Description']
similarity_score = similar(job, description) # Get their similarity
if(similarity_score > highest_score): # Check if the similarity is higher than the already saved similarity. If so, update highest_score with the new values
highest_score = similarity_score
desc = str(description)
if(similarity_score == 100): # If it's a perfect match, don't bother continuing to search.
break
scores['Notification (SAP)'][scores['Description'] == job] = SAP['Notification'][SAP['Description'] == job]
scores['Similarity'][scores['Description'] == job] = f'{highest_score}%'
scores['Notification (SAPH)'][scores['Description'] == job] = SAPH['Notification'][SAPH['Description'] == desc]
print(scores)
with open('./Scores.csv', 'w') as file:
file.write(scores.__repr__())
Which is being run on Spyder (Python 3.7)
Upvotes: 0
Views: 1148
Reputation: 2660
@George_Pipas's answer to this question demonstrates an example using the library textdistance
(I'm paraphrasing part of his answer here):
A solution is to work with the
textdistance
library. I will provide an example ofCosine Similarity
import textdistance 1-textdistance.Cosine(qval=2).distance('Apple', 'Appel')
and we get:
0.5
So, we can create a similarity finding function:
def similar(a, b):
similarity = 1-textdistance.Cosine(qval=2).distance(a, b)
return similarity
Depending on the similarity, this'll output a number closer to 1, if a
and b
are more similar, and it'll output a number closer to 0 if they aren't. So if a === b
, the output will be 1
, but if a !== b
, the output will be less than 1.
To get percentages, you just need to multiply the output by 100. Like this:
def similar(a, b): # adapted from here: https://stackoverflow.com/a/63838615/8402369
similarity = 1-textdistance.Cosine(qval=2).distance(a, b)
return similarity * 100
CSV files can be read pretty easily with pandas
:
# Read the CSVs
SAP = pd.read_csv('SAP.csv')
SAPH = pd.read_csv('SAPH.csv')
We create another pandas dataframe to store the results we'll compute in:
# Create a pandas dataframe to store the output. The column 'SAP' is populated with the values of SAP['Description']
scores = pd.DataFrame({'SAP': SAP['Description']}, columns = ['SAP', 'SAPH', 'Similarity'])
Now, we iterate through SAP['Description']
and SAPH['Description']
, compare each element against each other element, compute their similarity, and save the highest to scores
.
# Temporary variable to store both the highest similarity score, and the 'SAPH' value the score was computed with
highest_score = {"score": 0, "description": ""}
# Iterate though SAP['Description']
for job in SAP['Description']:
highest_score = {"score": 0, "description": ""} # Reset highest_score at each iteration
for description in SAPH['Description']: # Iterate through SAPH['Description']
similarity_score = similar(job, description) # Get their similarity
if(similarity_score > highest_score['score']): # Check if the similarity is higher than the already saved similarity. If so, update highest_score with the new values
highest_score['score'] = similarity_score
highest_score['description'] = description
if(similarity_score == 100): # If it's a perfect match, don't bother continuing to search.
break
# Update the dataframe 'scores' with highest_score
scores['SAPH'][scores['SAP'] == job] = highest_score['description']
scores['Similarity'][scores['SAP'] == job] = highest_score['score']
Here's a breakdown:
highest_score
is created to store, well, the highest computed scores.SAP['Description']
, and within, iterate though SAPH['Description']
. This allows us to compare each value of SAP['Description']
(job
) to every value of SAPH['Description']
(description
).SAPH['Description']
, we:
job
and description
highest_score
, we update highest_score
accordingly; otherwise we continuesimilarity_score
is equal to 100
, we know that it's a perfect match, and don't have to keep looking. We break the loop in this case.SAPH['Description']
loop, now that we've compared job
to each element of SAPH['Description']
, (or found a perfect match), we save the values to scores
.This repeats for every element of SAP['Description']
.
Here's what scores
looks like when it's finished:
SAP SAPH Similarity
0 Detailed Inspection of Masts (2100mm) (3 Detailed Inspection of Masts (2100mm) (3 100
1 Ceremonial Awnings-Survey and Load Test Ceremonial Awnings-Survey and Load Test 100
2 HPA-Carry out 4000 hour service routine HPA-Carry out 8000 hour service routine 94.7368
3 UxE 8 in Number Temperature Probs for C UxE 8 in Number Temperature Probs for C 100
4 Overhaul valves Represerve valves 53.4522
And after outputting it to a CSV file with this:
# Output it to Scores.csv without the index column (0, 1, 2, 3... far left in scores above). Remove index=False if you want to keep the index column.
scores.to_csv('Scores.csv', index=False)
...Scores.csv looks like this:
SAP,SAPH,Similarity
Detailed Inspection of Masts (2100mm) (3,Detailed Inspection of Masts (2100mm) (3,100
Ceremonial Awnings-Survey and Load Test,Ceremonial Awnings-Survey and Load Test,100
HPA-Carry out 4000 hour service routine,HPA-Carry out 8000 hour service routine,94.73684210526315
UxE 8 in Number Temperature Probs for C,UxE 8 in Number Temperature Probs for C,100
Overhaul valves,Represerve valves,53.45224838248488
View the full code, and run and edit it online
Note that textdistance
and pandas
are required libraries for this. Install them, if you don't have them already, with:
pip install textdistance pandas
Notes:
f'{highest_score}%'
with this: f'{round(highest_score, NUMBER_OF_PLACES_TO_ROUND_TO)}%'
EDIT: (for the problems encountered that are mentioned in the comments)
Here is an error-catching version of the similarity function:
def similar(a, b): # adapted from here: https://stackoverflow.com/a/63838615/8402369
try:
similarity = 1-textdistance.Cosine(qval=2).distance(a, b)
return similarity * 100
except ZeroDivisionError:
print('There was an error. Here are the values of a and b that were passed')
print(f'a: {repr(a)}')
print(f'b: {repr(b)}')
exit()
Upvotes: 2