Reputation: 338
I've got a DataFrame (read from a CSV file with various columns) with 172033 rows. I've created a custom indexing function that blocks pairs of records that don't have similar 'name' attributes. The problem is the efficiency of the algorithm: just getting to the 10th iteration takes about a minute, so indexing the whole dataset would take far too long. How can I make my algorithm more efficient?
class CustomIndex(BaseIndexAlgorithm):
    def _link_index(self, df_a, df_b):
        indici1 = []
        indici2 = []
        for i in range(0, 173033):
            if i % 2 == 0:
                print(i)  # keeps track of the iteration
            for j in range(i, 173033):
                if similar(df_a.loc[i, 'name'], df_a.loc[j, 'name']) > 0.5:
                    indici1.append(i)
                    indici2.append(j)
        indici = [indici1, indici2]
        return pd.MultiIndex.from_arrays(indici, names=('first', 'second'))
I want to obtain a MultiIndex object: an array of tuples containing the indices of the pairs of records that are similar enough not to be blocked.
MultiIndex([(0,      0),
            (0,  22159),
            (0,  67902),
            (0,  67903),
            (1,      1),
            (1,   1473),
            (1,   5980),
            (1, 123347),
            (2,      2),
            ...
Here's the code for the similarity function:
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
Here's an example of the dataframe I have as input:
name
0 Amazon
1 Walmart
2 Apple
3 Amazon.com
4 Walmart Inc.
I would like the resulting MultiIndex to link 0 with 3 and 1 with 4, plus all the trivial self-pairs (0 and 0, 1 and 1, etc.).
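For these sample names, the similar function above can be sanity-checked directly with difflib:

```python
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

# "Amazon" vs "Amazon.com": all 6 chars match -> 2*6 / (6+10) = 0.75
print(similar('Amazon', 'Amazon.com'))    # 0.75
# "Walmart" vs "Walmart Inc.": 2*7 / (7+12) ~ 0.74
print(similar('Walmart', 'Walmart Inc.'))
# "Amazon" vs "Apple": only the leading 'A' matches -> 2*1 / (6+5) ~ 0.18
print(similar('Amazon', 'Apple'))
```

So at the 0.5 threshold, 0 links to 3 and 1 links to 4, while cross-company pairs fall well below it.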
Upvotes: 0
Views: 244
Reputation: 338
In the end this is what I came up with. After sorting the dataset alphabetically by the 'name' attribute, I use this custom index. Every row in the dataset is compared to its neighbours within a range of 100 rows.
import time

class CustomIndex(BaseIndexAlgorithm):
    def _link_index(self, df_a, df_b):
        t0 = time.time()
        indici1 = []
        indici2 = []
        for i in range(1, 173034):
            if i % 500 == 0:
                print(i)
            if i < 100:
                n = i
                for j in range((i - (n - 1)), (i + 100)):
                    if similar(df_a.loc[i, 'name'], df_a.loc[j, 'name']) > 0.35:
                        indici1.append(i)
                        indici2.append(j)
            elif i > 172932:
                n = 173033 - i
                for j in range((i - 100), (i + n)):
                    if similar(df_a.loc[i, 'name'], df_a.loc[j, 'name']) > 0.35:
                        indici1.append(i)
                        indici2.append(j)
            else:
                for j in range((i - 100), (i + 100)):
                    if similar(df_a.loc[i, 'name'], df_a.loc[j, 'name']) > 0.35:
                        indici1.append(i)
                        indici2.append(j)
        indici = [indici1, indici2]
        t1 = time.time()
        print(t1 - t0)
        return pd.MultiIndex.from_arrays(indici, names=('first', 'second'))
Indexing the dataset takes about 20 minutes on my machine which is great compared to before. Thanks everyone for the help!
I'll be looking into the locality-sensitive hashing suggested by Pierre D.
Upvotes: 0
Reputation: 18141
As others have pointed out, the solution to your problem requires O(N^2) running time, which means it won't scale well for very large datasets. Nonetheless, I think there's still a lot of room for improvement.
Here are some strategies you can use to speed up your code:
- If your dataset contains many duplicate name values, you can use "memoization" to avoid re-computing the similar score for duplicate name pairs. Of course, caching all 172k^2 pairs would be devastatingly expensive, but if the data is pre-sorted by name, then an lru_cache with 172k items should work just fine.
- Looking at the difflib documentation, it appears that you have the option of quickly filtering out "obvious" mismatches. If you expect most pairs to be "easy" to eliminate from consideration, then it makes sense to first call SequenceMatcher.quick_ratio() (or even real_quick_ratio()), followed by ratio() only if necessary.
- There will be some overhead in the ordinary control flow: calling df.loc many times in a for-loop might be a bit slow in comparison to simple iteration. Use itertools.combinations to avoid writing a nested for-loop yourself.
- tqdm provides a convenient progress bar, which will give a better indication of true progress than the print statements in your code.

Lastly, I saw no need for the df_b parameter in your function above, so I didn't include it in the code below. Here's the full solution:
import math

import pandas as pd
from difflib import SequenceMatcher
from functools import lru_cache
from itertools import combinations
from tqdm import tqdm

@lru_cache(173_000)
def is_similar(a, b):
    matcher = SequenceMatcher(None, a, b)
    if matcher.quick_ratio() <= 0.5:
        return False
    return matcher.ratio() > 0.5

def link_index(df):
    # We initialize the index result pairs with [(0, 0), (1, 1), (2, 2), ...]
    # because they are trivially "linked" and your problem statement
    # says you want them in the results.
    indici1 = df.index.tolist()
    indici2 = df.index.tolist()
    # Sort the names so that our lru_cache is effective,
    # even though it is limited to 173k entries.
    name_items = df['name'].sort_values().items()
    pairs = combinations(name_items, 2)
    num_pairs = math.comb(len(df), 2)
    for (i, i_name), (j, j_name) in tqdm(pairs, total=num_pairs):
        if is_similar(i_name, j_name):
            indici1.append(i)
            indici2.append(j)
    indici = [indici1, indici2]
    links = pd.MultiIndex.from_arrays(indici, names=('first', 'second'))
    return links.sortlevel([0, 1])[0]
Quick Test:
names = ['Amazon', 'Walmart', 'Apple', 'Amazon.com', 'Walmart Inc.']
df = pd.DataFrame({'name': names})
link_index(df)
Output:
MultiIndex([(0, 0),
            (0, 3),
            (1, 1),
            (1, 4),
            (2, 2),
            (3, 3),
            (4, 4)],
           names=['first', 'second'])
Let me know if that speeds things up on your actual data!
Let's set some realistic expectations.
Your original estimate was ~1 minute for 10 "iterations". That implies the total time would have been ~6 days:
print(math.comb(172033, 2) / (10*172033) / 60 / 24)
On the other hand, merely iterating through the full set of i,j
combinations and doing absolutely nothing with them would take ~45 minutes on my machine. See for yourself:
sum(1 for _ in tqdm(combinations(range(172033), 2), total=math.comb(172033, 2)))
So the real solution will take longer than that. Now you've got some bounds on what the optimal solution will require: Somewhere between ~1 hour and ~6 days. Hopefully it's closer to the former!
Upvotes: 1
Reputation: 20450
We asked you to reveal the similar()
function but you've declined to do so.
You wrote
for i in range(0, 173033):
    ...
    for j in range(i, 173033):
        if similar(df_a.loc[i, 'name'], df_a.loc[j, 'name']) > 0.5:
So you plan to call that mysterious similar
function 14_970_296_061 (about 15 billion) times, since the inner loop starts at j = i.
I'm guessing that's going to take some little while, maybe a week?
We describe this as O(n^2) quadratic performance.
Here is the way out.
First, sort your dataset. It will only cost O(n log n), cheap!
Next, find something in the similar
loss function that allows you to
simplify the problem, or design a new related loss function which allows that.
Spoiler alert -- we can only do comparisons against local neighbors,
not against far flung entries more than 100,000 positions away.
I will focus on the five example words.
It seems likely that your mysterious function reports
a large value for similar("Walmart", "Walmart Inc.")
and a small value when comparing those strings against "Apple"
.
So let's adopt a new cost-function rule: if the initial characters of the two strings are not identical, the similarity is immediately reported as 0.
So now we're faced with comparing Apple against two Amazons, and Walmart with itself. We have partitioned the problem into "A" and "W" subsets. The quadratic monster is starting to die.
Minor side note: since your similarity function likely is
symmetric, it suffices to examine just half the square,
where i <= j
-- which your range(i, ...) inner loop already does.
A practical implementation for a dataset of this size is going to need to be more aggressive, perhaps insisting that initial 2 or 3 letters be identical. Perhaps you can run a hyphenation algorithm and look at identical syllables. Or use Metaphone or Soundex.
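As a sketch of that partitioning idea (my own illustrative code, not a drop-in replacement: blocked_pairs, prefix_len, and the 0.5 threshold are all assumptions):

```python
from difflib import SequenceMatcher
from itertools import combinations, groupby

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def blocked_pairs(names, prefix_len=1, threshold=0.5):
    """Compare only names whose first `prefix_len` characters match."""
    pairs = []
    # Sorting by name puts identical prefixes next to each other,
    # which is what groupby needs to form the blocks.
    indexed = sorted(enumerate(names), key=lambda t: t[1])
    for _, group in groupby(indexed, key=lambda t: t[1][:prefix_len]):
        block = list(group)
        # Every row is trivially linked to itself.
        pairs.extend((i, i) for i, _ in block)
        # The quadratic comparison now runs only inside each block.
        for (i, a), (j, b) in combinations(block, 2):
            if similar(a, b) > threshold:
                pairs.append((i, j) if i < j else (j, i))
    return sorted(pairs)

names = ['Amazon', 'Walmart', 'Apple', 'Amazon.com', 'Walmart Inc.']
print(blocked_pairs(names))
# [(0, 0), (0, 3), (1, 1), (1, 4), (2, 2), (3, 3), (4, 4)]
```

With many blocks of roughly equal size, the cost drops from one n^2 pass to a sum of much smaller squares.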
The simplest possible thing you could do is define a
window W of maybe 10. When examining the sorted dataset,
we arbitrarily declare that the similarity between entries i and j
shall be zero if abs(i - j) > W.
This may impact accuracy, but it's great for performance:
it lets you prune the search space before you even call the function.
We went from O(n^2) to O(n) linear.
Perfect accuracy is hardly relevant if you never wait
long enough for the code to produce an answer.
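A minimal sketch of that window idea, assuming the names fit in memory (window_pairs, W, and the threshold are illustrative choices, not a tuned solution):

```python
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def window_pairs(names, W=10, threshold=0.5):
    """After sorting, only compare entries at most W positions apart."""
    order = sorted(range(len(names)), key=lambda i: names[i])
    pairs = [(i, i) for i in range(len(names))]  # trivial self-links
    for pos, i in enumerate(order):
        # Each sorted row is compared only to its next W neighbours,
        # so the total work is O(n * W) instead of O(n^2).
        for j in order[pos + 1:pos + 1 + W]:
            if similar(names[i], names[j]) > threshold:
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)

names = ['Amazon', 'Walmart', 'Apple', 'Amazon.com', 'Walmart Inc.']
print(window_pairs(names))
# [(0, 0), (0, 3), (1, 1), (1, 4), (2, 2), (3, 3), (4, 4)]
```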
Use these ideas to code up a solution, and let us know how you resolved the details.
EDIT
@Pierre D. advocates the use of locality sensitive hashing to avoid lost revenue due to {prefix1}{entity} and {prefix2}{entity} being far apart in a sorted list yet close together IRL and in hash space. Metaphone is one technique for inducing LSH collisions, but it seldom has any effect on initial letter / sort order. Das, et al., describe a MinHash technique in "Google News Personalization: Scalable Online Collaborative Filtering". Number of syllables in the business names you're examining will be significantly less than size of click-sets seen by Google. Hashes give binary results of hit or miss, rather than a smooth ranking or distance metric. The need to promote neighbor collisions, while steering clear of the birthday paradox, makes it difficult to recommend tuning parameters absent a snapshot of the production dataset. Still, it is a technique you should keep in mind.
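To make the LSH idea concrete, here is a toy character-3-gram MinHash; everything in it (shingle size, 16 signature slots, md5 as the hash family) is an illustrative assumption, not a tuned scheme:

```python
import hashlib
from itertools import combinations

def shingles(s, k=3):
    """Lower-cased character k-grams of a string."""
    s = s.lower()
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}

def minhash(sh, num_hashes=16):
    # For each of `num_hashes` seeded hash functions, keep the
    # minimum hash value over all shingles of the string.
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f'{seed}:{g}'.encode()).hexdigest(), 16)
            for g in sh))
    return tuple(sig)

names = ['Amazon', 'Walmart', 'Apple', 'Amazon.com', 'Walmart Inc.']
sigs = [minhash(shingles(n)) for n in names]
for i, j in combinations(range(len(names)), 2):
    # Fraction of matching signature slots estimates Jaccard similarity
    # of the shingle sets; similar names collide in many slots.
    est = sum(a == b for a, b in zip(sigs[i], sigs[j])) / len(sigs[i])
    print(names[i], names[j], est)
```

Names with overlapping shingles can then be bucketed by signature bands so that only likely matches are compared exactly, regardless of where they fall in the sort order.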
Asking spacy to do a named entity recognition pre-processing pass over your data might also be worth your while, in the hopes of normalizing or discarding any troublesome "noise" prefixes.
Upvotes: 0
Reputation: 36390
You are using the .append method of list; according to the PythonWiki that method is O(1), but "individual actions may take surprisingly long, depending on the history of the container". You might use collections.deque, which doesn't have such quirks: just add import collections and do

indici1 = collections.deque()
indici2 = collections.deque()
...
indici = [list(indici1), list(indici2)]

If that doesn't help enough, we would need to see the similar function to look for possible improvements.
Upvotes: 1