Reputation: 33
I am new to pandas and python in general - grateful for any direction you can provide!
I have a CSV file with 4 columns. I am trying to group together rows where the first three columns are identical across rows (Column A Row 1 = Column A Row 2, Column B Row 1 = Column B Row 2, and so on).
My data look like this:
phone_number state date description
1 9991112222 NJ 2015-05-14 Condo
2 9991112222 NJ 2015-05-14 Condo sales call
3 9991112222 NJ 2015-05-14 Apartment rental
4 6668885555 CA 2015-05-06 Apartment
5 6668885555 CA 2015-05-06 Apartment rental
6 4443337777 NJ 2015-05-14 condo
So in this data, rows 1, 2 and 3 would be in one group, and rows 4 and 5 would be in another group. Row 6 would not be in the group with 1, 2, and 3 because it has a different phone_number.
Then, for each row, I want to compare the string in the description column against each other description in that group using Levenshtein distance, and keep the rows where the descriptions are sufficiently similar.
"Condo" from row 1 would be compared to "Condo sales call" from row 2 and to "Apartment rental" in row 3. It would not be compared to "condo" from row 6.
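(To get a feel for the ratios involved, here is a small sketch using the standard library's difflib.SequenceMatcher as a stand-in for Levenshtein.ratio, so it runs without the extra package:)

```python
# Sketch only: difflib.SequenceMatcher gives a 0..1 similarity score
# comparable to Levenshtein.ratio, without needing the Levenshtein package.
from difflib import SequenceMatcher

def similarity(a, b):
    # Ratio of matching characters to total characters, in [0, 1]
    return SequenceMatcher(None, a, b).ratio()

print(similarity("Condo", "Condo sales call"))    # fairly similar
print(similarity("Condo", "Apartment rental"))    # very dissimilar
```

With a threshold around 0.35, the first pair would count as similar and the second would not.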
In the end, the goal is to weed out rows where the description is not sufficiently similar to another description in the same group. Phrased differently, to print out all rows where description is at least somewhat similar to another (any other) description in that group. Ideal output:
phone_number state date description
1 9991112222 NJ 2015-05-14 Condo
2 9991112222 NJ 2015-05-14 Condo sales call
4 6668885555 CA 2015-05-06 Apartment
5 6668885555 CA 2015-05-06 Apartment rental
Row 6 does not print because it was never in a group. Row 3 doesn't print because "Apartment rental" is insufficiently similar to either "Condo" or "Condo sales call".
This is the code I have so far. I can't tell if this is the best way to do it. And if I have done it right so far, I can't figure out how to print the full row of interest:
import Levenshtein
import itertools
import pandas as pd

test_data = pd.DataFrame.from_csv('phone_state_etc_test.csv', index_col=None)

for pn in test_data['phone_number']:
    for dt in test_data['date']:
        for st in test_data['state']:
            for a, b in itertools.combinations(test_data[
                        (test_data['phone_number'] == pn) &
                        (test_data['state'] == st) &
                        (test_data['date'] == dt)
                    ]['description'], 2):
                if Levenshtein.ratio(a, b) > 0.35:
                    print pn, "|", dt, "|", st, "|" #description
This prints a bunch of duplicates of these lines:
9991112222 | NJ | 2015-05-14 |
6668885555 | CA | 2015-05-06 |
But if I add description to the end of the print line, I get a
SyntaxError: invalid syntax
Any thoughts on how I can print the full row? Whether in pandas dataframe, or some other format, doesn't matter - I just need to output to csv.
Upvotes: 3
Views: 4718
Reputation: 8906
It seems from the data provided that you want to keep rows for which the first word in the description matches the most common first word for that group. If that's the case, you can do this:
test_data['description_root'] = test_data['description'].str.split().str[0]
# this adds a column with the first word from the description column
grouped = test_data.groupby(['phone_number', 'state', 'date'])
most_frequent_root = grouped.description_root.transform(
lambda s: s.value_counts().idxmax())
# this is a series with the same index as the original df containing
# the most frequently occurring root for each group
test_data[test_data.description_root == most_frequent_root]
# this will give you the matching rows
You could also call .describe on grouped to give some additional information for each group. Sorry if this is off topic, but I think you might well find the Series string methods (.str) and groupby useful.
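A runnable version of this idea on the question's sample data (hand-built here instead of reading the CSV; note that .str must be called on the description column, not on the whole frame) might look like:

```python
import pandas as pd

# Sample data from the question, built by hand rather than read from CSV.
test_data = pd.DataFrame({
    'phone_number': ['9991112222', '9991112222', '9991112222',
                     '6668885555', '6668885555', '4443337777'],
    'state': ['NJ', 'NJ', 'NJ', 'CA', 'CA', 'NJ'],
    'date': ['2015-05-14', '2015-05-14', '2015-05-14',
             '2015-05-06', '2015-05-06', '2015-05-14'],
    'description': ['Condo', 'Condo sales call', 'Apartment rental',
                    'Apartment', 'Apartment rental', 'condo'],
})

# First word of each description (.str lives on the column, not the frame)
test_data['description_root'] = test_data['description'].str.split().str[0]

grouped = test_data.groupby(['phone_number', 'state', 'date'])

# Broadcast each group's most common first word back to every row in the group
most_frequent_root = grouped.description_root.transform(
    lambda s: s.value_counts().idxmax())

# Keep rows whose first word matches the group's most common first word
result = test_data[test_data.description_root == most_frequent_root]
print(result)
```

Note that this heuristic keeps rows 1, 2, 4 and 5, but also keeps row 6: a one-row group trivially matches its own most common root, so it differs from the ideal output in the question on singleton groups.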
Upvotes: 1
Reputation: 16144
Why don't you use the pandas groupby option to find the unique groups (based on phone_number, state and date)? Doing this lets you treat all the description values separately and do whatever you want to do with them.
For example, I'll groupby with the above said columns and get the unique values for the description column within each group -
In [49]: df.groupby(['phone_number','state','date']).apply(lambda v: v['description'].unique())
Out[49]:
phone_number state date
4443337777 NJ 2015-05-14 [condo]
6668885555   CA     2015-05-06                  [Apartment, Apartment rental]
9991112222   NJ     2015-05-14    [Condo, Condo sales call, Apartment rental]
dtype: object
You can use any function within the apply. More examples here - http://pandas.pydata.org/pandas-docs/stable/groupby.html
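For instance (a minimal sketch on the question's sample data, hand-built here), swapping the lambda for a built-in aggregation such as nunique counts the distinct descriptions per group:

```python
import pandas as pd

# Sample data from the question (assumed column names).
df = pd.DataFrame({
    'phone_number': ['9991112222', '9991112222', '9991112222',
                     '6668885555', '6668885555', '4443337777'],
    'state': ['NJ', 'NJ', 'NJ', 'CA', 'CA', 'NJ'],
    'date': ['2015-05-14', '2015-05-14', '2015-05-14',
             '2015-05-06', '2015-05-06', '2015-05-14'],
    'description': ['Condo', 'Condo sales call', 'Apartment rental',
                    'Apartment', 'Apartment rental', 'condo'],
})

# Number of distinct descriptions in each (phone_number, state, date) group
counts = df.groupby(['phone_number', 'state', 'date'])['description'].nunique()
print(counts)
```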
Upvotes: 4
Reputation: 60070
I'm not entirely sure how best to do a calculation for all pairs of values in pandas - here I've made a matrix with the descriptions as both the rows and columns (so the main diagonal of the matrix compares each description with itself), but it doesn't seem entirely idiomatic:
def find_similar_rows(group, threshold=0.35):
    sim_matrix = pd.DataFrame(index=group['description'],
                              columns=group['description'])
    for d1 in sim_matrix.index:
        for d2 in sim_matrix.columns:
            # Leave diagonal entries as nan
            if d1 != d2:
                sim_matrix.loc[d1, d2] = Levenshtein.ratio(d1, d2)
    keep = sim_matrix.gt(threshold, axis='columns').any()
    # A bit of possibly unnecessary mucking around with the index
    # here, could probably be cleaned up
    rows_to_keep = group.loc[keep[group['description']].tolist(), :]
    return rows_to_keep
grouped = test_data.groupby('phone_number', group_keys=False)
grouped.apply(find_similar_rows)
Out[64]:
phone_number state date description
4 6668885555 CA 2015-05-06 Apartment
5 6668885555 CA 2015-05-06 Apartment rental
1 9991112222 NJ 2015-05-14 Condo
2 9991112222 NJ 2015-05-14 Condo sales call
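Since the question just needs the result as CSV, the frame returned by grouped.apply(find_similar_rows) can be written out with DataFrame.to_csv. A self-contained sketch (using an in-memory buffer so it runs anywhere; a hypothetical filename such as 'out.csv' works the same way):

```python
import io
import pandas as pd

# `result` stands in for the frame returned by grouped.apply(find_similar_rows).
result = pd.DataFrame({
    'phone_number': ['6668885555', '6668885555'],
    'state': ['CA', 'CA'],
    'date': ['2015-05-06', '2015-05-06'],
    'description': ['Apartment', 'Apartment rental'],
})

buf = io.StringIO()
result.to_csv(buf, index=False)   # or result.to_csv('out.csv', index=False)
print(buf.getvalue())
```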
Upvotes: 1