forgot_my_towel

Reputation: 33

Iterate over groups of rows in Pandas

I am new to pandas and python in general - grateful for any direction you can provide!

I have a CSV file with 4 columns. I am trying to group together rows where the first three columns match across rows (Column A Row 1 = Column A Row 2, Column B Row 1 = Column B Row 2, and so on).

My data look like this:

   phone_number  state   date         description
1  9991112222    NJ      2015-05-14   Condo
2  9991112222    NJ      2015-05-14   Condo sales call
3  9991112222    NJ      2015-05-14   Apartment rental
4  6668885555    CA      2015-05-06   Apartment
5  6668885555    CA      2015-05-06   Apartment rental
6  4443337777    NJ      2015-05-14   condo

So in this data, rows 1, 2 and 3 would be in one group, and rows 4 and 5 would be in another group. Row 6 would not be in the group with 1, 2, and 3 because it has a different phone_number.

Then, for each row, I want to compare the string in the description column against each other description in that group using Levenshtein distance, and keep the rows where the descriptions are sufficiently similar.

"Condo" from row 1 would be compared to "Condo sales call" from row 2 and to "Apartment rental" in row 3. It would not be compared to "condo" from row 6.

In the end, the goal is to weed out rows where the description is not sufficiently similar to another description in the same group. Phrased differently, to print out all rows where description is at least somewhat similar to another (any other) description in that group. Ideal output:

   phone_number  state   date         description
1  9991112222    NJ      2015-05-14   Condo
2  9991112222    NJ      2015-05-14   Condo sales call
4  6668885555    CA      2015-05-06   Apartment
5  6668885555    CA      2015-05-06   Apartment rental

Row 6 does not print because it was never in a group. Row 3 doesn't print because "Apartment rental" is insufficiently similar to either "Condo" or "Condo sales call".

This is the code I have so far. I can't tell if this is the best way to do it. And if I have done it right so far, I can't figure out how to print the full row of interest:

import Levenshtein
import itertools 
import pandas as pd

test_data = pd.DataFrame.from_csv('phone_state_etc_test.csv', index_col=None)

for pn in test_data['phone_number']:
    for dt in test_data['date']:
        for st in test_data['state']:
            for a, b in itertools.combinations(test_data[
                                                     (test_data['phone_number'] == pn) & 
                                                     (test_data['state'] == st) & 
                                                     (test_data['date'] == dt)
                                                    ]
                                                     ['description'], 2):
                if Levenshtein.ratio(a,b) > 0.35:
                    print pn, "|", dt, "|", st, "|" #description

This prints a bunch of duplicates of these lines:

9991112222 | NJ | 2015-05-14 |
6668885555 | CA | 2015-05-06 |

But if I add description to the end of the print line, I get a

SyntaxError: invalid syntax 

Any thoughts on how I can print the full row? Whether in pandas dataframe, or some other format, doesn't matter - I just need to output to csv.

Upvotes: 3

Views: 4718

Answers (3)

JoeCondron

Reputation: 8906

It seems from the data provided that you want to keep rows for which the first word in the description matches the most common first word for that group. If that's the case, you can do this:

test_data['description_root'] = test_data['description'].str.split().str[0]
# this adds a column with the first word from the description column

grouped = test_data.groupby(['phone_number', 'state', 'date'])
most_frequent_root = grouped.description_root.transform(
          lambda s: s.value_counts().idxmax())

# this is a series with the same index as the original df containing 
# the most frequently occurring root for each group

test_data[test_data.description_root == most_frequent_root]
# this will give you the matching rows

You could also call .describe on grouped to give some additional information for each group. Sorry if this is off topic, but I think you might well find the Series string methods (.str) and groupby useful.
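
For reference, a runnable sketch of this first-word approach, with the question's sample data built inline instead of read from the CSV:

```python
import pandas as pd

# Sample data from the question, built inline rather than read from CSV
test_data = pd.DataFrame({
    'phone_number': ['9991112222', '9991112222', '9991112222',
                     '6668885555', '6668885555', '4443337777'],
    'state': ['NJ', 'NJ', 'NJ', 'CA', 'CA', 'NJ'],
    'date': ['2015-05-14', '2015-05-14', '2015-05-14',
             '2015-05-06', '2015-05-06', '2015-05-14'],
    'description': ['Condo', 'Condo sales call', 'Apartment rental',
                    'Apartment', 'Apartment rental', 'condo'],
})

# First word of each description, via the .str accessor
test_data['description_root'] = test_data['description'].str.split().str[0]

grouped = test_data.groupby(['phone_number', 'state', 'date'])
# Most frequent root per group, broadcast back to the original index
most_frequent_root = grouped.description_root.transform(
    lambda s: s.value_counts().idxmax())

result = test_data[test_data.description_root == most_frequent_root]
print(result)
```

Note that this keeps row 6 ("condo"): in a single-row group the lone root is trivially the most frequent, so reproducing the question's ideal output would additionally need a minimum group size check (e.g. grouped.filter(lambda g: len(g) > 1)).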

Upvotes: 1

fixxxer

Reputation: 16144

Why not use pandas groupby to find the unique groups (based on phone_number, state and date)? Doing this lets you treat all the description values in a group separately and do whatever you want to do with them.

For example, I'll group by the columns mentioned above and get the unique values of the description column within each group:

In [49]: df.groupby(['phone_number','state','date']).apply(lambda v: v['description'].unique())
Out[49]: 
phone_number  state  date      
4443337777    NJ     2015-05-14                                        [condo]
6668885555    CA     2015-05-06                  [Apartment, Apartment-rental]
9991112222    NJ     2015-05-14    [Condo, Condo-sales-call, Apartment-rental]
dtype: object

You can use any function within the apply. More examples here: http://pandas.pydata.org/pandas-docs/stable/groupby.html
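
The function passed to apply can just as well return a filtered sub-frame, which is closer to what the question ultimately wants. A rough sketch, using difflib.SequenceMatcher from the standard library as a stand-in for Levenshtein.ratio (both return a 0-1 similarity score):

```python
import difflib
import pandas as pd

# Sample data from the question, built inline
df = pd.DataFrame({
    'phone_number': ['9991112222', '9991112222', '9991112222',
                     '6668885555', '6668885555', '4443337777'],
    'state': ['NJ', 'NJ', 'NJ', 'CA', 'CA', 'NJ'],
    'date': ['2015-05-14', '2015-05-14', '2015-05-14',
             '2015-05-06', '2015-05-06', '2015-05-14'],
    'description': ['Condo', 'Condo sales call', 'Apartment rental',
                    'Apartment', 'Apartment rental', 'condo'],
})

def keep_similar(group, threshold=0.35):
    """Keep rows whose description is similar to at least one
    OTHER description in the same group."""
    descs = group['description'].tolist()
    mask = [any(difflib.SequenceMatcher(None, d, other).ratio() > threshold
                for j, other in enumerate(descs) if j != i)
            for i, d in enumerate(descs)]
    return group[mask]

# group_keys=False keeps the original row index on the result
filtered = (df.groupby(['phone_number', 'state', 'date'], group_keys=False)
              .apply(keep_similar))
print(filtered)
```

On this data the result matches the ideal output in the question: rows 1, 2, 4 and 5 survive, while row 3 ("Apartment rental" in the first group) and the singleton row 6 are dropped.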

Upvotes: 4

Marius

Reputation: 60070

I'm not entirely sure of the best way to do a calculation over all pairs of values in pandas. Here I've made a matrix with the descriptions as both the rows and the columns (so the main diagonal of the matrix compares each description with itself), but it doesn't seem entirely idiomatic:

def find_similar_rows(group, threshold=0.35):
    sim_matrix = pd.DataFrame(index=group['description'], 
                              columns=group['description'])
    for d1 in sim_matrix.index:
        for d2 in sim_matrix.columns:
            # Leave diagonal entries as nan
            if d1 != d2:
                sim_matrix.loc[d1, d2] = Levenshtein.ratio(d1, d2)

    keep = sim_matrix.gt(threshold, axis='columns').any()
    # A bit of possibly unnecessary mucking around with the index
    #   here, could probably be cleaned up
    rows_to_keep = group.loc[keep[group['description']].tolist(), :]
    return rows_to_keep

grouped = test_data.groupby('phone_number', group_keys=False)

grouped.apply(find_similar_rows)
Out[64]: 
   phone_number state        date       description
4    6668885555    CA  2015-05-06         Apartment
5    6668885555    CA  2015-05-06  Apartment rental
1    9991112222    NJ  2015-05-14             Condo
2    9991112222    NJ  2015-05-14  Condo sales call
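
One possible cleanup is to build the mask positionally instead of indexing by description; that also avoids a corner case in the matrix version, which can misbehave when a group contains two rows with identical descriptions (the index labels collide and the d1 != d2 check skips the pair). A sketch of that variant, with difflib.SequenceMatcher from the standard library standing in for Levenshtein.ratio:

```python
import difflib
import numpy as np
import pandas as pd

def find_similar_rows(group, threshold=0.35):
    # Pairwise similarity matrix, positional rather than label-based,
    # so duplicate descriptions within a group are handled correctly
    descs = group['description'].to_numpy()
    n = len(descs)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            r = difflib.SequenceMatcher(None, descs[i], descs[j]).ratio()
            sim[i, j] = sim[j, i] = r  # symmetric; diagonal stays 0
    # Keep rows similar to at least one other row in the group
    return group[(sim > threshold).any(axis=1)]

# Sample data from the question, built inline
test_data = pd.DataFrame({
    'phone_number': ['9991112222', '9991112222', '9991112222',
                     '6668885555', '6668885555', '4443337777'],
    'state': ['NJ', 'NJ', 'NJ', 'CA', 'CA', 'NJ'],
    'date': ['2015-05-14', '2015-05-14', '2015-05-14',
             '2015-05-06', '2015-05-06', '2015-05-14'],
    'description': ['Condo', 'Condo sales call', 'Apartment rental',
                    'Apartment', 'Apartment rental', 'condo'],
})

result = test_data.groupby('phone_number', group_keys=False).apply(find_similar_rows)
print(result)
```

On the sample data this keeps the same four rows as the matrix version.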

Upvotes: 1
