watchtower
watchtower

Reputation: 4298

Row-wise difference in two list in pandas

I am using pandas to incrementally find out new elements i.e. for every row, I'd see whether values in list have been seen before. If they are, we will ignore them. If not, we will select them.

I was able to do this using row.iterrows(), but I have >1M rows, so I believe vectorized apply might be better.

Here's sample data and code. Once you run this code, you will get expected output:

from numpy import nan as NA
import collections

df = pd.DataFrame({'ID':['A','B','C','A','B','A','A','A','D','E','E','E'],
                   'Value': [1,2,3,4,3,5,2,3,7,2,3,9]})
#wrap all elements by group in a list
Changed_df=df.groupby('ID')['Value'].apply(list).reset_index() 
Changed_df=Changed_df.rename(columns={'Value' : 'Elements'})
Changed_df=Changed_df.reset_index(drop=True)



def flatten(l):
    for el in l:
        if isinstance(el, collections.Iterable) and not isinstance(el, (str, bytes)):
            yield from flatten(el)
        else:
            yield el

Changed_df["Elements_s"]=Changed_df['Elements'].shift()

#attempt 1: For loop
Changed_df["Diff"]=NA
Changed_df["count"]=0
Elements_so_far = []

#replace NA with empty list in columns that will go through list operations
for col in ["Elements","Elements_s","Diff"]:
    Changed_df[col] = Changed_df[col].apply(lambda d: d if isinstance(d, list) else [])

for idx,row in Changed_df.iterrows():
    diff = list(set(row['Elements']) - set(Elements_so_far))
    Changed_df.at[idx, "Diff"] = diff
    Elements_so_far.append(row['Elements'])
    Elements_so_far = flatten(Elements_so_far)
    Elements_so_far = list(set(Elements_so_far)) #keep unique elements
    Changed_df.loc[idx,"count"]=diff.__len__()

Commentary about the code:

I'd appreciate if an expert could help me with a vectorized version of the code.


I did try the vectorized version, but I couldn't go too far.

#attempt 2:
Changed_df.apply(lambda x: [i for i in x['Elements'] if i in x['Elements_s']], axis=1)

I was inspired from How to compare two columns both with list of strings and create a new column with unique items? to do above, but I couldn't do it. The linked SO thread does row-wise difference among columns.

I am using Python 3.6.7 by Anaconda. Pandas version is 0.23.4

Upvotes: 1

Views: 823

Answers (2)

Abhi
Abhi

Reputation: 4233

One alternative using drop duplicates and groupby

# Groupby and apply list func.
df1 = df.groupby('ID')['Value'].apply(list).to_frame('Elements')

# Sort values , drop duplicates by Value column then use groupby.
df1['Diff'] = df.sort_values(['ID','Value']).drop_duplicates('Value').groupby('ID')['Value'].apply(list)

# Use str.len for count.
df1['Count'] = df1['Diff'].str.len().fillna(0).astype(int)

# To fill NaN with empty list
df1['Diff'] = df1.Diff.apply(lambda x: x if type(x)==list else []) 


           Elements             Diff   Count
ID               
A   [1, 4, 5, 2, 3]  [1, 2, 3, 4, 5]     5
B            [2, 3]               []     0
C               [3]               []     0
D               [7]              [7]     1
E         [2, 3, 9]              [9]     1

Upvotes: 1

AChampion
AChampion

Reputation: 30258

You could using sort and then use numpy to get the unique indexes and then construct your groupings, e.g.:

In []:
df = df.sort_values(by='ID').reset_index(drop=True)
_, i = np.unique(df.Value.values, return_index=True)
df.iloc[i].groupby(df.ID).Value.apply(list)

Out[]:
ID
A    [1, 2, 3, 4, 5]
D                [7]
E                [9]
Name: Value, dtype: object

Or to get close to your current output:

In []:
df = df.sort_values(by='ID').reset_index(drop=True)
_, i = np.unique(df.Value.values, return_index=True)
s1 = df.groupby(df.ID).Value.apply(list).rename('Elements')
s2 = df.iloc[i].groupby(df.ID).Value.apply(list).rename('Diff').reindex(s1.index, fill_value=[])

pd.concat([s1, s2, s2.apply(len).rename('Count')], axis=1)

Out[]:
           Elements             Diff  Count
ID
A   [1, 4, 5, 2, 3]  [1, 2, 3, 4, 5]      5
B            [2, 3]               []      0
C               [3]               []      0
D               [7]              [7]      1
E         [2, 3, 9]              [9]      1

Upvotes: 2

Related Questions