Row-wise difference in two list in pandas

Question

I am using pandas to incrementally find out new elements i.e. for every row, I'd see whether values in list have been seen before. If they are, we will ignore them. If not, we will select them.

I was able to do this using row.iterrows(), but I have >1M rows, so I believe vectorized apply might be better.

Here's sample data and code. Once you run this code, you will get expected output:

from numpy import nan as NA
import collections

df = pd.DataFrame({'ID':['A','B','C','A','B','A','A','A','D','E','E','E'],
                   'Value': [1,2,3,4,3,5,2,3,7,2,3,9]})
#wrap all elements by group in a list
Changed_df=df.groupby('ID')['Value'].apply(list).reset_index() 
Changed_df=Changed_df.rename(columns={'Value' : 'Elements'})
Changed_df=Changed_df.reset_index(drop=True)



def flatten(l):
    for el in l:
        if isinstance(el, collections.Iterable) and not isinstance(el, (str, bytes)):
            yield from flatten(el)
        else:
            yield el

Changed_df["Elements_s"]=Changed_df['Elements'].shift()

#attempt 1: For loop
Changed_df["Diff"]=NA
Changed_df["count"]=0
Elements_so_far = []

#replace NA with empty list in columns that will go through list operations
for col in ["Elements","Elements_s","Diff"]:
    Changed_df[col] = Changed_df[col].apply(lambda d: d if isinstance(d, list) else [])

for idx,row in Changed_df.iterrows():
    diff = list(set(row['Elements']) - set(Elements_so_far))
    Changed_df.at[idx, "Diff"] = diff
    Elements_so_far.append(row['Elements'])
    Elements_so_far = flatten(Elements_so_far)
    Elements_so_far = list(set(Elements_so_far)) #keep unique elements
    Changed_df.loc[idx,"count"]=diff.__len__()

Commentary about the code:

I am not a fan of this code because it's clunky and inefficient.
- I am saying inefficient because I have created Elements_s which holds shifted values. Another reason for inefficiency is for loop through rows.
Elements_so_far keeps track of all the elements we have discovered for every row. If there is a new element that shows up, we count that in Diff column.
We also keep track of the length of new elements discovered in count column.

I'd appreciate if an expert could help me with a vectorized version of the code.

I did try the vectorized version, but I couldn't go too far.

#attempt 2:
Changed_df.apply(lambda x: [i for i in x['Elements'] if i in x['Elements_s']], axis=1)

I was inspired from How to compare two columns both with list of strings and create a new column with unique items? to do above, but I couldn't do it. The linked SO thread does row-wise difference among columns.

I am using Python 3.6.7 by Anaconda. Pandas version is 0.23.4

AChampion · Accepted Answer

You could using sort and then use numpy to get the unique indexes and then construct your groupings, e.g.:

In []:
df = df.sort_values(by='ID').reset_index(drop=True)
_, i = np.unique(df.Value.values, return_index=True)
df.iloc[i].groupby(df.ID).Value.apply(list)

Out[]:
ID
A    [1, 2, 3, 4, 5]
D                [7]
E                [9]
Name: Value, dtype: object

Or to get close to your current output:

In []:
df = df.sort_values(by='ID').reset_index(drop=True)
_, i = np.unique(df.Value.values, return_index=True)
s1 = df.groupby(df.ID).Value.apply(list).rename('Elements')
s2 = df.iloc[i].groupby(df.ID).Value.apply(list).rename('Diff').reindex(s1.index, fill_value=[])

pd.concat([s1, s2, s2.apply(len).rename('Count')], axis=1)

Out[]:
           Elements             Diff  Count
ID
A   [1, 4, 5, 2, 3]  [1, 2, 3, 4, 5]      5
B            [2, 3]               []      0
C               [3]               []      0
D               [7]              [7]      1
E         [2, 3, 9]              [9]      1

Row-wise difference in two list in pandas

Answers (2)

Related Questions