pc_pyr
pc_pyr

Reputation: 632

List of list comparison with Dataframe columns (more than one column)

I have a list of list like:

l=[['A,B,C,D', 'Z', 'category1'],
 ['A,B,C', 'Z', 'category2'],
 ['E,F,G,H', 'T', 'category4'],
 ['A,E,H', 'T', 'category4']]

and a DataFrame like:

Col1       Col2       Col3
A          Z          category1
A,B,C,D    Z          category1
E,F,G,H,I  T          category4
A,E,H      T          category4 
A,E,H      T          category5
A,E,H      U          category4   

So I want to compare the list of list with the dataframe columns (taking all 3 columns into consideration). Comparison with loop is not working efficiently, is there a way I can make the comparison and return only matching rows from dataframe ?

Desired Output DataFrame:

Col1       Col2       Col3
A,B,C,D    Z          category1
A,E,H      T          category4 

Thanks!!

Upvotes: 0

Views: 66

Answers (3)

0buz
0buz

Reputation: 3503

One way is using pandas.Series.duplicated method:

#df = <your DataFrame defined here>

l=[['A,B,C,D', 'Z', 'category1'],
 ['A,B,C', 'Z', 'category2'],
 ['E,F,G,H', 'T', 'category4'],
 ['A,E,H', 'T', 'category4']]

listdf=pd.DataFrame(data = l, columns=df.columns)   #transform list to dataframe
df=df.append(listdf)  #append your list to the main dataframe

duplicates=df.duplicated() # find duplicate rows; returns series of booleans

print(df[duplicates]) # print duplicate rows

Output:

#      Col1 Col2       Col3
#0  A,B,C,D    Z  category1
#3    A,E,H    T  category4

Upvotes: 1

jezrael
jezrael

Reputation: 862681

Create DataFrame from list and then use DataFrame.merge with default inner join, if no parameter on it merge by all columns in both DataFrames:

l=[['A,B,C,D', 'Z', 'category1'],
 ['A,B,C', 'Z', 'category2'],
 ['E,F,G,H', 'T', 'category4'],
 ['A,E,H', 'T', 'category4']]

df = pd.DataFrame(l, columns=['Col1','Col2','Col3']).merge(df)
print (df)
      Col1 Col2       Col3
0  A,B,C,D    Z  category1
1    A,E,H    T  category4

Upvotes: 1

tianlinhe
tianlinhe

Reputation: 989

If you look for a exact match of all 3 columns, why not try to concatenate the three columns to a fourth column:

df['col4']=df['col1']+df['col2']+df['col3']

which returns:

        col1 col2       col3                 col4
0          A    Z  category1          AZcategory1
1    A,B,C,D    Z  category1    A,B,C,DZcategory1
2  E,F,G,H,I    T  category4  E,F,G,H,ITcategory4
3      A,E,H    T  category4      A,E,HTcategory4
4      A,E,H    T  category5      A,E,HTcategory5
5      A,E,H    U  category4      A,E,HUcategory4

And you can compare 'col4' with the concatenated strings in your list list_join.

list_join=[''.join(item) for item in l]

which looks like:

['A,B,C,DZcategory1', 'A,B,CZcategory2', 'E,F,G,HTcategory4', 'A,E,HTcategory4']

And then merge based on the common values:

common =df[df['col4'].isin(list_join)]

The advantage is that you only have to search in a single column instead of all three, which expediates your function a lot. The disadvantage would be allows less flexibility.

Upvotes: 1

Related Questions