Reputation: 632
I have a list of list like:
l=[['A,B,C,D', 'Z', 'category1'],
['A,B,C', 'Z', 'category2'],
['E,F,G,H', 'T', 'category4'],
['A,E,H', 'T', 'category4']]
and a DataFrame like:
Col1 Col2 Col3
A Z category1
A,B,C,D Z category1
E,F,G,H,I T category4
A,E,H T category4
A,E,H T category5
A,E,H U category4
So I want to compare the list of list with the dataframe columns (taking all 3 columns into consideration). Comparison with loop is not working efficiently, is there a way I can make the comparison and return only matching rows from dataframe ?
Desired Output DataFrame:
Col1 Col2 Col3
A,B,C,D Z category1
A,E,H T category4
Thanks!!
Upvotes: 0
Views: 66
Reputation: 3503
One way is using pandas.Series.duplicated method:
#df = <your DataFrame defined here>
l=[['A,B,C,D', 'Z', 'category1'],
['A,B,C', 'Z', 'category2'],
['E,F,G,H', 'T', 'category4'],
['A,E,H', 'T', 'category4']]
listdf=pd.DataFrame(data = l, columns=df.columns) #transform list to dataframe
df=df.append(listdf) #append your list to the main dataframe
duplicates=df.duplicated() # find duplicate rows; returns series of booleans
print(df[duplicates]) # print duplicate rows
Output:
# Col1 Col2 Col3
#0 A,B,C,D Z category1
#3 A,E,H T category4
Upvotes: 1
Reputation: 862681
Create DataFrame
from list and then use DataFrame.merge
with default inner join, if no parameter on
it merge by all columns in both DataFrame
s:
l=[['A,B,C,D', 'Z', 'category1'],
['A,B,C', 'Z', 'category2'],
['E,F,G,H', 'T', 'category4'],
['A,E,H', 'T', 'category4']]
df = pd.DataFrame(l, columns=['Col1','Col2','Col3']).merge(df)
print (df)
Col1 Col2 Col3
0 A,B,C,D Z category1
1 A,E,H T category4
Upvotes: 1
Reputation: 989
If you look for a exact match of all 3 columns, why not try to concatenate the three columns to a fourth column:
df['col4']=df['col1']+df['col2']+df['col3']
which returns:
col1 col2 col3 col4
0 A Z category1 AZcategory1
1 A,B,C,D Z category1 A,B,C,DZcategory1
2 E,F,G,H,I T category4 E,F,G,H,ITcategory4
3 A,E,H T category4 A,E,HTcategory4
4 A,E,H T category5 A,E,HTcategory5
5 A,E,H U category4 A,E,HUcategory4
And you can compare 'col4' with the concatenated strings in your list list_join
.
list_join=[''.join(item) for item in l]
which looks like:
['A,B,C,DZcategory1', 'A,B,CZcategory2', 'E,F,G,HTcategory4', 'A,E,HTcategory4']
And then merge based on the common values:
common =df[df['col4'].isin(list_join)]
The advantage is that you only have to search in a single column instead of all three, which expediates your function a lot. The disadvantage would be allows less flexibility.
Upvotes: 1