Reputation: 81
In a dataframe with a column named source, made of two different word lists:
source words letter_count
1 list1 apple 5
2 list1 pear 4
3 list1 banana 6
4 list2 ford 4
5 list2 chevy 5
6 list2 apple 5
7 list2 banana 6
I'm trying to return a new dataframe that shows the words that appear in both list1 and list2:
words letter_count
1 apple 5
2 banana 6
I'm using Python and pandas.
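For anyone who wants to reproduce the question, the frame above could be rebuilt roughly like this. This is a sketch reconstructed from the printed table; the 1-based index is an assumption made so the row labels match the ones shown.

```python
import pandas as pd

# Rebuild the example frame from the printed table in the question.
# The explicit 1..7 index is an assumption so labels match the post.
df = pd.DataFrame(
    {
        "source": ["list1"] * 3 + ["list2"] * 4,
        "words": ["apple", "pear", "banana", "ford", "chevy", "apple", "banana"],
        "letter_count": [5, 4, 6, 4, 5, 5, 6],
    },
    index=range(1, 8),
)
print(df)
```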
Upvotes: 0
Views: 84
Reputation: 8768
Here is a way to find whether the same word exists in both lists in the source column:
df.loc[df['words'].isin(set.intersection(*df.groupby('source')['words'].agg(set))),['words','letter_count']].drop_duplicates('words',keep='last')
or:
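cols = ['words', 'letter_count']
# True for rows that repeat an earlier (words, letter_count) pair
m1 = df.duplicated(cols)
# True for words that appear in every distinct source
m2 = df.groupby('words')['source'].transform('nunique').eq(df['source'].nunique())
df.loc[m1 & m2, cols]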
Output:
words letter_count
6 apple 5
7 banana 6
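The first one-liner in this answer is dense, so here is the same idea split into named steps. It is only a sketch: the frame is rebuilt from the question, the names words_by_source and common are made up for illustration, and apply(set) is used in place of agg(set) to collect one set of words per source.

```python
import pandas as pd

# Example frame rebuilt from the question (1-based index is an assumption).
df = pd.DataFrame(
    {
        "source": ["list1"] * 3 + ["list2"] * 4,
        "words": ["apple", "pear", "banana", "ford", "chevy", "apple", "banana"],
        "letter_count": [5, 4, 6, 4, 5, 5, 6],
    },
    index=range(1, 8),
)

# Step 1: one set of words per source (a Series indexed by 'list1', 'list2').
words_by_source = df.groupby("source")["words"].apply(set)

# Step 2: words present in every source.
common = set.intersection(*words_by_source)

# Step 3: keep rows whose word is common, then one row per word.
result = (
    df.loc[df["words"].isin(common), ["words", "letter_count"]]
    .drop_duplicates("words", keep="last")
)
print(result)
```

With keep='last' the surviving rows are the list2 ones (labels 6 and 7), which is why the output above shows those index values.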
Upvotes: 0
Reputation:
I think you're looking for pandas.Series.duplicated(). It returns a mask (a series of True/False values, one per row). By default the mask is True for every occurrence of a value after its first, and False for first occurrences and for values that occur only once. Then, you can index the dataframe with that mask:
new_df = df[df['words'].duplicated()].drop('source', axis=1)
Output:
>>> new_df
words letter_count
6 apple 5
7 banana 6
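One thing worth checking with this approach: duplicated() does not look at the source column, so a word repeated inside a single list would be flagged too. The sketch below uses a made-up frame (demo, with 'pear' repeated in list1 only) to show the difference, and adds one possible extra condition; it is an illustration, not part of the original answer.

```python
import pandas as pd

# Hypothetical data: 'pear' repeats inside list1 but never appears in list2.
demo = pd.DataFrame(
    {
        "source": ["list1", "list1", "list1", "list2"],
        "words": ["apple", "pear", "pear", "apple"],
    }
)

# duplicated() marks every occurrence after the first, whatever the source.
mask = demo["words"].duplicated()
print(demo[mask])  # includes the second 'pear' as well as 'apple'

# Extra condition: the word must be seen in more than one source.
in_both = demo.groupby("words")["source"].transform("nunique") > 1
print(demo[mask & in_both])  # only 'apple' remains
```

If each list is known to contain no repeats of its own, as in the question's data, the plain duplicated() mask gives the same rows.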
Upvotes: 1