nobodyAskedYouPatrice

Reputation: 131

How to compare columns in a pandas dataframe

I have a pandas dataframe that looks like this with "Word" as the column header for all the columns:

   Word    Word    Word    Word
0  Nap     Nap     Nap     Cat
1  Cat     Cat     Cat     Flower
2  Peace   Kick    Kick    Go
3  Phone   Fin     Fin     Nap

How can I return only the words that appear in all 4 columns?

Expected Output:

  Word
0 Nap
1 Cat

Upvotes: 1

Views: 215

Answers (2)

piRSquared

Reputation: 294488

  • Use apply(set) to turn each column into a set of words
  • Use set.intersection to find the words common to every column's set
  • Turn it into a list and then a series

pd.Series(list(set.intersection(*df.apply(set))))

0    Cat
1    Nap
dtype: object
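
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the frame from the question and applies the same set-intersection idea. It selects columns by position with iloc (my choice, not part of the one-liner above) so the repeated "Word" labels cannot get in the way, and it sorts the result because sets are unordered.

```python
import pandas as pd

# Rebuild the question's frame; every column is labelled "Word".
df = pd.DataFrame(
    [["Nap", "Nap", "Nap", "Cat"],
     ["Cat", "Cat", "Cat", "Flower"],
     ["Peace", "Kick", "Kick", "Go"],
     ["Phone", "Fin", "Fin", "Nap"]],
    columns=["Word"] * 4,
)

# One set of words per column, taken by position rather than by label.
column_sets = [set(df.iloc[:, i]) for i in range(df.shape[1])]

# Words present in every column's set.
common = set.intersection(*column_sets)

# Sets have no order, so sort to get a reproducible Series.
result = pd.Series(sorted(common), name="Word")
print(result)
```

The printed order can differ from the expected output in the question, since a set does not remember insertion order.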

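We can do the same thing with plain Python map calls instead of DataFrame.apply, which gives some performance benefit. Note that df[c] returns a single column only when the column labels are unique.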

pd.Series(list(
    set.intersection(*map(set, map(lambda c: df[c].values.tolist(), df)))
))

0    Cat
1    Nap
dtype: object

Timing
Code below.

[Figure: log-log plot of the timings collected by the code below]

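import numpy as np
import pandas as pd
from timeit import timeit

# `words` is not defined in the original answer. It needs at least 60000
# unique strings; the line below is only a stand-in so the script runs.
words = ['w{}'.format(k) for k in range(60000)]

pir1 = lambda d: pd.Series(list(set.intersection(*d.apply(set))))
pir2 = lambda d: pd.Series(list(set.intersection(*map(set, map(lambda c: d[c].values.tolist(), d)))))
# I took some liberties with @Anton vBR's solution.
vbr = lambda d: pd.Series((lambda x: x.index[x.values == len(d.columns)])(pd.Series(d.values.ravel()).value_counts()))

results = pd.DataFrame(
    index=pd.Index([10, 30, 100, 300, 1000, 3000, 10000, 30000]),
    columns='pir1 pir2 vbr'.split(),
    dtype=float  # numeric dtype so that plot() has data to draw
)

for i in results.index:
    # 4 columns of i unique words each, drawn from a pool of 2*i words
    d = pd.concat(dict(enumerate(
        [pd.Series(np.random.choice(words[:i*2], i, False)) for _ in range(4)]
    )), axis=1)
    for j in results.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        # DataFrame.set_value was removed in pandas 1.0; .at replaces it
        results.at[i, j] = timeit(stmt, setp, number=100)

results.plot(loglog=True)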

Upvotes: 2

Anton vBR

Reputation: 18916

Alternative solution (but this requires the values within each column to be unique, because it counts occurrences across the whole frame).

tf = df.stack().value_counts()
df2 = pd.DataFrame(pd.Series(tf)).reset_index()
df2.columns = ["word", "count"]

    word    count
0   Nap     4
1   Cat     4
2   Fin     2
3   Kick    2
4   Go      1
5   Phone   1
6   Peace   1
7   Flower  1

This can be filtered with df2[df2["count"] == len(df.columns)]["word"]

0    Nap
1    Cat
Name: word, dtype: object
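
To make the uniqueness caveat concrete, here is a small sketch on a made-up frame (not the data from the question) where "Nap" appears twice in the first column and is missing from the last. The plain count reports it as present everywhere; dropping duplicates per column before counting is one possible workaround, offered as an assumption rather than as part of the answer above.

```python
import pandas as pd

# Hypothetical data: "Nap" is repeated in column 0 and absent from column 3.
df = pd.DataFrame(
    [["Nap", "Nap", "Cat", "Cat"],
     ["Nap", "Cat", "Nap", "Go"],
     ["Cat", "Fin", "Kick", "Fin"]],
    columns=["Word"] * 4,
)
n_cols = df.shape[1]

# Plain count over every cell: "Nap" reaches 4 without being in all columns.
counts = pd.Series(df.to_numpy().ravel()).value_counts()
naive = sorted(counts[counts == n_cols].index)

# Count each word at most once per column by de-duplicating column by column.
per_column = pd.concat(
    [pd.Series(df.iloc[:, i].unique()) for i in range(n_cols)],
    ignore_index=True,
)
dedup_counts = per_column.value_counts()
robust = sorted(dedup_counts[dedup_counts == n_cols].index)

print(naive)   # includes the false positive
print(robust)  # only words found in every column
```

With unique values in every column, as in the question, both versions give the same answer.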

Upvotes: 1
