How to compare columns in a pandas dataframe

Question

I have a pandas dataframe that looks like this with "Word" as the column header for all the columns:

   Word    Word    Word    Word
0  Nap     Nap     Nap     Cat
1  Cat     Cat     Cat     Flower
2  Peace   Kick    Kick    Go
3  Phone   Fin     Fin     Nap

How can I return only the words that appear in all 4 columns?

Expected Output:

  Word
0 Nap
1 Cat

piRSquared · Accepted Answer

Use apply(set) to turn each column into a set of words
Use set.intersection to find the words common to every column's set
Turn the result into a list and then a Series

pd.Series(list(set.intersection(*df.apply(set))))

0    Cat
1    Nap
dtype: object
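
For anyone who wants to reproduce this end to end, here is a minimal sketch that rebuilds the sample frame from the question (four columns all labelled "Word") and intersects the columns by position. The helper name common_words is made up for illustration. It iterates with iloc rather than apply(set), on the assumption that positional access sidesteps the duplicate column labels and any differences between pandas versions in how apply handles set-valued results. It also keeps the first column's order, which matches the expected output in the question.

```python
import pandas as pd

# Sample data from the question; every column shares the label "Word".
df = pd.DataFrame(
    [["Nap", "Nap", "Nap", "Cat"],
     ["Cat", "Cat", "Cat", "Flower"],
     ["Peace", "Kick", "Kick", "Go"],
     ["Phone", "Fin", "Fin", "Nap"]],
    columns=["Word"] * 4,
)


def common_words(frame):
    """Hypothetical helper: words present in every column, in first-column order."""
    # One set per column, accessed by position because the labels are duplicated.
    column_sets = [set(frame.iloc[:, k]) for k in range(frame.shape[1])]
    common = set.intersection(*column_sets)
    first = frame.iloc[:, 0]
    # Filter the first column so the result keeps its original order.
    kept = first[first.isin(common)].drop_duplicates()
    return kept.reset_index(drop=True)


result = common_words(df)
print(result.tolist())  # ['Nap', 'Cat']
```

Sets are unordered, so the plain set.intersection result can come back as Cat, Nap; filtering the first column is one way to get Nap, Cat as asked.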

We can accomplish the same task with plain Python map calls instead of apply, which gives some performance benefit.

pd.Series(list(
    set.intersection(*map(set, map(lambda c: df[c].values.tolist(), df)))
))

0    Cat
1    Nap
dtype: object

Timing
Code Below

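from timeit import timeit

import numpy as np
import pandas as pd

# Assumption: the original `words` list is not shown. This placeholder only needs
# at least 60000 unique strings so that words[:i*2] covers the largest size.
words = ['w{}'.format(n) for n in range(60000)]

pir1 = lambda d: pd.Series(list(set.intersection(*d.apply(set))))
pir2 = lambda d: pd.Series(list(set.intersection(*map(set, map(lambda c: d[c].values.tolist(), d)))))
# I took some liberties with @Anton vBR's solution.
# It counts every word in the frame and keeps those counted once per column.
vbr = lambda d: pd.Series((lambda x: x.index[x.values == len(d.columns)])(pd.Series(d.values.ravel()).value_counts()))

results = pd.DataFrame(
    index=pd.Index([10, 30, 100, 300, 1000, 3000, 10000, 30000]),
    columns='pir1 pir2 vbr'.split(),
    dtype=float  # numeric dtype so the timings can be plotted
)

for i in results.index:
    # Four columns of i words each, sampled without replacement from 2*i words
    d = pd.concat(dict(enumerate(
        [pd.Series(np.random.choice(words[:i*2], i, False)) for _ in range(4)]
    )), axis=1)
    for j in results.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        # DataFrame.set_value was removed in pandas 1.0; .at replaces it
        results.at[i, j] = timeit(stmt, setp, number=100)

results.plot(loglog=True)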
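
The vbr lambda above is dense, so here is an unrolled sketch of the same counting idea: flatten the frame, count how often each word occurs, and keep the words whose count equals the number of columns. The function name words_by_count and the two small frames are illustrative, not from the original answers. The approach rests on one assumption: a word must not repeat within a column, otherwise its count can reach the column count without the word appearing in every column. That holds in the benchmark, where each column is sampled without replacement.

```python
import pandas as pd


def words_by_count(frame):
    """Illustrative helper: words whose total count equals the column count."""
    # Flatten all cells into one Series and count occurrences of each word.
    counts = pd.Series(frame.to_numpy().ravel()).value_counts()
    # A word seen once in every column is counted exactly n_columns times.
    n_columns = frame.shape[1]
    return sorted(counts[counts == n_columns].index.tolist())


ok = pd.DataFrame([["a", "a", "a"], ["b", "c", "a2"]])
print(words_by_count(ok))  # ['a']

# Counter-example: "x" fills the first column three times and never appears
# elsewhere, yet its count (3) equals the number of columns.
dup = pd.DataFrame([["x", "p", "q"], ["x", "r", "s"], ["x", "t", "u"]])
print(words_by_count(dup))  # ['x'] (a false positive)
```

The set-intersection versions do not have this limitation, since converting each column to a set discards repeats before comparing.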
