My Work
My Work

Reputation: 2508

Pandas find difference in counts of cell containing certain string

If I have a dataset, like this:

import pandas as pd

df_fruit = pd.DataFrame({
    "fruit1": ['apple pie','orange','banana pie','mango','grape juice','strawberry milkshake'],
    "fruit2": ['apple','orange juice','banana','mango pie','grapes','strawberry'],
    "fruit3": ['apple juice','orange pie','banana juice','mango','grapes','strawberry pie']
})

and now I'd like to know what the difference in pie counts is in each column (ie. by how many pies is one of the column richer), I can do this:

df_fruit['fruit1'].str.contains('pie').sum()-df_fruit['fruit2'].str.contains('pie').sum()
# 1

That is nice but readability counts (eg. imagine checking >3 columns). Is there a way how to find the relevant rows and print a) same as I did just without the need to access the columns separately, b) the pairwise difference in count of cells with a given string for given columns (eg. fruit1-fruit2 = 1, fruit1-fruit3=...)?

Thanks.

Upvotes: 0

Views: 39

Answers (1)

Sayandip Dutta
Sayandip Dutta

Reputation: 15872

You can calculate pie count per column, and then use pandas.DataFrame.eval to do different operations:

>>> pie_count = df_fruit.apply(lambda x: x.str.contains('pie')).sum().to_frame().T
>>> pie_count
   fruit1  fruit2  fruit3
0       2       1       2

>>> pie_count.eval('fruit1 - fruit2').item()
1

>>> pie_count.eval('fruit1 - fruit3').item()
1

If you are in python 3.8+:

>>> from itertools import permutations, combinations
>>> {(expr:= f"{c1} - {c2}"): 
      pie_count.eval(expr).item() for c1, c2 in permutations(pie_count.columns, r=2)}

{'fruit1 - fruit2': 1,
 'fruit1 - fruit3': 0,
 'fruit2 - fruit1': -1,
 'fruit2 - fruit3': -1,
 'fruit3 - fruit1': 0,
 'fruit3 - fruit2': 1}

>>> {(expr:= f"{c1} - {c2}"): 
      pie_count.eval(expr).item() for c1, c2 in combinations(pie_count.columns, r=2)}
{'fruit1 - fruit2': 1, 'fruit1 - fruit3': 0, 'fruit2 - fruit3': -1}

To just get fruit1-fruit2 and fruit2-fruit3:

>>> pie_count = df_fruit.apply(lambda x: x.str.contains('pie')).sum()
>>> pie_count['fruit1'] - pie_count.drop('fruit1')
# Or
>>> pie_count.drop('fruit1').rsub(pie_count['fruit1'])
fruit2    1
fruit3    0
dtype: int64

Upvotes: 1

Related Questions