Amine D. Ben Moussa

Reputation: 13

pandas sum data different result

I am working on some pandas exercises from Kaggle. I tried to solve one of them, but I don't understand why my result differs from the expected one.

Question:

There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series descriptor_counts counting how many times each of these two words appears in the description column in the dataset. (For simplicity, let's ignore the capitalized versions of these words.)

My answer:

tropical_count = reviews["description"].str.count(pat="tropical").sum()
fruity_count = reviews["description"].str.count(pat="fruity").sum()

descriptor_counts = pd.Series({"tropical": tropical_count, "fruity": fruity_count}, index=["tropical", "fruity"])

Kaggle answer:

n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])

Both versions run without errors, but the results are different. Does anyone know why?

My result

tropical    3703
fruity      9259
dtype: int64

Kaggle result

tropical    3607
fruity      9090
dtype: int64

Upvotes: 1

Views: 632

Answers (2)

jezrael

Reputation: 863451

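The output is expected: str.count counts every occurrence of the substring in each string, whereas the in operator only tests whether the substring is present, so its output is just True or False. When you sum booleans, True is treated as 1 and False as 0. A description that contains the word more than once therefore adds 2 or more to your total but only 1 to Kaggle's, which is why the results differ.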

Sample:

import pandas as pd

reviews = pd.DataFrame(["Ttropical are tropical so fruity words you can",
                        "fruity ",
                        "fruity fruity",
                        "anythi"], columns=['description'])

tropical_count = reviews["description"].str.count(pat="tropical")
fruity_count = reviews["description"].str.count(pat="fruity")
print (tropical_count)
0    2
1    0
2    0
3    0
Name: description, dtype: int64
print (fruity_count)
0    1
1    1
2    2
3    0
Name: description, dtype: int64

n_trop = reviews.description.map(lambda desc: "tropical" in desc)
n_fruity = reviews.description.map(lambda desc: "fruity" in desc)
print (n_trop)
0     True
1    False
2    False
3    False
Name: description, dtype: bool

print (n_fruity)
0     True
1     True
2     True
3    False
Name: description, dtype: bool
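
If the goal is to reproduce Kaggle's number with the string accessor rather than map, one option is str.contains, which returns one boolean per row. The following is a minimal sketch on the toy data from the sample above (not the real Kaggle dataset); the variable names occurrences, rows_with_word and rows_via_in are made up for illustration.

```python
import pandas as pd

# Toy data copied from the sample above; not the real Kaggle dataset.
reviews = pd.DataFrame(
    {
        "description": [
            "Ttropical are tropical so fruity words you can",
            "fruity ",
            "fruity fruity",
            "anythi",
        ]
    }
)

# Total occurrences: a row that mentions the word twice contributes 2.
occurrences = reviews["description"].str.count("fruity").sum()

# Rows containing the word at least once: each row contributes at most 1.
# regex=False makes this a plain substring test, like the `in` operator.
rows_with_word = reviews["description"].str.contains("fruity", regex=False).sum()

# The map/lambda version from the Kaggle answer, for comparison.
rows_via_in = reviews["description"].map(lambda desc: "fruity" in desc).sum()

print(occurrences, rows_with_word, rows_via_in)  # 4 3 3
```

On this data the str.contains total matches the in total, while str.count is higher because of the "fruity fruity" row.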

Upvotes: 1

Ziur Olpa

Reputation: 2133

str.count(pat=...) counts the number of times the pattern occurs in each string, so a single row can add 2 (or more) to the sum. "tropical" in desc only evaluates to True or False, so each row counts once at most, even if the word is repeated.

For instance, this DataFrame with two entries sums to 3 under the "count" construct:

df = pd.DataFrame({'name':['tropical','tropicaltropical']})
df.name.str.count(pat ="tropical").sum()

The "in" construct will sum to only 2, one per row.
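
Both totals can be checked side by side on the same two-entry DataFrame. This is only a sketch; count_total, in_total and capped_total are illustrative names, and capping the per-row count with clip is just one possible way to turn occurrence counts into "present or not".

```python
import pandas as pd

df = pd.DataFrame({"name": ["tropical", "tropicaltropical"]})

# str.count: 1 occurrence in the first row + 2 in the second.
count_total = df["name"].str.count("tropical").sum()

# `in`: True + True, summed as 1 + 1.
in_total = df["name"].map(lambda s: "tropical" in s).sum()

# Capping each row's count at 1 makes str.count agree with `in`.
capped_total = df["name"].str.count("tropical").clip(upper=1).sum()

print(count_total, in_total, capped_total)  # 3 2 2
```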

Upvotes: 1
