Reputation: 55
I have a DataFrame with a species column and four arbitrary data columns. I want to group it by species and sum up the four data columns for each one. I've tried to do this in two ways: once by grouping by df.columns[0]
and once by grouping by df.iloc[:, 0]
.
data = {
'species': ['a', 'b', 'c', 'd', 'e', 'rt', 'gh', 'ed', 'e', 'd', 'd', 'q', 'ws', 'f', 'fg', 'a', 'a', 'a', 'a', 'a'],
's1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
's2': [9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9],
's3': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
's4': [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
}
df = pd.DataFrame(data)
grouped_df1 = df.groupby(df.columns[0], as_index=False).sum()
grouped_df2 = df.groupby(df.iloc[:, 0], as_index=False).sum()
Both methods correctly sum the data in the four rightmost columns. But for some reason, the second method also sums up the names of the species, concatenating them into one long, repeating string.
Here's the result from the first method, which is what I'm looking for:
print(grouped_df1)
species s1 s2 s3 s4
0 a 91 54 97 60
1 b 2 9 3 10
2 c 3 9 4 10
3 d 25 27 28 30
4 e 14 18 16 20
5 ed 8 9 9 10
6 f 14 9 15 10
7 fg 15 9 16 10
8 gh 7 9 8 10
9 q 12 9 13 10
10 rt 6 9 7 10
11 ws 13 9 14 10
And here's the result from the df.iloc
method, which incorrectly sums up the species
data:
print(grouped_df2)
species s1 s2 s3 s4
0 aaaaaa 91 54 97 60
1 b 2 9 3 10
2 c 3 9 4 10
3 ddd 25 27 28 30
4 ee 14 18 16 20
5 ed 8 9 9 10
6 f 14 9 15 10
7 fg 15 9 16 10
8 gh 7 9 8 10
9 q 12 9 13 10
10 rt 6 9 7 10
11 ws 13 9 14 10
Why is the second method summing up the species names as well as the numerical data?
Upvotes: 2
Views: 54
Reputation: 4251
In groupby
- column name is treated as an intrinsic grouping key, while a Series is treated as an external key.
Reference - https://pandas.pydata.org/docs/reference/groupby.html
When using df.iloc[:, 0]:
Pandas considers the string values in the species column as a separate grouping key independent of the DataFrame structure.
When using df.columns[0]:
Pandas directly uses the column 'species' within the DataFrame as the grouping key. This allows Pandas to manage the grouping and summation correctly.
You should always reference the column name explicitly
grouped_df1 = df.groupby('species', as_index=False).sum()
Or this also works
grouped_df1 = df.groupby(df[df.columns[0]], as_index=False).sum()
Upvotes: 0
Reputation: 5470
df.groupby(df.columns[0]...
correctly groups on the first column although it would be usual just to use df.groupby('species')...
. When you use df.groupby(df.iloc[:, 0]...
then this applies sum
to the content of the first column (ie. concatenates the String values) as well applying sum
to the other numerical columns .
If you try print(df.columns[0]
) and also print(df.iloc[:, 0])
then you will see that the first is the selected column name and the second is a Pandas Series with the values in the column.
Upvotes: 0