Ray
Ray

Reputation: 55

Why does summing data grouped by df.iloc[:, 0] also sum up the column names?

I have a DataFrame with a species column and four arbitrary data columns. I want to group it by species and sum up the four data columns for each one. I've tried to do this in two ways: once by grouping by df.columns[0] and once by grouping by df.iloc[:, 0].

data = {
    'species': ['a', 'b', 'c', 'd', 'e', 'rt', 'gh', 'ed', 'e', 'd', 'd', 'q', 'ws', 'f', 'fg', 'a', 'a', 'a', 'a', 'a'],
    's1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
    's2': [9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9],
    's3': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
    's4': [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
}
df = pd.DataFrame(data)

grouped_df1 = df.groupby(df.columns[0], as_index=False).sum()
grouped_df2 = df.groupby(df.iloc[:, 0], as_index=False).sum()

Both methods correctly sum the data in the four rightmost columns. But for some reason, the second method also sums up the names of the species, concatenating them into one long, repeating string.

Here's the result from the first method, which is what I'm looking for:

print(grouped_df1)
   species  s1  s2  s3  s4
0        a  91  54  97  60
1        b   2   9   3  10
2        c   3   9   4  10
3        d  25  27  28  30
4        e  14  18  16  20
5       ed   8   9   9  10
6        f  14   9  15  10
7       fg  15   9  16  10
8       gh   7   9   8  10
9        q  12   9  13  10
10      rt   6   9   7  10
11      ws  13   9  14  10

And here's the result from the df.iloc method, which incorrectly sums up the species data:

print(grouped_df2)
   species  s1  s2  s3  s4
0   aaaaaa  91  54  97  60
1        b   2   9   3  10
2        c   3   9   4  10
3      ddd  25  27  28  30
4       ee  14  18  16  20
5       ed   8   9   9  10
6        f  14   9  15  10
7       fg  15   9  16  10
8       gh   7   9   8  10
9        q  12   9  13  10
10      rt   6   9   7  10
11      ws  13   9  14  10

Why is the second method summing up the species names as well as the numerical data?

Upvotes: 2

Views: 54

Answers (2)

Bhargav
Bhargav

Reputation: 4251

In groupby - column name is treated as an intrinsic grouping key, while a Series is treated as an external key.

Reference - https://pandas.pydata.org/docs/reference/groupby.html

When using df.iloc[:, 0]:

Pandas considers the string values in the species column as a separate grouping key independent of the DataFrame structure.

When using df.columns[0]:

Pandas directly uses the column 'species' within the DataFrame as the grouping key. This allows Pandas to manage the grouping and summation correctly.

Code COrrection

You should always reference the column name explicitly

grouped_df1 = df.groupby('species', as_index=False).sum()

Or this also works

grouped_df1 = df.groupby(df[df.columns[0]], as_index=False).sum()

Upvotes: 0

user19077881
user19077881

Reputation: 5470

df.groupby(df.columns[0]... correctly groups on the first column although it would be usual just to use df.groupby('species').... When you use df.groupby(df.iloc[:, 0]... then this applies sum to the content of the first column (ie. concatenates the String values) as well applying sum to the other numerical columns .

If you try print(df.columns[0]) and also print(df.iloc[:, 0]) then you will see that the first is the selected column name and the second is a Pandas Series with the values in the column.

Upvotes: 0

Related Questions