Reputation: 187
I concatenated 500 XLSX files into a single DataFrame with shape (672006, 12). Every process has a unique number, and I want to groupby() on that number to extract the relevant information per process. For temperature I would like to select the first value, and for height the most frequent value.
Test data:
import pandas as pd

df_test = pd.DataFrame({"number": [1,1,1,1,2,2,2,2,3,3,3,3],
                        'temperature': [2,3,4,5,4,3,4,5,5, 3, 4, 4],
                        'height': [100, 100, 0, 100, 100, 90, 90, 100, 100, 90, 80, 80]})
# First temperature per number
df_test.groupby('number')['temperature'].first()
# Most frequent height per number
df_test.groupby('number')['height'].agg(lambda x: x.value_counts().index[0])
On the full dataset, I get the following error when trying to get the most frequent height per number: IndexError: index 0 is out of bounds for axis 0 with size 0
Strangely enough, mean() / first() / max() etc. all work. And on the second part of the dataset, which I concatenated separately, the aggregation worked.
Can somebody suggest what to do with this error? Thanks!
Upvotes: 2
Views: 11991
Reputation: 153460
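I think the problem is that at least one of your groups contains only NaN heights. value_counts() skips NaN by default, so for such a group it returns an empty result and .index[0] fails.
See this example, where I added a number 4 whose heights are all np.nan.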
import numpy as np
import pandas as pd

df_test = pd.DataFrame({"number": [1,1,1,1,2,2,2,2,3,3,3,3,4,4],
                        'temperature': [2,3,4,5,4,3,4,5,5, 3, 4, 4, 5, 5],
                        'height': [100, 100, 0, 100, 100, 90, 90, 100, 100, 90, 80, 80, np.nan, np.nan]})
df_test.groupby('number')['temperature'].first()
# Raises IndexError: group 4 has no non-NaN heights to count
df_test.groupby('number')['height'].agg(lambda x: x.value_counts().index[0])
Output:
IndexError: index 0 is out of bounds for axis 0 with size 0
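To confirm this on the real data, something like the following could help. It is only a sketch on the test frame above: it first shows that value_counts() on an all-NaN Series is empty, then uses groupby().count() (which counts non-NaN values) to list the groups with no valid height. The names all_nan, n_valid and empty_groups are made up for illustration.

```python
import numpy as np
import pandas as pd

# value_counts() uses dropna=True by default, so NaN values are not counted
all_nan = pd.Series([np.nan, np.nan])
counts = all_nan.value_counts()
print(len(counts))  # an empty result means counts.index[0] raises IndexError

df_test = pd.DataFrame({"number": [1,1,1,1,2,2,2,2,3,3,3,3,4,4],
                        'temperature': [2,3,4,5,4,3,4,5,5, 3, 4, 4, 5, 5],
                        'height': [100, 100, 0, 100, 100, 90, 90, 100, 100, 90, 80, 80, np.nan, np.nan]})

# count() returns the number of non-NaN heights per group
n_valid = df_test.groupby('number')['height'].count()

# Groups with zero valid heights are the ones that trigger the error
empty_groups = n_valid[n_valid == 0].index.tolist()
print(empty_groups)
```

On the full dataset, the same check should point to the process numbers whose height column is entirely missing.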
Let's fill those NaN with zero and rerun.
df_test = pd.DataFrame({"number": [1,1,1,1,2,2,2,2,3,3,3,3,4,4],
'temperature': [2,3,4,5,4,3,4,5,5, 3, 4, 4, 5, 5],
'height': [100, 100, 0, 100, 100, 90, 90, 100, 100, 90, 80, 80, np.nan, np.nan]})
df_test = df_test.fillna(0)  # Add this line
df_test.groupby('number')['temperature'].first()
df_test.groupby('number')['height'].agg(lambda x: x.value_counts().index[0])
Output:
number
1 100.0
2 90.0
3 80.0
4 0.0
Name: height, dtype: float64
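Note that fillna(0) turns missing heights into a real value of 0, which can also change the most frequent value in groups that are only partly NaN. If that is not wanted, one possible alternative (a sketch, not the only way) is to leave the data untouched and guard the lambda so that an all-NaN group returns NaN. The helper name most_frequent is hypothetical.

```python
import numpy as np
import pandas as pd

df_test = pd.DataFrame({"number": [1,1,1,1,2,2,2,2,3,3,3,3,4,4],
                        'temperature': [2,3,4,5,4,3,4,5,5, 3, 4, 4, 5, 5],
                        'height': [100, 100, 0, 100, 100, 90, 90, 100, 100, 90, 80, 80, np.nan, np.nan]})

def most_frequent(s):
    """Return the most frequent non-NaN value, or NaN if there is none."""
    counts = s.value_counts()  # NaN values are dropped here
    return counts.index[0] if len(counts) > 0 else np.nan

result = df_test.groupby('number')['height'].agg(most_frequent)
print(result)
```

With this, number 4 comes out as NaN instead of 0. Where two values are equally frequent (number 2 has 100 and 90 twice each), which one is returned depends on how value_counts() orders ties, so do not rely on that result.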
Upvotes: 1