Reputation: 91
I'm having trouble summing the columns of a DataFrame that contains arrays in each cell.
I tried df.sum(), expecting to get the total array for each column, for example [4,1,1,4,1] for the column 'common'.
But I only got an empty Series:
df_sum = df.sum()
print(df_sum)
Series([], dtype: float64)
How can I get the summed column in this case?
Upvotes: 1
Views: 1954
Reputation: 13998
IIUC, you can probably just use a list comprehension to handle your task:
import numpy as np
import pandas as pd
df = pd.DataFrame({'d1':[np.nan, [1,2], [4]], 'd2':[[3], np.nan, np.nan]})
>>> df
d1 d2
0 NaN [3]
1 [1, 2] NaN
2 [4] NaN
df_sum = [i for a in df['d1'] if isinstance(a, list) for i in a]
>>> df_sum
[1, 2, 4]
If you need to do this over the whole DataFrame (or multiple columns), use numpy.ravel() to flatten the DataFrame's values before applying the list comprehension. The values are visited row by row.
df_sum = [i for a in np.ravel(df.values) if isinstance(a, list) for i in a]
>>> df_sum
[3, 1, 2, 4]
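The two snippets above can be wrapped in one small helper. This is only a sketch: the function name flatten_list_cells and its columns parameter are made up for illustration, and it assumes the cells are either Python lists or NaN, as in the example DataFrame.

```python
import numpy as np
import pandas as pd


def flatten_list_cells(frame, columns=None):
    """Collect the items of every list cell in the chosen columns.

    Hypothetical helper: cells that are not lists (e.g. NaN) are skipped.
    With several columns, cells are visited row by row.
    """
    cols = list(frame.columns) if columns is None else list(columns)
    # A 2D object array of cells, flattened to 1D in row-major order.
    cells = np.ravel(frame[cols].to_numpy())
    return [item for cell in cells if isinstance(cell, list) for item in cell]


df = pd.DataFrame({'d1': [np.nan, [1, 2], [4]], 'd2': [[3], np.nan, np.nan]})

print(flatten_list_cells(df, ['d1']))  # [1, 2, 4]
print(flatten_list_cells(df))          # [3, 1, 2, 4]
```

Using isinstance(cell, list) rather than a NaN check means any non-list cell is ignored, whatever its type.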
Upvotes: 0
Reputation: 59274
Well, working with object dtypes in pandas DataFrames is usually not a good idea, especially filling cells with Python lists, because you lose performance.
Nevertheless, you can accomplish this with itertools.chain.from_iterable (with itertools imported as it):
df.apply(lambda s: list(it.chain.from_iterable(s.dropna())))
You may also use sum, but I'd say it's slower:
df.apply(lambda s: s.dropna().sum())
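For reference, here is a self-contained sketch of the chain.from_iterable idea. It is a variation, not the exact call above: it builds a plain dict per column instead of going through df.apply, so the shape of the result is explicit. The example DataFrame is an assumption, chosen to have NaN and list cells.

```python
from itertools import chain

import numpy as np
import pandas as pd

# Assumed example data: each cell is either a Python list or NaN.
df = pd.DataFrame({'d1': [np.nan, [1, 2], [4]], 'd2': [[3], np.nan, np.nan]})

# For each column, drop the NaN cells and concatenate the remaining lists.
flat = {col: list(chain.from_iterable(df[col].dropna())) for col in df.columns}

print(flat)  # {'d1': [1, 2, 4], 'd2': [3]}
```

If a Series is preferred, pd.Series(flat) gives one list per column label.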
I can see why you'd think df.sum would work here, even with skipna=True set explicitly, but the vectorized df.sum behaves oddly in this situation. Then again, that is one of the downsides of using a DataFrame with lists in it.
Upvotes: 1