PATRICIO GALEAS

Reputation: 91

How to sum columns containing arrays

I am having trouble summing the columns of a DataFrame that contains an array in each cell. I tried to sum the columns using df.sum(), expecting to get the total array for each column, for example [4,1,1,4,1] for the column 'common'. But I only got an empty Series.

df_sum = df.sum()
print(df_sum)

Series([], dtype: float64)

How can I get the summed columns in this case?

Upvotes: 1

Views: 1954

Answers (2)

jxc

Reputation: 13998

IIUC, you can probably just use a list comprehension to handle your task:

df = pd.DataFrame({'d1':[np.nan, [1,2], [4]], 'd2':[[3], np.nan, np.nan]})

>>> df
       d1   d2
0     NaN  [3]
1  [1, 2]  NaN
2     [4]  NaN

df_sum = [i for a in df['d1'] if type(a) is list for i in a]

>>> df_sum
[1, 2, 4]

If you need to sum over the whole DataFrame (or multiple columns), use numpy.ravel() to flatten the DataFrame's values before applying the list comprehension:

df_sum = [i for a in np.ravel(df.values) if type(a) is list for i in a]

>>> df_sum
[3, 1, 2, 4]
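
As a rough sketch of a per-column variant of the same idea, each column can be flattened separately instead of merging the whole frame into one list. The helper name flatten_column is hypothetical (it is not part of pandas), and isinstance is used in place of the type(a) is list check:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'d1': [np.nan, [1, 2], [4]], 'd2': [[3], np.nan, np.nan]})

def flatten_column(col):
    # Keep only cells that hold a list (this skips NaN) and concatenate their items.
    return [i for a in col if isinstance(a, list) for i in a]

# One flattened list per column, keyed by column name.
per_column = {name: flatten_column(df[name]) for name in df.columns}
print(per_column)  # {'d1': [1, 2, 4], 'd2': [3]}
```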

Upvotes: 0

rafaelc

Reputation: 59274

Well, working with object dtypes in pandas DataFrames is usually not a good idea, especially when filling cells with Python lists, because you lose performance.

Nevertheless, you may accomplish this by using itertools.chain.from_iterable (with itertools imported as it):

df.apply(lambda s: list(it.chain.from_iterable(s.dropna())))

You may also use sum, but I'd say it's slower, since every addition builds a new list:

df.apply(lambda s: s.dropna().sum())
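
For reference, here is a self-contained sketch of both approaches side by side. The sample frame is borrowed from the other answer, and a dict comprehension stands in for df.apply so that the result shape does not depend on how apply wraps list-like return values. Treat it as an illustration, not a benchmark:

```python
import itertools as it

import numpy as np
import pandas as pd

df = pd.DataFrame({'d1': [np.nan, [1, 2], [4]], 'd2': [[3], np.nan, np.nan]})

# Approach 1: chain the non-null lists of each column into one flat list.
chained = {name: list(it.chain.from_iterable(df[name].dropna()))
           for name in df.columns}

# Approach 2: let Series.sum concatenate the lists with repeated "+".
summed = {name: df[name].dropna().sum() for name in df.columns}

print(chained)
print(summed)
```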

I can see why you'd think df.sum would work here, even with skipna=True set explicitly, but the vectorized df.sum behaves oddly in this situation (it returns an empty Series, as you saw). Then again, these are the downsides of using a DataFrame with lists in it.
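
If the expected result in the question ([4,1,1,4,1] for 'common') is meant as an element-wise total of equal-length arrays and not a concatenation, which is an assumption on my part, something along these lines may help. The sample data and the helper name elementwise_sum are made up for illustration; only the column name 'common' comes from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: equal-length lists per cell, with one missing cell.
df = pd.DataFrame({'common': [[1, 0, 1, 2, 0], np.nan, [3, 1, 0, 2, 1]]})

def elementwise_sum(col):
    # Drop NaN cells, stack the remaining lists into a 2-D array,
    # then add the rows together position by position.
    rows = col.dropna().tolist()
    return np.sum(np.array(rows), axis=0).tolist()

totals = {name: elementwise_sum(df[name]) for name in df.columns}
print(totals)  # {'common': [4, 1, 1, 4, 1]}
```

This only works when every list in a column has the same length; ragged lists would need padding or a different approach.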

Upvotes: 1
