Ben C Wang

Reputation: 607

Faster way to count total occurrences of values in a column of lists in pandas?

I have a column in a pandas dataframe which contains a large number of lists of labels:

>>> data['SPLIT'].head(10)
0    [33.23, 88.72, 38.93, E931.7, V09.0, 041.11, 5...
1    [99.04, 38.06, 39.57, 00.91, 55.69, V15.82, 27...
2    [96.04, 96.72, 401.9, 276.5, 584.9, 428.0, 507...
3    [96.6, 99.15, 99.83, V29.0, 765.15, 765.25, 77...
4    [96.71, 96.04, 54.12, 99.60, 38.93, 99.15, 53....
5    [88.72, 37.61, 39.61, 36.15, 36.12, 272.0, 401...
6    [38.93, 88.72, 37.31, 272.4, 719.46, 722.0, 31...
7    [88.72, 39.61, 35.71, 272.4, V12.59, 458.29, 7...
8    [97.44, 99.04, 88.56, 37.23, 39.95, 38.95, 00....
9    [00.14, 89.61, 39.95, E878.8, 244.9, 443.9, 18...

What I am trying to do is iterate through all of these lists to count the total occurrences of each value, so that I can find the 50 most frequently occurring values.

This is the code I have used, which runs extremely slowly:

test = pd.Series(sum([item for item in data.SPLIT], [])).value_counts()

I also tried writing a separate function to loop through the lists and count values, but it was very slow as well.

Is there any way I can modify this data or use a function in pandas that has similar performance to something like df.groupby.count()?

I searched Google and Stack Overflow for half an hour, but none of the answers performed any better. I have been trying for a long time to find a faster way to flatten the lists or map the counts (the data has 500k rows, and the length of each list varies: some are as long as 512 items, others as short as 2).

Upvotes: 6

Views: 7288

Answers (2)

jezrael

Reputation: 862511

Use a list comprehension with flattening instead of sum:

test = pd.Series([x for item in data.SPLIT for x in item]).value_counts()
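Since value_counts already sorts in descending order, getting the question's 50 most frequent values is then just a matter of taking the head (a minimal usage sketch; top50 is only an illustrative name):

# value_counts sorts descending, so the first 50 rows are the 50 most frequent labels
top50 = test.head(50)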

Or flatten with chain.from_iterable:

from itertools import chain

test = pd.Series(list(chain.from_iterable(data.SPLIT))).value_counts()

Or use collections.Counter:

from itertools import chain
from collections import Counter

test = pd.Series(Counter(chain.from_iterable(data.SPLIT)))
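Note that a Series built directly from a Counter keeps insertion order rather than count order, so a sort is still needed to get the top 50 (a small sketch reusing the test Series above; Counter.most_common is the stdlib alternative):

# a Series built from a Counter is not sorted by frequency
top50 = test.sort_values(ascending=False).head(50)

# equivalently, Counter can report the most frequent entries directly
top50_pairs = Counter(chain.from_iterable(data.SPLIT)).most_common(50)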

Or:

import functools, operator

test = pd.Series(functools.reduce(operator.iconcat, data.SPLIT, [])).value_counts()

Pure pandas solution:

test = pd.DataFrame(data.SPLIT.values.tolist()).stack().value_counts()
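Which variant wins depends on the data, so it is worth timing them on a sample; below is a rough, self-contained benchmark sketch on synthetic lists (scaled down to 10k rows from the question's 500k; no timings claimed):

import timeit
from itertools import chain

import numpy as np
import pandas as pd

# synthetic stand-in for data.SPLIT: variable-length lists of string labels
rng = np.random.default_rng(0)
labels = [str(x) for x in range(1000)]
split = pd.Series([list(rng.choice(labels, size=rng.integers(2, 50)))
                   for _ in range(10_000)])

def by_comprehension():
    return pd.Series([x for item in split for x in item]).value_counts()

def by_chain():
    return pd.Series(list(chain.from_iterable(split))).value_counts()

def by_stack():
    return pd.DataFrame(split.values.tolist()).stack().value_counts()

for fn in (by_comprehension, by_chain, by_stack):
    # run each variant a few times and keep the best time
    print(fn.__name__, min(timeit.repeat(fn, number=3, repeat=3)))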

Upvotes: 10

pythonjokeun

Reputation: 431

How about this?

import pandas as pd

split = data["SPLIT"].apply(pd.Series)
split = split.rename(columns = lambda x : 'val_' + str(x))
split.melt(value_name="val").groupby(["val"]).size()
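Note that groupby(...).size() orders the result by the group key rather than by count, so the 50 most frequent labels still need a sort (a small sketch reusing the counts Series from the snippet above):

# nlargest sorts by value, returning the 50 most common labels
top50 = counts.nlargest(50)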

Upvotes: 2
