cchev

Reputation: 179

df.value_counts() doesn't show the number of occurrences in the dataset

Here is a small sample of the data I'm working on.

[Image: sample of the data]

I'm trying to calculate how many times the same ID appears in the data using

df['Total Occurences'] = df['ID'].value_counts()

but nothing appears in the new column.

Thanks in advance :)

Upvotes: 1

Views: 627

Answers (2)

Pedro Maia

Reputation: 2722

Try this:

df['Total Occurences'] = df['ID'].apply(lambda x: df['ID'].value_counts()[x])

For better performance, compute df['ID'].value_counts() once and store it in a variable:

count = df['ID'].value_counts()
df['Total Occurences'] = df['ID'].apply(lambda x: count[x])

Upvotes: -1

Rodalm

Reputation: 5433

Using groupby + transform or value_counts + map should be the preferred ways of doing it.

df['Total Occurences'] = df.groupby('ID')['ID'].transform('count')

or

df['Total Occurences'] = df['ID'].map(df.value_counts('ID'))

Both ways are much faster than the other answer for large DataFrames.
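As a rough illustration of why the direct assignment in the question leaves the column empty: value_counts() returns a Series indexed by the ID values, while column assignment aligns on the DataFrame's row index, so labels that don't match become NaN. The sample data below is hypothetical (the original screenshot isn't available), and the sketch assumes string IDs with a default integer row index.

```python
import pandas as pd

# Hypothetical sample data standing in for the screenshot in the question.
df = pd.DataFrame({"ID": ["a", "b", "a", "c", "a", "b"]})

# counts is indexed by the ID values ("a", "b", "c"), not by row position.
counts = df["ID"].value_counts()

# Direct assignment aligns counts on df's row index (0..5). None of those
# labels exist in counts, so every row ends up as NaN.
broken = df.copy()
broken["Total Occurences"] = counts

# Looking up each row's ID in counts gives the per-row total instead.
via_map = df["ID"].map(counts)

# Equivalent result: count the rows in each ID group and broadcast it back.
via_transform = df.groupby("ID")["ID"].transform("count")

print(broken)
print(via_map.tolist())
print(via_transform.tolist())
```

Both map and transform return a Series that shares df's index, which is why assigning either of them to a new column works.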

Tests

import numpy as np
import pandas as pd

n = 10_000
# DataFrame with 'n' random IDs (50 possible values)
df = pd.DataFrame({'ID': np.random.randint(50, size=n)})

# using groupby + transform
>>> %timeit df.groupby('ID')['ID'].transform('count')
1.03 ms ± 43.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# using map + value_counts
>>> %timeit df['ID'].map(df['ID'].value_counts())
1.49 ms ± 286 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# using apply (Pedro's solution)
>>> %timeit df['ID'].apply(lambda x: df['ID'].value_counts()[x])
8.96 s ± 742 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# computing value_counts only once outside apply
>>> %%timeit
... counts = df['ID'].value_counts()
... df['ID'].apply(lambda x: counts[x])
57.6 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
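The %timeit magics only work in IPython/Jupyter. Outside of it, a comparison along the same lines could be sketched with the standard timeit module. The function names, the fixed seed, and the repeat count below are my own choices for illustration, and the apply variant that recomputes value_counts() for every row is left out because it takes seconds per call.

```python
import timeit

import numpy as np
import pandas as pd

n = 10_000
rng = np.random.default_rng(0)  # fixed seed (an assumption) for repeatable data
df = pd.DataFrame({"ID": rng.integers(50, size=n)})


def with_transform():
    # groupby + transform: per-group row count broadcast back to each row
    return df.groupby("ID")["ID"].transform("count")


def with_map():
    # value_counts + map: look up each ID in the counts Series
    return df["ID"].map(df["ID"].value_counts())


def with_apply_cached():
    # apply with value_counts computed once, outside the lambda
    counts = df["ID"].value_counts()
    return df["ID"].apply(lambda x: counts[x])


repeats = 10
for func in (with_transform, with_map, with_apply_cached):
    seconds = timeit.timeit(func, number=repeats) / repeats
    print(f"{func.__name__}: {seconds * 1e3:.2f} ms per call")
```

Absolute numbers will vary with the machine and the pandas version, so only the relative ordering is worth reading from the output.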

Upvotes: 5
