Reputation: 1037
I want to create a count of unique values from one of my Pandas dataframe columns and then add a new column with those counts to my original dataframe. I've tried a couple of different things: I created a pandas Series and then calculated counts with the value_counts method. I tried to merge these values back to my original dataframe, but the keys that I want to merge on are in the index (ix/loc).
Color Value
Red 100
Red 150
Blue 50
I want to return something like:
Color Value Counts
Red 100 2
Red 150 2
Blue 50 1
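The value_counts-then-merge workflow described above can be made to work by moving the counts out of the index with reset_index (a sketch; column names follow the example data):

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Red', 'Blue'], 'Value': [100, 150, 50]})

# value_counts() puts the colors in the index; reset_index() turns them
# back into a regular column so they can serve as a merge key.
counts = df['Color'].value_counts().reset_index()
counts.columns = ['Color', 'Counts']

df = df.merge(counts, on='Color', how='left')
print(df)
```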
Upvotes: 86
Views: 138888
Reputation: 1
Used the nunique method along with dropna=True to exclude NaN values. Also tested this in Google Colab.
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Red', 'Blue'], 'Value': [100, 150, 50]})
total_counts = df.groupby('Color')['Value'].nunique(dropna=True)
df['Counts'] = df['Color'].transform(lambda x: total_counts[x])
print(df)
Output:
  Color  Value  Counts
0   Red    100       2
1   Red    150       2
2  Blue     50       1
Upvotes: 0
Reputation: 15
While there are already plenty of great answers here, and I personally believe that
(given a dataframe df)
df['new_value_col'] = df.groupby('colname_to_count')['colname_to_count'].transform('count')
is one of the best and most straightforward options, I wanted to provide another method that I have used successfully.
import pandas as pd
import numpy as np
df['new_value_col'] = df.apply(lambda row: np.sum(df['col_to_count'] == row['col_to_count']), axis=1)
Where we are essentially turning the column that we want to count from into a series within the lambda expression and then using np.sum to count the occurrences of each value within the series.
Thought this could be useful, never bad to have multiple options!
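Applied to the question's example data (with 'Color' standing in for col_to_count), the sketch looks like this:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Color': ['Red', 'Red', 'Blue'], 'Value': [100, 150, 50]})

# Compare the whole column against each row's value and sum the boolean
# matches; axis=1 makes apply walk over the rows.
df['Counts'] = df.apply(lambda row: np.sum(df['Color'] == row['Color']), axis=1)
print(df)
```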
Upvotes: 1
Reputation: 8961
Create a column containing the count of repeated values. The values are temporary calculations computed from other columns. Very fast. Credit to @ZakS.
sum_A_B = df['A']+df['B']
sum_A_B_dict = sum_A_B.value_counts().to_dict()
df['sum_A_B'] = sum_A_B.map(sum_A_B_dict)
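A minimal runnable version of the idea, using made-up columns A and B:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [10, 10, 20]})

# Build the temporary column, count how often each combined value occurs,
# then map those counts back onto every row.
sum_A_B = df['A'] + df['B']
sum_A_B_dict = sum_A_B.value_counts().to_dict()
df['sum_A_B'] = sum_A_B.map(sum_A_B_dict)
```
Here A+B is [11, 11, 22], so the new column holds [2, 2, 1].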
Upvotes: 0
Reputation: 63516
This answer uses Series.map with Series.value_counts. It was tested with Pandas 1.1.
df['counts'] = df['attribute'].map(df['attribute'].value_counts())
Credit: comment by sacuL
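On the question's data, with 'Color' as the attribute column, the one-liner behaves like this:

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Red', 'Blue'], 'Value': [100, 150, 50]})

# value_counts() yields a Series keyed by each distinct value, so mapping
# the original column through it looks up the count for every row.
df['counts'] = df['Color'].map(df['Color'].value_counts())
```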
Upvotes: 13
Reputation: 1131
One other option:
z = df['Color'].value_counts()
z1 = z.to_dict()  # converts to dictionary
df['Count_Column'] = df['Color'].map(z1)
This option will give you a column with repeated values of the counts, corresponding to the frequency of each value in the 'Color' column.
Upvotes: 21
Reputation: 27125
df['Counts'] = df.Color.groupby(df.Color).transform('count')
You can do this with any series: group it by itself and call transform('count')
:
>>> series = pd.Series(['Red', 'Red', 'Blue'])
>>> series.groupby(series).transform('count')
0 2
1 2
2 1
dtype: int64
Upvotes: 5
Reputation: 18654
My initial thought would be to use a list comprehension as shown below but, as was pointed out in the comments, this is slower than the groupby and transform method. I will leave this answer to demonstrate WHAT NOT TO DO:
In [94]: df = pd.DataFrame({'Color': 'Red Red Blue'.split(), 'Value': [100, 150, 50]})
In [95]: df['Counts'] = [sum(df['Color'] == df['Color'][i]) for i in range(len(df))]
In [96]: df
Out[96]:
Color Value Counts
0 Red 100 2
1 Red 150 2
2 Blue 50 1
[3 rows x 3 columns]
@unutbu's method gets complicated for DataFrames with several columns, which makes this one simpler to code. If you are working with a small data frame, this is faster (see below), but otherwise, you should NOT use it.
In [97]: %timeit df = pd.DataFrame({'Color': 'Red Red Blue'.split(), 'Value': [100, 150, 50]}); df['Counts'] = df.groupby(['Color']).transform('count')
100 loops, best of 3: 2.87 ms per loop
In [98]: %timeit df = pd.DataFrame({'Color': 'Red Red Blue'.split(), 'Value': [100, 150, 50]}); df['Counts'] = [sum(df['Color'] == df['Color'][i]) for i in range(len(df))]
1000 loops, best of 3: 1.03 ms per loop
Upvotes: 3
Reputation: 880757
df['Counts'] = df.groupby(['Color'])['Value'].transform('count')
For example,
In [102]: df = pd.DataFrame({'Color': 'Red Red Blue'.split(), 'Value': [100, 150, 50]})
In [103]: df
Out[103]:
Color Value
0 Red 100
1 Red 150
2 Blue 50
In [104]: df['Counts'] = df.groupby(['Color'])['Value'].transform('count')
In [105]: df
Out[105]:
Color Value Counts
0 Red 100 2
1 Red 150 2
2 Blue 50 1
Note that transform('count') ignores NaNs. If you want to count NaNs, use transform(len).
To the anonymous editor: if you are getting an error while using transform('count'), it may be because your version of Pandas is too old. The above works with Pandas version 0.15 or newer.
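A small sketch of the NaN difference (the NaN-containing example values here are mine, not from the original answer):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Color': ['Red', 'Red', 'Blue'],
                   'Value': [100, np.nan, 50]})

# 'count' skips the NaN in Value; len counts every row in the group.
df['Counts'] = df.groupby(['Color'])['Value'].transform('count')
df['CountsWithNaN'] = df.groupby(['Color'])['Value'].transform(len)
```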
Upvotes: 85