Reputation: 3177
I have my data in a pandas DataFrame, as follows:
import pandas as pd

df1 = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],
                    'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})
So, my data looks like this:
----------------------------
index    A    B
0      yes  yes
1      yes   no
2      yes   no
3      yes   no
4       no  yes
5       no  yes
6      yes   no
7      yes  yes
8      yes  yes
9       no   no
----------------------------
I would like to transform it into another data frame. The expected output can be produced by the following Python script:
output = pd.DataFrame({'A':['no','no','yes','yes'],'B':['no','yes','no','yes'],'count':[1,2,4,3]})
So, my expected output looks like this:
--------------------------------------------
index    A     B    count
--------------------------------------------
0       no    no        1
1       no   yes        2
2      yes    no        4
3      yes   yes        3
--------------------------------------------
I can already find all the combinations and count them using the following command: mytable = df1.groupby(['A','B']).size()
However, the combinations end up stacked in a MultiIndex rather than in separate columns (shown below). I would like each value of a combination in its own column, plus one more column for the count. Is it possible to do that? May I have your suggestions? Thank you in advance.
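For reference, mytable comes out as a Series with the combinations stacked in a MultiIndex rather than in columns:
A    B
no   no     1
     yes    2
yes  no     4
     yes    3
dtype: int64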
Upvotes: 219
Views: 309724
Reputation: 313
Based on the accepted answer and @Bryan P's comment on the differences between count() and size(), I opted for count() for cleaner code, as below:
df1.groupby(['A','B']).count().reset_index()
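One caveat worth noting (my addition, using a hypothetical third column C): count() tallies non-null values in each non-grouped column, while size() counts rows. With only the two grouped columns there is nothing left to count, so count() needs at least one extra column:

import pandas as pd

df = pd.DataFrame({'A': ['yes', 'yes', 'no', 'no'],
                   'B': ['yes', 'no', 'yes', 'no'],
                   'C': [1, None, 3, 4]})        # hypothetical extra column

df.groupby(['A', 'B']).count().reset_index()     # C holds non-null counts (0 where C is null)
df.groupby(['A', 'B']).size()                    # row counts per group, nulls included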
Upvotes: 3
Reputation: 17794
In pandas 1.1.0+ you can use the method value_counts with DataFrames:
df.value_counts() # or df[['A', 'B']].value_counts()
Result:
A    B
yes  no     4
     yes    3
no   yes    2
     no     1
dtype: int64
Convert index to columns and sort by value counts:
df.value_counts(ascending=True).reset_index(name='count')
Result:
     A    B  count
0   no   no      1
1   no  yes      2
2  yes  yes      3
3  yes   no      4
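A version note (my addition; worth verifying against your pandas): from pandas 2.0 the Series returned by value_counts is itself named count, so a plain reset_index() already yields a count column without passing name=:

df.value_counts(ascending=True).reset_index()   # pandas 2.0+: count column is named automatically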
Upvotes: 54
Reputation: 11379
Placing @EdChum's very nice answer into a function count_unique_index.
The unique method only works on a pandas Series, not on a DataFrame. The function below reproduces the behavior of R's unique function:
unique returns a vector, data frame or array like x but with duplicate elements/rows removed.
It also adds a count of the occurrences, as requested by the OP.
def count_unique_index(df, by):
    return df.groupby(by).size().reset_index().rename(columns={0: 'count'})

df1 = pd.DataFrame({'A':['yes','yes','yes','yes','no','no','yes','yes','yes','no'],
                    'B':['yes','no','no','no','yes','yes','no','yes','yes','no']})
count_unique_index(df1, ['A','B'])
     A    B  count
0   no   no      1
1   no  yes      2
2  yes   no      4
3  yes  yes      3
Upvotes: -1
Reputation: 1161
I haven't done a timing test with this, but it was fun to try. Basically, convert the two columns into a single column of tuples, then call value_counts(), which finds the unique elements and counts them. Fiddle with zip again to put the columns in the order you want. You could probably make the steps more elegant, but working with tuples feels more natural to me for this problem.
import pandas as pd

b = pd.DataFrame({'A': ['yes','yes','yes','yes','no','no','yes','yes','yes','no'],
                  'B': ['yes','no','no','no','yes','yes','no','yes','yes','no']})
b['count'] = pd.Series(zip(b.A, b.B))                        # pack the two columns into (A, B) tuples
df = pd.DataFrame(b['count'].value_counts().reset_index())   # count the unique tuples
df['A'], df['B'] = zip(*df['index'])                         # unpack the tuples into separate columns
df = df.drop(columns='index')[['A', 'B', 'count']]           # reorder the columns
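On the OP's data this yields the expected columns, sorted by count descending (value_counts' default order):

     A    B  count
0  yes   no      4
1  yes  yes      3
2   no  yes      2
3   no   no      1

(A side note, my addition: on pandas 2.x, naming the tuple column 'count' may collide with value_counts' own 'count' naming when calling reset_index; a different temporary column name avoids that.)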
Upvotes: 1
Reputation: 1427
Slightly related, I was looking for the unique combinations and I came up with this method:
def unique_columns(df, columns):
    # Flag each row: True if its combination of `columns` values occurs exactly once
    result = pd.Series(index=df.index, dtype=object)
    groups = df.groupby(by=columns)
    for name, group in groups:
        is_unique = len(group) == 1
        result.loc[group.index] = is_unique
    assert not result.isnull().any()
    return result
And if you only want to assert that all combinations are unique:
df1.set_index(['A','B']).index.is_unique
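A quick usage sketch (my addition) on the OP's df1: the function returns a boolean Series flagging rows whose (A, B) combination occurs exactly once:

mask = unique_columns(df1, ['A', 'B'])
df1[mask]   # only row 9 ('no', 'no') has a unique (A, B) combination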
Upvotes: 1
Reputation: 393873
You can groupby on cols 'A' and 'B', call size, and then reset_index and rename the generated column:
In [26]:
df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})
Out[26]:
     A    B  count
0   no   no      1
1   no  yes      2
2  yes   no      4
3  yes  yes      3
Update
A little explanation: grouping on the 2 columns groups rows where the A and B values are the same, and calling size returns the number of rows in each group:
In[202]:
df1.groupby(['A','B']).size()
Out[202]:
A    B
no   no     1
     yes    2
yes  no     4
     yes    3
dtype: int64
So now, to restore the grouped columns, we call reset_index:
In[203]:
df1.groupby(['A','B']).size().reset_index()
Out[203]:
     A    B  0
0   no   no  1
1   no  yes  2
2  yes   no  4
3  yes  yes  3
This moves the grouped values back into columns, but the size aggregation lands in a generated column named 0, so we have to rename it:
In[204]:
df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})
Out[204]:
     A    B  count
0   no   no      1
1   no  yes      2
2  yes   no      4
3  yes  yes      3
groupby does accept the arg as_index, which we could have set to False so that it doesn't make the grouped columns the index, but this generates a Series and you'd still have to restore the index and so on:
In[205]:
df1.groupby(['A','B'], as_index=False).size()
Out[205]:
A    B
no   no     1
     yes    2
yes  no     4
     yes    3
dtype: int64
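A version note (my addition; behavior worth verifying against your pandas): in newer pandas (roughly 1.1+), as_index=False combined with size() returns a DataFrame with a size column, so the whole chain collapses to a rename:

df1.groupby(['A', 'B'], as_index=False).size().rename(columns={'size': 'count'})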
Upvotes: 330