Reputation: 4200
I have a pandas data frame in which I want to simplify duplicates (on one column, here the first) by adding a 'count' column (here the last column, preset with "1" for the line I am in). My data frame looks like this:
df = pandas.DataFrame([["a", ..., 1], # last row always 1 (this will be the 'count' column
["a", ..., 1], #"a" = identical, other values not necessarily
["b", ..., 1],
["c", ..., 1],
["a", ..., 1]
["d", ..., 1],
["d", ..., 1]])
Note that what I'm interested in is the first column with letters reoccurring. The other columns are not necessarily duplicates but can be ignored here. I want to go through the data frame line by line and do the following:
I'm not sure if the best way to do this would be in the same data frame or in a new one but would like to end up with the df like this:
df2 = pandas.DataFrame([["a", ..., 3], # no changes except for last column counting three instances of "a": this line and two further lines
# line deleted: "a" reoccurs
["b", ..., 1], # no changes
["c", ..., 1], # no changes
# line deleted: "a" reoccurs
["d", ..., 2], # no changes except last column counting two instances of "d": this line and one more
]) # line deleted: "d" reoccurs
I don't really know how to go about this and am hoping for some suggestions. Thanks in advance!
Upvotes: 2
Views: 3933
Reputation: 11424
The following code
import pandas as pd
df = pd.DataFrame({"first":["a", "b", "b", "a", "b", "c"], "second":range(6)})
result = df.groupby('first').first()
result['count'] = df['first'].value_counts()
result.reset_index(inplace=True)
creates the dataframe
first second
0 a 0
1 b 1
2 b 2
3 a 3
4 b 4
5 c 5
and turns it into
first second count
0 a 0 2
1 b 1 3
2 c 5 1
This is exactly what you need.
UPDATE. In the comments, you asked how to apply different aggregations to different columns. This is an example
import pandas as pd
df = pd.DataFrame({"first":["a", "b", "b", "a", "b", "c"],
"second":range(6), 'third': range(6)})
result = df.groupby('first').agg({'second': lambda x: x.iloc[0], 'third': max})
result['count'] = df['first'].value_counts()
result.reset_index(inplace=True)
which produces
first second third count
0 a 0 3 2
1 b 1 4 3
2 c 5 5 1
so that the second
and third
columns are aggregated differently.
Upvotes: 2
Reputation: 323236
Data from David
df.groupby('first').agg({'first':'count','second':'first'}).rename(columns={'first':'count'})
Out[1177]:
count second
first
a 2 0
b 3 1
c 1 5
Upvotes: 1