Ivo
Ivo

Reputation: 4200

pandas data frame: adding 'count' column for multiple occurrences on one column/ duplicates

I have a pandas data frame in which I want to simplify duplicates (on one column, here the first) by adding a 'count' column (here the last column, preset with "1" for the line I am in). My data frame looks like this:

df = pandas.DataFrame([["a", ..., 1], # last row always 1 (this will be the 'count' column
                       ["a", ..., 1], #"a" = identical, other values not necessarily
                       ["b", ..., 1],
                       ["c", ..., 1],
                       ["a", ..., 1]
                       ["d", ..., 1],
                       ["d", ..., 1]])

Note that what I'm interested in is the first column with letters reoccurring. The other columns are not necessarily duplicates but can be ignored here. I want to go through the data frame line by line and do the following:

I'm not sure if the best way to do this would be in the same data frame or in a new one but would like to end up with the df like this:

df2 = pandas.DataFrame([["a", ..., 3], # no changes except for last column counting three instances of "a": this line and two further lines
                                       # line deleted: "a" reoccurs
                       ["b", ..., 1],  # no changes
                       ["c", ..., 1],  # no changes
                                       # line deleted:  "a" reoccurs
                       ["d", ..., 2],  # no changes except last column counting two instances of "d": this line and one more
                                   ])  # line deleted:  "d" reoccurs

I don't really know how to go about this and am hoping for some suggestions. Thanks in advance!

Upvotes: 2

Views: 3933

Answers (2)

David Dale
David Dale

Reputation: 11424

The following code

import pandas as pd
df = pd.DataFrame({"first":["a", "b", "b", "a", "b", "c"], "second":range(6)})
result = df.groupby('first').first()
result['count'] = df['first'].value_counts()
result.reset_index(inplace=True)

creates the dataframe

  first  second
0     a       0
1     b       1
2     b       2
3     a       3
4     b       4
5     c       5

and turns it into

  first  second  count
0     a       0      2
1     b       1      3
2     c       5      1

This is exactly what you need.

UPDATE. In the comments, you asked how to apply different aggregations to different columns. This is an example

import pandas as pd
df = pd.DataFrame({"first":["a", "b", "b", "a", "b", "c"], 
                   "second":range(6), 'third': range(6)})
result = df.groupby('first').agg({'second': lambda x: x.iloc[0], 'third': max})
result['count'] = df['first'].value_counts()
result.reset_index(inplace=True)

which produces

  first  second  third  count
0     a       0      3      2
1     b       1      4      3
2     c       5      5      1

so that the second and third columns are aggregated differently.

Upvotes: 2

BENY
BENY

Reputation: 323236

Data from David

df.groupby('first').agg({'first':'count','second':'first'}).rename(columns={'first':'count'})
Out[1177]: 
       count  second
first               
a          2       0
b          3       1
c          1       5

Upvotes: 1

Related Questions