pandas data frame: adding 'count' column for multiple occurrences on one column/ duplicates

Question

I have a pandas data frame in which I want to simplify duplicates (on one column, here the first) by adding a 'count' column (here the last column, preset with "1" for the line I am in). My data frame looks like this:

df = pandas.DataFrame([["a", ..., 1], # last row always 1 (this will be the 'count' column
                       ["a", ..., 1], #"a" = identical, other values not necessarily
                       ["b", ..., 1],
                       ["c", ..., 1],
                       ["a", ..., 1]
                       ["d", ..., 1],
                       ["d", ..., 1]])

Note that what I'm interested in is the first column with letters reoccurring. The other columns are not necessarily duplicates but can be ignored here. I want to go through the data frame line by line and do the following:

on the first occurrence of an instance in the first column (e.g. in the first column, "a" appears the first time), check that the value of the last column of this line is exactly one - set to one if it isn't.
on the second occurrence of this same instance (e.g. in line two, where "a" occurs again): drop this line and add +1 to the value of the last column in the row where this instance occurred the first time.

I'm not sure if the best way to do this would be in the same data frame or in a new one but would like to end up with the df like this:

df2 = pandas.DataFrame([["a", ..., 3], # no changes except for last column counting three instances of "a": this line and two further lines
                                       # line deleted: "a" reoccurs
                       ["b", ..., 1],  # no changes
                       ["c", ..., 1],  # no changes
                                       # line deleted:  "a" reoccurs
                       ["d", ..., 2],  # no changes except last column counting two instances of "d": this line and one more
                                   ])  # line deleted:  "d" reoccurs

I don't really know how to go about this and am hoping for some suggestions. Thanks in advance!

David Dale · Accepted Answer

The following code

import pandas as pd
df = pd.DataFrame({"first":["a", "b", "b", "a", "b", "c"], "second":range(6)})
result = df.groupby('first').first()
result['count'] = df['first'].value_counts()
result.reset_index(inplace=True)

creates the dataframe

  first  second
0     a       0
1     b       1
2     b       2
3     a       3
4     b       4
5     c       5

and turns it into

  first  second  count
0     a       0      2
1     b       1      3
2     c       5      1

This is exactly what you need.

UPDATE. In the comments, you asked how to apply different aggregations to different columns. This is an example

import pandas as pd
df = pd.DataFrame({"first":["a", "b", "b", "a", "b", "c"], 
                   "second":range(6), 'third': range(6)})
result = df.groupby('first').agg({'second': lambda x: x.iloc[0], 'third': max})
result['count'] = df['first'].value_counts()
result.reset_index(inplace=True)

which produces

  first  second  third  count
0     a       0      3      2
1     b       1      4      3
2     c       5      5      1

so that the second and third columns are aggregated differently.

pandas data frame: adding 'count' column for multiple occurrences on one column/ duplicates

Answers (2)

Related Questions

pandas data frame: adding &#39;count&#39; column for multiple occurrences on one column/ duplicates

Answers (2)

Related Questions

pandas data frame: adding 'count' column for multiple occurrences on one column/ duplicates