Reputation: 13

New column in pandas DataFrame which counts occurrences of a column value for all values below

I'm working with a fairly large dataset (500k rows) and would like to compute the running occurrences of IDs.

The dataset consists of one column with IDs and one column with total occurrences of each ID. I would like to add a new column which lists how often each ID appears in the column below the current row. Essentially, I want to add a python code which computes for each row "Occurences after" as shown in the screenshot below.

Total occurences -> Excel comparison

Occurences after -> Excel comparison

I tried looping over a subset of the dataframe but this is very slow (and also not a best practice at all, as I read). I think the apply method would be much more computationally efficient, however I can't think of a function that takes into account the current row position to ensure that only values below the current row are taken into account.

Upvotes: 1

Answers (3)

browntorso

Reputation: 1

import pandas as pd
id_df = pd.read_csv(file_path)
total_occ = []
occ_after = []
for i in range(len(id_df['ID'])):
    total_occ.append(id_df['ID'].value_counts()[id_df['ID'][i]])
    occ_after.append(list(id_df['ID'][i+1:]).count(id_df['ID'][i])+1) 
id_df['Total Occurences'] = total_occ
id_df['Occurences After'] = occ_after
print(id_df)
ID  Total Occurences    Occurences After
0   C101    3   3
1   D101    4   4
2   C101    3   2
3   D101    4   3
4   E101    2   2
5   F101    1   1
6   D101    4   2
7   C101    3   1
8   D101    4   1
9   E101    2   1

Upvotes: 0

Valdi_Bo

Reputation: 30971

Use the following code:

df['Total Occ'] = df.groupby('ID').ID.transform('count')
df['Occ After'] = df[::-1].groupby('ID').cumcount() + 1

(I shortened column names).

As you can see, just 2 one-liners (one for each column), without any loops (like in the other answer).

The result, for your data sample, is:

     ID  Total Occ  Occ After
0  C101          3          3
1  D101          4          4
2  C101          3          2
3  D101          4          3
4  E101          2          2
5  F101          1          1
6  D101          4          2
7  C101          3          1
8  D101          4          1
9  E101          2          1

Upvotes: 1

Matheus Rocha

Reputation: 83

ElPadrino, I've created an Dataframe containing your ID column to simulate your problem. To not loop over all the DataFrame, let's loop over the it's unique values. So, we set to all rows that ID column equals the unique value, and get the indexes of the occurrences.

With the indexes in hand, we can iterate and count how many occurrences we have when the indexes is equal or higher than index.

In this example, I've calculated the Total Occurrences column to have an output exactly like in your images.

The pandas.DataFrame.loc documentation could help you to understand it better, but basically we can access the DataFrame exactly where we want giving the index.

df.loc[index, 'column']

df['Total Occurences'] = ''
df['Occurences after'] = ''

for i in df['ID'].unique():
    df.loc[df['ID'] == i, 'Total Occurences'] = df[df['ID'] == i].value_counts().values[0]
    indexes = df[df['ID'] == i].index.values
    for index in indexes:
        df.loc[index, 'Occurences after'] = df[(df.index >= index) & (df['ID'] == i)].value_counts().values[0]

Here's the output:

Upvotes: 0

New column in pandas DataFrame which counts occurrences of a column value for all values below

Answers (3)

Related Questions