Reputation: 13
I'm working with a fairly large dataset (500k rows) and would like to compute the running occurrences of IDs.
The dataset consists of one column with IDs and one column with total occurrences of each ID. I would like to add a new column which lists how often each ID appears in the column below the current row. Essentially, I want to add a python code which computes for each row "Occurences after" as shown in the screenshot below.
Total occurences -> Excel comparison
Occurences after -> Excel comparison
I tried looping over a subset of the dataframe but this is very slow (and also not a best practice at all, as I read). I think the apply method would be much more computationally efficient, however I can't think of a function that takes into account the current row position to ensure that only values below the current row are taken into account.
Upvotes: 1
Views: 3291
Reputation: 1
import pandas as pd
id_df = pd.read_csv(file_path)
total_occ = []
occ_after = []
for i in range(len(id_df['ID'])):
total_occ.append(id_df['ID'].value_counts()[id_df['ID'][i]])
occ_after.append(list(id_df['ID'][i+1:]).count(id_df['ID'][i])+1)
id_df['Total Occurences'] = total_occ
id_df['Occurences After'] = occ_after
print(id_df)
ID Total Occurences Occurences After
0 C101 3 3
1 D101 4 4
2 C101 3 2
3 D101 4 3
4 E101 2 2
5 F101 1 1
6 D101 4 2
7 C101 3 1
8 D101 4 1
9 E101 2 1
Upvotes: 0
Reputation: 30971
Use the following code:
df['Total Occ'] = df.groupby('ID').ID.transform('count')
df['Occ After'] = df[::-1].groupby('ID').cumcount() + 1
(I shortened column names).
As you can see, just 2 one-liners (one for each column), without any loops (like in the other answer).
The result, for your data sample, is:
ID Total Occ Occ After
0 C101 3 3
1 D101 4 4
2 C101 3 2
3 D101 4 3
4 E101 2 2
5 F101 1 1
6 D101 4 2
7 C101 3 1
8 D101 4 1
9 E101 2 1
Upvotes: 1
Reputation: 83
ElPadrino, I've created an Dataframe containing your ID column to simulate your problem. To not loop over all the DataFrame, let's loop over the it's unique values. So, we set to all rows that ID column equals the unique value, and get the indexes of the occurrences.
With the indexes in hand, we can iterate and count how many occurrences we have when the indexes is equal or higher than index.
In this example, I've calculated the Total Occurrences column to have an output exactly like in your images.
The pandas.DataFrame.loc documentation could help you to understand it better, but basically we can access the DataFrame exactly where we want giving the index.
df.loc[index, 'column']
df['Total Occurences'] = ''
df['Occurences after'] = ''
for i in df['ID'].unique():
df.loc[df['ID'] == i, 'Total Occurences'] = df[df['ID'] == i].value_counts().values[0]
indexes = df[df['ID'] == i].index.values
for index in indexes:
df.loc[index, 'Occurences after'] = df[(df.index >= index) & (df['ID'] == i)].value_counts().values[0]
Here's the output:
Upvotes: 0