Reputation: 67
I have a very large dataset (10 GB) in CSV format with various columns and rows. One of the columns holds the IDs (represented as strings) of some class of individuals. The IDs are scrambled throughout the data, and each individual ID may occur more than once. I'd like to find the ID of the individual that occurs most frequently in the data. Ideally, I would like a count of how many times each ID occurs in the dataset. Eventually I'd also like to do statistical analysis on the individual IDs. What's the fastest way to accomplish this? I did try groupby, but I don't know how to find the ID corresponding to each group, or its size.
import pandas as pd
df = pd.read_csv('file')
user_groups = df.groupby(['IDs'])
Upvotes: 0
Views: 932
Reputation: 429
You can use pandas' value_counts.
value_counts: Returns object containing counts of unique values. The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.
df.ID.value_counts()
It is possible to use groupby too, as you mentioned in the question, and that is better when you are trying to do a simple frequency count:
df.groupby(['ID'], sort=True).size()
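A small sketch on toy data (the column name 'ID' is assumed) showing that both approaches agree:

```python
import pandas as pd

# Toy DataFrame standing in for the real CSV
df = pd.DataFrame({'ID': ['a1', 'b2', 'a1', 'c3', 'a1', 'b2']})

# value_counts is descending, so the first label is the most frequent ID
counts = df['ID'].value_counts()
print(counts.index[0], counts.iloc[0])  # a1 3

# groupby gives the same numbers, as a Series indexed by ID
sizes = df.groupby('ID').size()
```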
You can see the computational cost of each approach, and an explanation of why one method is faster than the other, in this answer. A TL;DR of that answer:
m        grouper    counter
10       62.9 ms    315 ms
10**3    191 ms     535 ms
10**7    514 ms     459 ms
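Since the file is 10 GB, it may not fit in memory at once. A hedged sketch (assuming the column is literally named 'IDs') that reads the CSV in chunks and sums the per-chunk counts:

```python
import pandas as pd

def count_ids_chunked(path, column='IDs', chunksize=1_000_000):
    """Accumulate value counts chunk by chunk, so the whole file
    never has to be loaded into memory at once."""
    total = pd.Series(dtype='int64')
    for chunk in pd.read_csv(path, usecols=[column], chunksize=chunksize):
        total = total.add(chunk[column].value_counts(), fill_value=0)
    # Descending order: total.index[0] is the most frequent ID
    return total.sort_values(ascending=False).astype('int64')
```

Reading only the ID column with usecols keeps memory low even when the file has many other columns.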
This video shows how some of these operations work inside pandas, so it's useful if you want to understand which method uses more memory or performs more operations.
Upvotes: 3
Reputation: 182
This has been answered here: Most frequent value, if you just want to find the mode.
If you want to break it down by group, then try this:
df_2= df.groupby('Ids').count().reset_index()
df_2 = df_2.sort_values('replicate',ascending = False)
You can also rename one of the columns to Frequency:
df_2 = df_2.rename(columns = {'Column Name':'new name'})
Upvotes: 0
Reputation: 111
I'm not sure if this is the sort of answer you were looking for, but if it were me I would build a dictionary mapping each ID to its count, like this: {'1234ABC': 0, ...}
Then I would read each line of the large dataset (e.g. with readlines()) and, whenever an ID appears in that line, update its count in the dictionary.
Once all the lines of the dataset have been read, find the maximum count and return its associated ID number.
You could even sort the dictionary by count, from the most occurrences of an ID number to the least.
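A minimal sketch of that idea using collections.Counter with the csv module (the column name 'IDs' is an assumption), streaming the file one row at a time:

```python
import csv
from collections import Counter

def count_ids(path, id_column='IDs'):
    """Stream the CSV row by row, keeping one running count per ID."""
    counts = Counter()
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            counts[row[id_column]] += 1
    # counts.most_common() sorts from most to fewest occurrences;
    # counts.most_common(1)[0] is (most_frequent_id, its_count)
    return counts
```

Counter handles both the counting and the sorting-by-frequency that the text above describes.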
Upvotes: 0
Reputation: 1493
import pandas as pd
df = pd.read_csv('file')
# count() turns each non-ID column into the group's row count
user_groups = df.groupby('IDs', as_index=False).count()
# sort by whichever counted column; ascending=False puts the most frequent ID first
user_groups.sort_values('yourcountedcolumn', ascending=False)
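On toy data (the column names are placeholders, with 'IDs' assumed from the question), that looks like:

```python
import pandas as pd

# Toy data in place of the real CSV
df = pd.DataFrame({'IDs': ['a1', 'b2', 'a1', 'c3'], 'value': [1, 2, 3, 4]})

# count() fills every non-grouped column with the group's row count
user_groups = df.groupby('IDs', as_index=False).count()

# Sort by the counted column (here 'value'): most frequent ID first
ranked = user_groups.sort_values('value', ascending=False)
print(ranked.iloc[0]['IDs'])  # a1
```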
Upvotes: 1