deserthiker

Reputation: 67

Identifying the most frequently occurring value (string) in a column

I have a very large dataset (10 GB) in CSV format with various columns and rows. One of the columns contains IDs (represented as strings) for some class of individuals. The IDs appear in no particular order in the data, and each individual ID may occur more than once. I'd like to find the ID of the individual that occurs most frequently in the data. Ideally, I would like a count of how many times each ID occurs in the dataset. Eventually I'd also like to do statistical analysis on the individual IDs. What's the fastest way to accomplish this? I did try groupby, but I don't know how to find the ID corresponding to each group, or the group sizes.

import pandas as pd
df = pd.read_csv('file')
user_groups = df.groupby(['IDs'])

Upvotes: 0

Views: 932

Answers (4)

Lucas Araújo

Reputation: 429

You can use value_counts of pandas.

value_counts: Returns object containing counts of unique values. The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.

df.ID.value_counts()

It is possible to use group by too as you said on the question, and that is better when you are trying to do a simple count of frequency:

df.groupby(['ID'], sort=True).size()
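A minimal sketch of both approaches on a toy frame (the column name `IDs` is taken from the question; the sample values are made up):

```python
import pandas as pd

# Toy stand-in for the 10 GB CSV
df = pd.DataFrame({'IDs': ['a1', 'b2', 'a1', 'c3', 'a1', 'b2']})

# value_counts returns counts in descending order,
# so the first index label is the most frequent ID
counts = df['IDs'].value_counts()
top_id = counts.index[0]     # 'a1'
top_count = counts.iloc[0]   # 3

# Equivalent frequency table via groupby
sizes = df.groupby('IDs').size()
```

Either result is a Series indexed by ID, so you can keep it around for further statistical analysis.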

You can see the computational cost of each approach, and an explanation of why one method is faster than the other, in this answer. But a TL;DR of that answer:

m       grouper   counter
10      62.9 ms    315 ms
10**3    191 ms    535 ms
10**7    514 ms    459 ms

This video shows how some operations work inside pandas, so it's useful if you want to understand which method uses more memory or takes more operations.

Upvotes: 3

Ryan Ni

Reputation: 182

This has already been answered here: Most frequent value, if you just want to find the mode.

If you want a breakdown by group, then try this:

df_2 = df.groupby('Ids').count().reset_index()
df_2 = df_2.sort_values('replicate', ascending=False)  # 'replicate' is any other column; count() fills it with the frequencies

You can also rename one of the columns to Frequency:

df_2 = df_2.rename(columns = {'Column Name':'new name'})
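Putting the steps above together on a toy frame (here `value` is a hypothetical second column standing in for whatever else the CSV contains):

```python
import pandas as pd

df = pd.DataFrame({'Ids': ['x', 'y', 'x', 'x'],
                   'value': [1, 2, 3, 4]})  # 'value' stands in for any other column

# count() fills every other column with the per-group frequency
df_2 = df.groupby('Ids').count().reset_index()
df_2 = df_2.sort_values('value', ascending=False)
df_2 = df_2.rename(columns={'value': 'Frequency'})
# First row is now the most frequent ID
```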

Upvotes: 0

David Fawcett

Reputation: 111

I'm not sure if this is the sort of answer you were looking for, but if it were me I would build a dictionary mapping each ID to its count, like this: {'1234ABC': 0, ...}

Then I would use readlines() to read each line of the large dataset, and whenever an ID appears in a line, update its count in the dictionary.

Once it has finished reading all the lines of the dataset, find the maximum count and return its associated ID number.

You could even sort the dictionary items from the most occurrences of an ID number to the least.
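This answer's approach can be sketched without pandas using the standard library's csv module and collections.Counter (the column name `IDs` comes from the question; the sample data is made up, and an io.StringIO stands in for the real file):

```python
import csv
import io
from collections import Counter

# Simulated CSV stream; in practice you'd use open('file') instead
data = io.StringIO("IDs,score\n1234ABC,5\n9999XYZ,2\n1234ABC,7\n")

counts = Counter()
reader = csv.DictReader(data)
for row in reader:              # one line at a time, never the whole 10 GB in memory
    counts[row['IDs']] += 1

most_common_id, n = counts.most_common(1)[0]
```

Iterating the reader line by line keeps memory use flat regardless of file size, and `most_common()` gives the sorted breakdown the answer describes.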

Upvotes: 0

YusufUMS

Reputation: 1493

import pandas as pd
df = pd.read_csv('file')
user_groups = df.groupby('IDs', as_index=False).count()
# replace 'yourcountedcolumn' with any other column name; count() fills it with the frequencies
user_groups.sort_values('yourcountedcolumn', ascending=False)
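A runnable instance of this answer on made-up data (an io.StringIO stands in for the real file; `IDs` and `score` are assumed column names):

```python
import io
import pandas as pd

# Stand-in for pd.read_csv('file')
csv_text = "IDs,score\nA,1\nB,2\nA,3\nA,4\n"
df = pd.read_csv(io.StringIO(csv_text))

user_groups = df.groupby('IDs', as_index=False).count()
# 'score' plays the role of yourcountedcolumn; the top row holds the most frequent ID
top = user_groups.sort_values('score', ascending=False)
```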

Upvotes: 1
