Reputation: 313

How to count the occurrence of values in one pandas Dataframe if the values to count are in another (in a faster way)?

I have a (really big) pandas Dataframe df:

country  age  gender
Brazil    10     F
USA       20     F 
Brazil    10     F
USA       20     M
Brazil    10     M
USA       20     M

I have another pandas Dataframe freq:

 age  gender  counting
  10       F         0
  10       M         0
  20       F         0

I want to count how many times each pair of values in freq occurs in df:

 age  gender  counting
  10       F         2
  10       M         1
  20       F         1

I'm using this code, but it takes too long:

for row in df.itertuples(index=False):
   # row[1:3] is (age, gender); select both columns with a list
   freq.loc[np.all(freq[['age','gender']]==row[1:3],axis=1),'counting'] += 1

Is there a faster way to do that?

Please note:

I have to use freq because not all combinations (for instance, 20 and M) are desired
some columns in df may not be used
counting counts how many rows of df contain both values
freq may have more than 2 values to check for (this is just a small example)

Upvotes: 9

Answers (3)

Scott Boston

Reputation: 153460

Another way is to use reindex to filter the grouped counts down to the combinations listed in freq:

df.groupby(['gender', 'age']).count()\
  .reindex(pd.MultiIndex.from_arrays([freq['gender'], freq['age']]))

Output:

            country
gender age         
F      10         2
M      10         1
F      20         1
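A pair that is listed in freq but never occurs in df would come back as NaN from a plain reindex. The following is a minimal sketch, not part of the original answer, that assumes zero is the desired count for such pairs and that the result should have the same layout as freq. The extra (23, "M") row and the names `keys`, `wanted` and `result` are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Brazil", "USA", "Brazil", "USA", "Brazil", "USA"],
    "age": [10, 20, 10, 20, 10, 20],
    "gender": ["F", "F", "F", "M", "M", "M"],
})
# Assumption: freq also lists a pair, (23, "M"), that never occurs in df
freq = pd.DataFrame({"age": [10, 10, 20, 23], "gender": ["F", "M", "F", "M"]})

keys = ["age", "gender"]
# Build the index of wanted combinations directly from freq
wanted = pd.MultiIndex.from_frame(freq[keys])
# size() counts rows per group; fill_value=0 covers pairs absent from df
counts = df.groupby(keys).size().reindex(wanted, fill_value=0)
# Back to a flat frame with the same columns as freq
result = counts.rename("counting").reset_index()
print(result)
```

Using size() instead of count() avoids depending on a particular non-key column such as country.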

Upvotes: 8

Divakar

Reputation: 221524

We can bring NumPy into the mix for some performance (hopefully!). The idea is to reduce each (age, gender) pair to a single 1D integer ID, so that we can use the efficient bincount -

agec = np.r_[df.age,freq.age]
genderc = np.r_[df.gender,freq.gender]
aIDs,aU = pd.factorize(agec)
gIDs,gU = pd.factorize(genderc)
cIDs = aIDs*(gIDs.max()+1) + gIDs
count = np.bincount(cIDs[:len(df)], minlength=cIDs.max()+1)
freq['counting'] = count[cIDs[-len(freq):]]

Sample run -

In [44]: df
Out[44]: 
  country  age gender
0  Brazil   10      F
1     USA   20      F
2  Brazil   10      F
3     USA   20      M
4  Brazil   10      M
5     USA   20      M

In [45]: freq # introduced a missing element as the second row for variety
Out[45]: 
   age gender  counting
0   10      F         2
1   23      M         0
2   20      F         1

Specific scenario optimization #1

If the age column is known to contain only non-negative integers (bincount requires non-negative inputs), we can skip one factorize call. So, skip aIDs,aU = pd.factorize(agec) and compute cIDs instead with -

cIDs = agec*(gIDs.max()+1) + gIDs
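Since the question mentions that freq may have more than two columns to check, the same idea can be sketched for an arbitrary list of key columns. This is an illustrative generalization rather than the answer's own code: the helper name `count_combinations` is hypothetical, and it assumes the key columns contain no missing values and that the product of the numbers of unique values per column stays small enough for bincount to allocate.

```python
import numpy as np
import pandas as pd

def count_combinations(df, freq, keys):
    """Hypothetical helper: count rows of df matching each key combination in freq."""
    n = len(df)
    combined = np.zeros(n + len(freq), dtype=np.int64)
    for key in keys:
        # Factorize df and freq values together so equal values share one ID
        both = np.concatenate([df[key].to_numpy(), freq[key].to_numpy()])
        ids, uniques = pd.factorize(both)
        # Mixed-radix encoding: each distinct combination gets a distinct integer
        combined = combined * len(uniques) + ids
    # Count the IDs seen in df, then look up the IDs of the freq rows
    counts = np.bincount(combined[:n], minlength=combined.max() + 1)
    return counts[combined[n:]]

df = pd.DataFrame({
    "country": ["Brazil", "USA", "Brazil", "USA", "Brazil", "USA"],
    "age": [10, 20, 10, 20, 10, 20],
    "gender": ["F", "F", "F", "M", "M", "M"],
})
freq = pd.DataFrame({"age": [10, 23, 20], "gender": ["F", "M", "F"]})
freq["counting"] = count_combinations(df, freq, ["age", "gender"])
print(freq)
```

With the sample data from the run above, the counting column comes out as 2, 0, 1.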

Upvotes: 8

Ben.T

Reputation: 29635

You can do it with an inner merge to filter out the combinations in df that you don't want, then group by age and gender and take the size of each group, keeping the name counting. Then just reset_index to match your expected output.

freq = (df.merge(freq, on=['age', 'gender'], how='inner')
          .groupby(['age','gender'])['counting'].size()
          .reset_index())
print (freq)
   age gender  counting
0   10      F         2
1   10      M         1
2   20      F         1

Depending on the number of combinations you don't want, it could be faster to group df first and merge afterwards, like:

freq = (df.groupby(['age','gender']).size()
          .rename('counting').reset_index()
          .merge(freq[['age','gender']])
       )
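Both variants above use an inner merge, so a pair listed in freq that never occurs in df drops out of the result. If such pairs should be kept with a count of zero, one option is a left merge starting from freq. This is a sketch under that assumption, with an illustrative (23, "M") row added to freq; the names `keys`, `counts` and `result` are not from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Brazil", "USA", "Brazil", "USA", "Brazil", "USA"],
    "age": [10, 20, 10, 20, 10, 20],
    "gender": ["F", "F", "F", "M", "M", "M"],
})
# Assumption: (23, "M") is wanted in the output even though df never contains it
freq = pd.DataFrame({"age": [10, 23, 20], "gender": ["F", "M", "F"]})

keys = ["age", "gender"]
# Count every combination present in df once
counts = df.groupby(keys).size().rename("counting").reset_index()
# A left merge keeps every row of freq, in freq's order
result = freq[keys].merge(counts, on=keys, how="left")
# Pairs absent from df come back as NaN; turn them into integer zeros
result["counting"] = result["counting"].fillna(0).astype(int)
print(result)
```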

Upvotes: 10
