raven
raven

Reputation: 527

Comparing multiple columns of a massive DataFrame with complex duplicate rows

I have a massive dataframe df with around 10 million rows:

df.sort_values(['pair','x1','x2'])


x1    x1gen    x2     x2gen     y1     y1gen     y2     y2gen      pair
-------------------------------------------------------------------------------
A     male     H      female    a      male      d      male       0
A     male     W      male      a      male      d      male       0         (*)
A     male     KK     female    a      male      d      male       0         (**)
B     female   C      male      a      male      d      male       0         (-)
B     female   W      male      a      male      d      male       0         (*)
B     female   BB     female    a      male      d      male       0
B     female   KK     female    a      male      d      male       0         (**)
F     male     W      male      a      male      d      male       0         (*)
A     male     T      female    b      female    d      male       1
A     male     BB     female    b      female    d      male       1
B     female   C      male      b      female    d      male       1         (-)
D     male     E      male      b      female    d      male       1
A     male     C      male      b      female    e      female     2
...

Each column can be explained by the following:

My objective is to find four values per unique pair:

  1. male citing male
  2. male citing female
  3. female citing male
  4. female citing female

where, each citation network should not be counted more than once.

For example, in the given sample, x2 = W is appeared three times in pair = 0 (see (*)), so it should be counted once, not three times. Same applies to x2 = KK in pair = 0 (see (**)). However, we can count the same reference if it is a new pair. (C -> d in (-) is counted separately once per pair = 0 and pair = 1)

Hence, for the first pair pair = 0, the objective values are:

  1. male citing male = 4 (A -> a, F -> a, W -> d, C -> d)
  2. male citing female = 0
  3. female citing male = 4 (B -> a, H -> d, KK -> d, BB -> d)
  4. female citing female = 0

What I initially did was using a for loop and a set of if loops and creating four lists separately for x1 and x2:

mm = [1]
mf = [0]
fm = [0]
ff = [0]

mm1 = 1
mf1 = 0
fm1 = 0
ff1 = 0

for i in range(1, len(df)):
  if df['pair'][i] == df['pair'][i-1]:
    if df['x1'][i] != df['x1'][i-1]:
      if df['x1gen'][i] == 'male':
        if df['y1gen'][i] == 'male':
          mm1 += 1
        else:
          mf1 += 1
      else:
        if df['y1gen'][i] == 'male':
          fm1 += 1
        else:
          ff1 += 1
...

and the gist is analogous (the code itself is MANY lines long, but just a repetition of those lines). As one can tell, this is HIGHLY inefficient (takes around 120 minutes).

What is the optimal way to find such values without having to do a highly inefficient string-matching?

Upvotes: 0

Views: 63

Answers (1)

bb1
bb1

Reputation: 7863

You can try the following:

import io  
import re
import pandas as pd

# this just recreates the dataframe 
s = '''
x1    x1gen    x2     x2gen     y1     y1gen     y2     y2gen      pair
A     male     H      female    a      male      d      male       0
A     male     W      male      a      male      d      male       0
A     male     KK     female    a      male      d      male       0
B     female   C      male      a      male      d      male       0
B     female   W      male      a      male      d      male       0
B     female   BB     female    a      male      d      male       0
B     female   KK     female    a      male      d      male       0
F     male     W      male      a      male      d      male       0
A     male     T      female    b      female    d      male       1
A     male     BB     female    b      female    d      male       1
B     female   C      male      b      female    d      male       1
D     male     E      male      b      female    d      male       1
A     male     C      male      b      female    e      female     2
'''

s = re.sub(r" +", " ", s)
df = pd.read_csv(io.StringIO(s), sep=" ")
print(df)

It gives:

   x1   x1gen  x2   x2gen y1   y1gen y2   y2gen  pair
0   A    male   H  female  a    male  d    male     0
1   A    male   W    male  a    male  d    male     0
2   A    male  KK  female  a    male  d    male     0
3   B  female   C    male  a    male  d    male     0
4   B  female   W    male  a    male  d    male     0
5   B  female  BB  female  a    male  d    male     0
6   B  female  KK  female  a    male  d    male     0
7   F    male   W    male  a    male  d    male     0
8   A    male   T  female  b  female  d    male     1
9   A    male  BB  female  b  female  d    male     1
10  B  female   C    male  b  female  d    male     1
11  D    male   E    male  b  female  d    male     1
12  A    male   C    male  b  female  e  female     2

Counting citation pairs:

# count x1-> y1 pairs
df1 = df.drop_duplicates(subset=['x1', 'y1', 'pair'])
c1 = (df1['x1gen'] + '_' + df1['y1gen']).value_counts()

# count x2-> y2 pairs
df2 = df.drop_duplicates(subset=['x2', 'y2', 'pair'])
c2 = (df2['x2gen'] + '_' + df2['y2gen']).value_counts()

# add results
c1.add(c2, fill_value=0).astype(int)

This gives:

female_female    1
female_male      6
male_female      4
male_male        6

Computing results for each pair separately:

def cit_count(g):

    # count x2-> y2 pairs
    df1 = g.drop_duplicates(subset=['x1', 'y1'])
    c1 = (df1['x1gen'] + '_' + df1['y1gen']).value_counts()

    # count x2-> y2 pairs
    df2 = g.drop_duplicates(subset=['x2', 'y2'])
    c2 = (df2['x2gen'] + '_' + df2['y2gen']).value_counts()

    # add results
    return c1.add(c2, fill_value=0)

print(df.groupby('pair').apply(cit_count).unstack().fillna(0).astype(int))

It gives:

      female_female  female_male  male_female  male_male
pair                                                    
0                 0            4            0          4
1                 1            2            2          2
2                 0            0            2          0

Upvotes: 1

Related Questions