martin

Reputation: 1185

How to optimize searching and comparing rows in pandas?

I have two dfs (DataFrames): base has 100k rows, snps has 54k rows.

This is the structure of the dfs:

base:

SampleNum SampleIdInt   SecondName
  1         ASA2123313  A2123313
  2         ARR4112234  R4112234
  3         AFG4234122  G4234122
  4         GGF412233   F412233
  5         GTF423512   F423512
  6         POL23523552 L23523552
...

And this is the snps df:

   SampleNum SampleIdInt
      1         ART2114155
      2         KWW4112234
      3         AFG4234122 
      4         GGR9999999
      5         YUU33434324
      6         POL23523552
    ...

Now look, for example, at the 2nd row in snps and in base: they have the same number (the first 3 chars are not important to me right now).

So I created a list, common, containing the numbers from snps that also appear in base, i.e. all rows whose numbers match between the dfs (common has 15k elements).

common = [4112234, 4234122, 23523552]
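
For reference, this is a minimal sketch of how such a list could be built without a loop, assuming every SampleIdInt starts with a 3-letter prefix (note it yields strings rather than ints):

# Sketch: numeric parts that appear in both dataframes
base_nums = set(base['SampleIdInt'].str[3:])   # numeric parts in base
snps_nums = set(snps['SampleIdInt'].str[3:])   # numeric parts in snps
common = sorted(snps_nums & base_nums)         # e.g. ['23523552', '4112234', '4234122']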

And now I want to create three new lists:

confirmedSnps: rows where the whole SampleIdInt is identical to the one in base. In this example: AFG4234122. For these I am sure that SecondName will be correct.

un_confirmedSnps: rows where the number matches but the first three chars are different. Example: KWW4112234 in snps vs ARR4112234 in base. In this case I'm not sure that SecondName is correct, so I need to check it later.

And the last one, the moreThanOne list: it should collect all duplicated rows. For example, if base contains both KWW4112234 and AFG4112234, both should go to that list.

I wrote some code. It works fine, but the problem is time. I have 15k elements to filter, and each element takes about 4 seconds to process. That means the whole loop would run for about 17 hours! I'm looking for help optimizing this code.

That's my code:

confirmedSnps = []
un_confirmedSnps = []
moreThanOne = []
for i in range(len(common)):
    # unique ids in base that contain the current number
    testa = base[base['SampleIdInt'].str.contains(str(common[i]))]
    testa = testa.SampleIdInt.unique()
    print("StepOne")
    # unique ids in snps that contain the current number
    testb = snps[snps['SampleIdInt'].str.contains(str(common[i]))]
    testb = testb.SampleIdInt.unique()
    print("StepTwo")
    if len(testa) == 1 and len(testb) == 1:
        if testa[0] == testb[0]:
            confirmedSnps.append(testb)
        else:
            un_confirmedSnps.append(testb)
    else:
        print("testa has more than one matching record.")
        moreThanOne.append(testb)
    print("StepTHREE")
    print(i, "/", len(common))

I added the step prints to check which part takes most of the time: it's the code between StepOne and StepTwo. The first and third steps run instantly.
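
Each str.contains call scans an entire column, so the loop does on the order of len(common) × (len(base) + len(snps)) substring checks. A minimal sketch of one way to avoid the rescans (assuming the numeric part is always everything after a 3-letter prefix; base_groups and snps_groups are illustrative names) is to group the ids by their numeric part once, so each element of common becomes a single index lookup:

# Precompute numeric part -> unique ids once, instead of rescanning
# both dataframes for every element of `common`. Grouping on the exact
# numeric part also avoids accidental substring matches.
base_groups = base.groupby(base['SampleIdInt'].str[3:])['SampleIdInt'].unique()
snps_groups = snps.groupby(snps['SampleIdInt'].str[3:])['SampleIdInt'].unique()

confirmedSnps, un_confirmedSnps, moreThanOne = [], [], []
for num in common:
    testa = base_groups[str(num)]  # unique ids in base with this number
    testb = snps_groups[str(num)]  # unique ids in snps with this number
    if len(testa) == 1 and len(testb) == 1:
        if testa[0] == testb[0]:
            confirmedSnps.append(testb)
        else:
            un_confirmedSnps.append(testb)
    else:
        moreThanOne.append(testb)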

Can someone help me with this case? I'm sure many of you will see a better solution to this problem.

Upvotes: 0

Views: 48

Answers (1)

BlackBear

Reputation: 22979

What you are trying to do is commonly called a join, which, annoyingly enough, is called merge in pandas. There's just the minor annoyance of the three initial letters to deal with, but that's easy:

snps['numeric_id'] = snps['SampleIdInt'].str[3:]
base['numeric_id'] = base['SampleIdInt'].str[3:]

Now you can compute the three dataframes:

confirmed = snps.merge(base, on='SampleIdInt')

merged = snps.merge(base, on='numeric_id')  # clashing columns get _x/_y suffixes
unconfirmed = merged[merged['SampleIdInt_x'] != merged['SampleIdInt_y']]

more_than_one = base.groupby('numeric_id').filter(lambda g: len(g) > 1)

I bet it won't work, but hopefully you get the idea.
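
To make the idea concrete, here is a self-contained sketch runnable against the sample rows from the question (the _x/_y suffixes are what pandas merge appends to clashing column names; computing more_than_one on base rather than snps follows the question's example and is an assumption):

import pandas as pd

base = pd.DataFrame({
    'SampleIdInt': ['ASA2123313', 'ARR4112234', 'AFG4234122',
                    'GGF412233', 'GTF423512', 'POL23523552'],
    'SecondName': ['A2123313', 'R4112234', 'G4234122',
                   'F412233', 'F423512', 'L23523552'],
})
snps = pd.DataFrame({
    'SampleIdInt': ['ART2114155', 'KWW4112234', 'AFG4234122',
                    'GGR9999999', 'YUU33434324', 'POL23523552'],
})

# Strip the 3-letter prefix to get the shared numeric key.
snps['numeric_id'] = snps['SampleIdInt'].str[3:]
base['numeric_id'] = base['SampleIdInt'].str[3:]

# Exact matches on the full id.
confirmed = snps.merge(base, on='SampleIdInt')

# Same number, different prefix.
merged = snps.merge(base, on='numeric_id')
unconfirmed = merged[merged['SampleIdInt_x'] != merged['SampleIdInt_y']]

# Numeric ids occurring more than once within base (empty for this sample).
more_than_one = base.groupby('numeric_id').filter(lambda g: len(g) > 1)

print(confirmed['SampleIdInt'].tolist())      # ['AFG4234122', 'POL23523552']
print(unconfirmed['SampleIdInt_x'].tolist())  # ['KWW4112234']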

Upvotes: 1
