SeanTomlinson30

Reputation: 33

Optimizing a dataframe subset operation in Python

Summarize the Problem

I am trying to optimize some code I have written. In its current form it works as intended, but because of the sheer number of loops required, the script takes a very long time to run.

I'm looking for a way to speed up the code described below.

Detail the problem

Within this data frame, called master, there are 3,936,192 rows. The Position column represents a genomic window, and each value is present in the data frame 76 times, so that master[master['Position'] == 300] returns a dataframe of 76 rows, and similarly for each unique value of Position. I do some operations on each of these subsets of the data frame.
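To make that layout concrete, here is a minimal synthetic stand-in for master (column names taken from the question; the values themselves are made up), showing that the boolean mask returns one row per sample carrying that window:

```python
import pandas as pd

# Toy stand-in for master.csv: 3 windows x 4 samples
# (the real data has ~51,792 windows x 76 samples)
master = pd.DataFrame({
    "Position": [300, 300, 300, 300, 600, 600, 600, 600, 900, 900, 900, 900],
    "CNV": [2, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 2],
    "Normalised_coverage": [1.0, 1.2, 0.5, 1.1, 0.9, 0.8, 1.0, 1.1, 2.0, 2.1, 1.9, 2.0],
})

# One row per sample for that window: shape is (4, 3) here, (76, n_cols) in the real data
subset = master[master["Position"] == 300]
print(subset.shape)  # → (4, 3)
```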

The data can be found here

My current code takes the form:

import pandas as pd
import numpy as np

master = pd.read_csv(data_location)

windows = sorted(set(master['Position']))

window_factor = []

# loop through all the windows, look at the cohort of samples, ignore anything not CNV == 2
# if that means ignoring all of them, drop the window entirely
# else record half the mean of that window's normalised coverage across all samples.

for window in windows:
    current_window = master[master['Position'] == window]

    t = current_window[current_window['CNV'] == 2]

    if t.shape[0] == 0:
        window_factor.append('drop')

    else:
        window_factor.append(
            np.mean(current_window[current_window['CNV'] == 2]['Normalised_coverage'])/2)

However, this takes an exceptionally long time to run, and I can't figure out a way to speed it up, though I know there must be one.

Upvotes: 2

Views: 232

Answers (4)

rpanai

Reputation: 13447

Your df is not that big, and there are a few problems in your code:

  • If you use np.mean and one value is np.nan, it returns np.nan.
  • You can divide by 2 after calculating the mean.
  • It seems to me a perfect case for groupby.
  • You return a string while the other results are floats; you might consider using np.nan instead.

import pandas as pd

df = pd.read_csv("master.csv")

def fun(x):
    t = x[x["CNV"] == 2]
    return t["Normalised_coverage"].mean() / 2

# returns np.nan when len(t)==0
out = df.groupby('Position').apply(fun)

CPU times: user 34.7 s, sys: 72.5 ms, total: 34.8 s
Wall time: 34.7 s

Or, even faster, filter before the groupby:

%%time
out = df[df["CNV"]==2].groupby("Position")["Normalised_coverage"].mean()/2

CPU times: user 82.5 ms, sys: 8.03 ms, total: 90.5 ms
Wall time: 87.8 ms
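As a sanity check, the filter-then-groupby one-liner can be compared against the original loop on a small synthetic frame (the values here are made up):

```python
import pandas as pd

# Synthetic stand-in for master.csv (hypothetical values)
df = pd.DataFrame({
    "Position": [300, 300, 600, 600, 900, 900],
    "CNV":      [2,   1,   2,   2,   1,   1],
    "Normalised_coverage": [1.0, 9.9, 2.0, 4.0, 5.0, 6.0],
})

# Vectorised version: filter first, then group
out = df[df["CNV"] == 2].groupby("Position")["Normalised_coverage"].mean() / 2

# Original loop for comparison (windows with no CNV == 2 rows are dropped)
expected = {}
for window in sorted(df["Position"].unique()):
    t = df[(df["Position"] == window) & (df["CNV"] == 2)]
    if len(t):
        expected[window] = t["Normalised_coverage"].mean() / 2

assert out.to_dict() == expected  # → {300: 0.5, 600: 1.5}
```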

UPDATE: In the last case, if you really need to keep track of the groups where df["CNV"]!=2, you can use this code:

import numpy as np
bad = df[df["CNV"]!=2]["Position"].unique()
bad = list(set(bad)-set(out.index))

out = out.reset_index(name="value")

out1 = pd.DataFrame({"Position":bad,
                     "value":[np.nan]*len(bad)})

out = pd.concat([out,out1],
                ignore_index=True)\
        .sort_values("Position")\
        .reset_index(drop=True)

Which is going to add 160ms to your computation.
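If NaN placeholders for the dropped windows are all that is needed, a shorter sketch of the same idea reinstates the missing groups with reindex on the grouped result (again on made-up data):

```python
import pandas as pd

# Synthetic stand-in for master.csv (hypothetical values);
# window 600 has no CNV == 2 rows at all
df = pd.DataFrame({
    "Position": [300, 300, 600, 600],
    "CNV":      [2,   2,   1,   1],
    "Normalised_coverage": [1.0, 3.0, 5.0, 6.0],
})

out = df[df["CNV"] == 2].groupby("Position")["Normalised_coverage"].mean() / 2

# Reinstate windows that had no CNV == 2 rows as NaN
out = out.reindex(df["Position"].unique()).sort_index()
print(out)  # → Position 300 -> 1.0, 600 -> NaN
```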

Upvotes: 2

SeanTomlinson30

Reputation: 33

Using groupby and query was the solution I went with.

import pandas as pd

master = pd.read_csv("/home/sean/Desktop/master.csv", index_col=0)

g = master.query("CNV == 2").groupby("Position")

p = g.Normalised_coverage.mean() / 2

Upvotes: 0

Ysh Xiong

Reputation: 77

I think the .groupby() function is what you need here:

import numpy as np

fac = []
for name, group in master.groupby('Position'):
    if all(group['CNV'] != 2):
        fac.append('drop')
    else:
        fac.append(np.mean(group[group['CNV'] == 2]['Normalised_coverage']) / 2)


I downloaded your data master.csv; the data generated is exactly the same, and the running time decreased from 6 min to 30 sec on my laptop. Hope it helps.

Upvotes: 1

marcos

Reputation: 4510

You can do several things:

  • Instead of using a Python list for window_factor, consider using an np.array, since you know the length of the array in advance.
  • t is already current_window[current_window['CNV'] == 2]; reuse t when calculating np.mean.

You can also use a profiler to see which operations are expensive, or just consider using C++ and reimplementing the code (it's very simple).
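A minimal cProfile sketch of the profiling suggestion, using a toy stand-in for the per-window loop (the function and data here are made up for illustration):

```python
import cProfile
import io
import pstats

import pandas as pd

def subset_loop(df):
    # Toy stand-in for the original per-window loop
    return [df[df["Position"] == w].shape[0] for w in df["Position"].unique()]

df = pd.DataFrame({"Position": [300] * 5 + [600] * 5})

profiler = cProfile.Profile()
profiler.enable()
subset_loop(df)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats()
print(stream.getvalue())  # per-call breakdown; look for hot pandas indexing calls
```

Sorting by cumulative time makes it easy to see whether the repeated boolean-mask subsetting dominates the runtime.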

Upvotes: 1
