Reputation: 33
Summarize the Problem
I am trying to optimize some code I have written. In its current form it works as intended, but because of the sheer number of loops required, the script takes a very long time to run.
I'm looking for a way to speed up the code described below.
Detail the Problem
Within this data frame, called master, there are 3,936,192 rows. The Position column represents a genomic window; each unique Position appears in the data frame 76 times, such that master[master['Position'] == 300] returns a dataframe of 76 rows, and similarly for every other Position value. I do some operations on each of these subsets of the data frame.
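For reference, a quick check of that structure (a minimal sketch, assuming the column names above and the same data_location placeholder as in the code below):
import pandas as pd

master = pd.read_csv(data_location)
# every unique Position should appear exactly 76 times
print(master['Position'].value_counts().unique())  # expected: [76]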
The data can be found here
My current code takes the form:
import pandas as pd
import numpy as np

master = pd.read_csv(data_location)
windows = sorted(set(master['Position']))
window_factor = []
# loop through all the windows, look at the cohort of samples, ignore anything not CNV == 2;
# if that means ignoring all rows, then drop the window entirely,
# else record half the mean of that window's normalised coverage across all samples
for window in windows:
    current_window = master[master['Position'] == window]
    t = current_window[current_window['CNV'] == 2]
    if t.shape[0] == 0:
        window_factor.append('drop')
    else:
        window_factor.append(
            np.mean(current_window[current_window['CNV'] == 2]['Normalised_coverage']) / 2)
However, this takes an exceptionally long time to run, and I can't figure out a way to speed it up, though I know there must be one.
Upvotes: 2
Views: 232
Reputation: 13447
Your df is not that big, and there are a few problems in your code: np.mean returns np.nan if even one value is np.nan, and 'drop' is a string while the other results are float, so you might consider using np.nan instead. With groupby you can avoid filtering the frame once per window:
import pandas as pd
df = pd.read_csv("master.csv")
def fun(x):
    t = x[x["CNV"] == 2]
    # .mean() returns np.nan when len(t) == 0
    return t["Normalised_coverage"].mean() / 2

out = df.groupby('Position').apply(fun)
CPU times: user 34.7 s, sys: 72.5 ms, total: 34.8 s
Wall time: 34.7 s
Or, even faster, filter before the groupby:
%%time
out = df[df["CNV"]==2].groupby("Position")["Normalised_coverage"].mean()/2
CPU times: user 82.5 ms, sys: 8.03 ms, total: 90.5 ms
Wall time: 87.8 ms
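To check that this matches the apply-based result on windows that do have CNV == 2 rows (a quick sketch, reusing fun and df from above):
slow = df.groupby("Position").apply(fun).dropna()
fast = df[df["CNV"] == 2].groupby("Position")["Normalised_coverage"].mean() / 2
# both series are indexed by Position; names differ, values should not
pd.testing.assert_series_equal(slow, fast, check_names=False)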
UPDATE: in the last case, if you really need to keep track of the groups where df["CNV"] != 2, you can use this code:
import numpy as np

bad = df[df["CNV"] != 2]["Position"].unique()
bad = list(set(bad) - set(out.index))
out = out.reset_index(name="value")
out1 = pd.DataFrame({"Position": bad,
                     "value": [np.nan] * len(bad)})
out = pd.concat([out, out1],
                ignore_index=True)\
        .sort_values("Position")\
        .reset_index(drop=True)
This is going to add about 160 ms to your computation.
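An alternative sketch (my own variation, assuming out is the filtered groupby result from above): reindexing against every Position makes the missing windows appear as np.nan without building a second frame.
out = df[df["CNV"] == 2].groupby("Position")["Normalised_coverage"].mean() / 2
# windows with no CNV == 2 rows are absent from out; reindexing restores them as np.nan
out = out.reindex(np.sort(df["Position"].unique()))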
Upvotes: 2
Reputation: 33
Using groupby and query was the solution I went with.
import pandas as pd
import numpy as np

master = pd.read_csv("/home/sean/Desktop/master.csv", index_col=0)
windows = sorted(set(master['Position']))
# sanity check: each window appears 76 times
master.query("Position == 24386700").shape
# keep only CNV == 2 rows, then take half the mean coverage per window
g = master.query("CNV == 2").groupby("Position")
p = g.Normalised_coverage.mean() / 2
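If the 'drop' markers from the original loop are still needed, one possible follow-up (a sketch, assuming p and windows from above):
# windows absent from p had no CNV == 2 rows; mark them as 'drop' like the original loop did
factor = p.reindex(windows).astype(object)
factor[factor.isna()] = 'drop'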
Upvotes: 0
Reputation: 77
I think the .groupby() function is what you need here:
import numpy as np

fac = []
for name, group in master.groupby('Position'):
    if all(group['CNV'] != 2):
        fac.append('drop')
    else:
        fac.append(np.mean(group[group['CNV'] == 2]['Normalised_coverage']) / 2)
I downloaded your data, master.csv; the data generated is exactly the same, and the running time decreased from 6 min to 30 s on my laptop. Hope it helps.
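A small variation on the same loop (my own sketch, not from the timing above) keeps each factor keyed by its window, so results stay aligned with Position:
fac = {}
for name, group in master.groupby('Position'):
    sel = group.loc[group['CNV'] == 2, 'Normalised_coverage']
    fac[name] = 'drop' if sel.empty else sel.mean() / 2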
Upvotes: 1
Reputation: 4510
You can do several things:
t is already current_window[current_window['CNV'] == 2], so use t when calculating np.mean instead of filtering the frame a second time.
You can also use a profiler to see if there are operations that are expensive, or just consider using C++ and reimplementing the code (it's very simple).
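For the profiling suggestion, a minimal sketch with the standard-library cProfile (the function body is just the loop from the question; the file path is an assumption):
import cProfile
import pstats
import pandas as pd
import numpy as np

master = pd.read_csv("master.csv")  # path is an assumption

def slow_version():
    window_factor = []
    for window in sorted(set(master['Position'])):
        current_window = master[master['Position'] == window]
        t = current_window[current_window['CNV'] == 2]
        window_factor.append('drop' if t.shape[0] == 0
                             else np.mean(t['Normalised_coverage']) / 2)
    return window_factor

cProfile.run("slow_version()", "loop.prof")
# show the ten most expensive calls by cumulative time
pstats.Stats("loop.prof").sort_stats("cumulative").print_stats(10)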
Upvotes: 1