Reputation: 373
I'm trying to create a dataframe in Python by doing a cumulative sum over a sliding window of a particular size (x) that advances in fixed increments (y); if the sum within a window exceeds a predefined value (z), I want to add values (depending on a column) to another dataframe. I could do this with a ton of for loops, but that seems inefficient.
To describe the problem in detail, this is what my dataframe looks like:
│ Start │ End │ Count │
├────────┼────────┼───────┤
│ 49076 │ 49095 │ 1 │
│ 50296 │ 50318 │ 1 │
│ 53291 │ 53308 │ 1 │
│ 56213 │ 56232 │ 3 │
│ 86489 │ 86508 │ 13 │
│ 86489 │ 86508 │ 7 │
│ 150696 │ 150713 │ 1 │
│ 174807 │ 174824 │ 1 │
│ 192491 │ 192508 │ 1 │
│ 203977 │ 203996 │ 1 │
│ 269679 │ 269696 │ 1 │
⋮
and the window operates on Start, instead of the row index. This means that if the window size is 10000 (x), the first window will start at 49076 and end at 59076. With an increment of 1000 (y), the second window will start at 50076 and end at 60076, and so on. Now, if in any of these windows the sum of the Count column exceeds the threshold (z), I want the min(Start) and max(End) along with the sum for that window to be stored as a row in a new dataframe (with overlapping windows merged). Note: the window could theoretically also start at 0/1 and end at max(End).
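For reference, a minimal sketch of the windows described above, using the table's values and assumed parameters (x = 10000, y = 1000, z = 5 are placeholders, not values from the real dataset): each window keeps rows whose Start falls inside it, qualifying windows are merged when they overlap, and the sum is recomputed per merged region.

```python
import numpy as np
import pandas as pd

# Example data copied from the table above; x, y, z are assumed values.
df = pd.DataFrame({
    "Start": [49076, 50296, 53291, 56213, 86489, 86489,
              150696, 174807, 192491, 203977, 269679],
    "End":   [49095, 50318, 53308, 56232, 86508, 86508,
              150713, 174824, 192508, 203996, 269696],
    "Count": [1, 1, 1, 3, 13, 7, 1, 1, 1, 1, 1],
})
x, y, z = 10000, 1000, 5          # window size, increment, threshold

# collect (min Start, max End) for every window whose Count sum exceeds z
hits = []
for w0 in np.arange(df["Start"].min(), df["End"].max(), y):
    sel = df[(df["Start"] >= w0) & (df["Start"] < w0 + x)]
    if sel["Count"].sum() > z:
        hits.append((sel["Start"].min(), sel["End"].max()))

# merge overlapping hit regions (hits are already sorted by window start)
merged = []
for s, e in hits:
    if merged and s <= merged[-1][1]:
        merged[-1][1] = max(merged[-1][1], e)
    else:
        merged.append([s, e])

# recompute the sum over each merged region
result = pd.DataFrame(
    [(s, e, df.loc[df["Start"].between(s, e), "Count"].sum())
     for s, e in merged],
    columns=["Start", "End", "Sum"],
)
print(result)
```

This still loops over window positions, so it is only a baseline for correctness, not the optimised answer being asked for.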
What would be the most optimised way to do this (using pandas, numpy or scikit-learn)?
Edit: I made a diagram to roughly describe what I want. In my case the pandas dataframe actually describes segments* of a long line. What I want is another dataframe where, if the sum of segments in a window exceeds a threshold, I get the start of the leftmost segment and the end of the rightmost segment as a row in a new dataframe. *(if we consider the Count column to contain just ones)
Note: I asked a similar question for Julia, which has since been deleted. This can be considered analogous to clustering, with the only difference being that the clusters are created on a line instead of in two or more dimensions, and the distance between points is just a subtraction of coordinates.
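To illustrate the clustering analogy (a hedged sketch with made-up coordinates and a hypothetical `max_gap` cut-off): in one dimension, clustering reduces to sorting the starts and cutting wherever the gap to the previous point exceeds some distance.

```python
import pandas as pd

# Example coordinates (assumed); points closer than max_gap share a cluster.
starts = pd.Series([49076, 50296, 53291, 56213, 86489, 86489, 150696])
max_gap = 10000  # hypothetical cut-off distance

# a new cluster begins wherever the gap to the previous start exceeds max_gap
cluster_id = (starts.sort_values().diff() > max_gap).cumsum()
print(cluster_id.tolist())
```

Grouping by `cluster_id` then gives min/max/sum per cluster, which is close to the desired output, though it does not reproduce the fixed-size sliding window exactly.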
Upvotes: 0
Views: 375
Reputation: 373
Okay - I'm posting the makeshift solution I came up with; I don't really recommend anyone use it.
for i in positiveFrame.chr.unique():
    workingFrame = positiveFrame[positiveFrame['chr'] == i]
    totalPeaks = workingFrame['count'].sum()
    # skip contigs that can never reach the threshold
    if totalPeaks < minimumSum:
        continue
    hits = workingFrame.start.tolist()
    # one row per coordinate of the contig -- very memory-hungry
    runningFrame = pd.DataFrame({'start': range(1, lengthFile[str(i)]), 'hit': 0})
    runningFrame.loc[runningFrame.start.isin(hits), 'hit'] = 1
    # slide over the contig in non-overlapping chunks of incrementSize rows
    for k in range(0, runningFrame.shape[0], incrementSize):
        z = runningFrame[k:k + incrementSize]
        if z['hit'].sum() > 49:  # hard-coded threshold
            positiveClusterChr.append(i)
            positiveClusterStart.append(z['hit'].eq(1).idxmax())
            positiveClusterEnd.append(z.hit[::-1].idxmax())
What is essentially happening here: I'm creating another dataframe with coordinates starting from 0 and ending at max(End); this step is highly inefficient and memory-consuming. The length of each unique chr is given by a lookup array (lengthFile). In the next step I'm assigning hits (counts) to the positions that match the start positions in my input dataframe. In the last step I'm traversing the expanded dataframe, slicing 1000 rows at a time and summing the hits; for the windows (dataframe slices) that pass the threshold, I take the min and max coordinates and store them in arrays. Buggy, highly inefficient, redundant code, but it works at a minimum level.
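The expanded per-coordinate dataframe can be avoided entirely. As a hedged sketch (example data assumed, not the real contigs): sort the starts once, build a prefix sum of the counts, and answer each window's sum with two binary searches instead of materialising one row per coordinate.

```python
import numpy as np

# Assumed example data, aligned element-for-element.
starts = np.array([49076, 50296, 53291, 56213, 86489, 86489])
counts = np.array([1, 1, 1, 3, 13, 7])

# sort once, then prefix-sum the counts: csum[i] = sum of counts[:i]
order = np.argsort(starts)
starts, counts = starts[order], counts[order]
csum = np.concatenate(([0], np.cumsum(counts)))

def window_sum(w0, w1):
    """Sum of Count for rows whose Start falls in [w0, w1)."""
    lo, hi = np.searchsorted(starts, [w0, w1])
    return int(csum[hi] - csum[lo])

print(window_sum(49076, 59076))
print(window_sum(77076, 87076))
```

Each window query is O(log n) over the number of segments, independent of contig length, which removes both the memory blow-up and the per-coordinate traversal.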
Upvotes: 0