Reputation: 373
I'm trying to create a dataframe in Python by doing a cumulative sum over a sliding window of a particular size (x) that advances in fixed increments (y); if the sum within a window exceeds a predefined value (z), I want to add values (depending on a column) to another dataframe. I could do this with a ton of for loops, but that seems inefficient.
To describe the problem in detail, this is what my dataframe looks like:
│ Start │ End │ Count │
├────────┼────────┼───────┤
│ 49076 │ 49095 │ 1 │
│ 50296 │ 50318 │ 1 │
│ 53291 │ 53308 │ 1 │
│ 56213 │ 56232 │ 3 │
│ 86489 │ 86508 │ 13 │
│ 86489 │ 86508 │ 7 │
│ 150696 │ 150713 │ 1 │
│ 174807 │ 174824 │ 1 │
│ 192491 │ 192508 │ 1 │
│ 203977 │ 203996 │ 1 │
│ 269679 │ 269696 │ 1 │
⋮
and the window operates on Start, instead of the row index. This means that if the window size is 10000 (x), the first window will start at 49076 and end at 59076. With an increment of 1000 (y), the second window will start at 50076 and end at 60076, and so on. Now, if in any of these windows the sum of the Count column exceeds the threshold (z), I want the min(Start) and max(End) along with the sum for that window to be stored as a row in a new dataframe (with overlapping windows merged). Note: the window could theoretically also start at 0/1 and end at max(End).
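For reference, a minimal sketch of the windows described above, using the table's values and assumed parameters (x = 10000, y = 1000, z = 5 are placeholders, not values from the real dataset): each window keeps rows whose Start falls inside it, qualifying windows are merged when they overlap, and the sum is recomputed per merged region.

```python
import numpy as np
import pandas as pd

# Example data copied from the table above; x, y, z are assumed values.
df = pd.DataFrame({
    "Start": [49076, 50296, 53291, 56213, 86489, 86489,
              150696, 174807, 192491, 203977, 269679],
    "End":   [49095, 50318, 53308, 56232, 86508, 86508,
              150713, 174824, 192508, 203996, 269696],
    "Count": [1, 1, 1, 3, 13, 7, 1, 1, 1, 1, 1],
})
x, y, z = 10000, 1000, 5          # window size, increment, threshold

# collect (min Start, max End) for every window whose Count sum exceeds z
hits = []
for w0 in np.arange(df["Start"].min(), df["End"].max(), y):
    sel = df[(df["Start"] >= w0) & (df["Start"] < w0 + x)]
    if sel["Count"].sum() > z:
        hits.append((sel["Start"].min(), sel["End"].max()))

# merge overlapping hit regions (hits are already sorted by window start)
merged = []
for s, e in hits:
    if merged and s <= merged[-1][1]:
        merged[-1][1] = max(merged[-1][1], e)
    else:
        merged.append([s, e])

# recompute the sum over each merged region
result = pd.DataFrame(
    [(s, e, df.loc[df["Start"].between(s, e), "Count"].sum())
     for s, e in merged],
    columns=["Start", "End", "Sum"],
)
print(result)
```

This still loops over window positions, so it is only a baseline for correctness, not the optimised answer being asked for.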
What would be the most optimised way to do this (using pandas, numpy or scikit-learn)?
Edit: I made a diagram to roughly describe what I want. In my case the pandas dataframe actually describes segments* of a long line. What I want is another dataframe where, if the sum of segments in a window exceeds a threshold, I get the start of the leftmost segment and the end of the rightmost segment as a row in a new dataframe. *(if we consider the Count column to contain just ones)
Note: I asked a similar question for Julia, which has since been deleted. This can be considered analogous to clustering, with the only difference being that the clusters are created on a line instead of in two or more dimensions, and the distance between points is just a subtraction of coordinates.
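To illustrate the clustering analogy (a hedged sketch with made-up coordinates and a hypothetical `max_gap` cut-off): in one dimension, clustering reduces to sorting the starts and cutting wherever the gap to the previous point exceeds some distance.

```python
import pandas as pd

# Example coordinates (assumed); points closer than max_gap share a cluster.
starts = pd.Series([49076, 50296, 53291, 56213, 86489, 86489, 150696])
max_gap = 10000  # hypothetical cut-off distance

# a new cluster begins wherever the gap to the previous start exceeds max_gap
cluster_id = (starts.sort_values().diff() > max_gap).cumsum()
print(cluster_id.tolist())
```

Grouping by `cluster_id` then gives min/max/sum per cluster, which is close to the desired output, though it does not reproduce the fixed-size sliding window exactly.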
Upvotes: 0
Views: 375
Reputation: 373
Okay - I'm posting the makeshift solution I came up with; I don't really recommend anyone use it.
for i in positiveFrame.chr.unique():
    workingFrame = positiveFrame[positiveFrame['chr'] == i]
    totalPeaks = workingFrame['count'].sum()
    # skip contigs that can never reach the threshold
    if totalPeaks < minimumSum:
        continue
    hits = workingFrame.start.tolist()
    # one row per coordinate of the contig -- very memory-hungry
    runningFrame = pd.DataFrame({'start': range(1, lengthFile[str(i)]), 'hit': 0})
    runningFrame.loc[runningFrame.start.isin(hits), 'hit'] = 1
    # slide over the contig in non-overlapping chunks of incrementSize rows
    for k in range(0, runningFrame.shape[0], incrementSize):
        z = runningFrame[k:k + incrementSize]
        if z['hit'].sum() > 49:  # hard-coded threshold
            positiveClusterChr.append(i)
            positiveClusterStart.append(z['hit'].eq(1).idxmax())
            positiveClusterEnd.append(z.hit[::-1].idxmax())
What is essentially happening here: I'm creating another dataframe with coordinates starting from 0 and ending at max(End); this step is highly inefficient and memory-consuming. The length of each unique chr is given by a lookup array (lengthFile). In the next step I'm assigning hits (counts) to the positions that match the start positions in my input dataframe. In the last step I'm traversing the expanded dataframe, slicing 1000 rows at a time and summing the hits; for the windows (dataframe slices) that pass the threshold, I take the min and max coordinates and store them in arrays. Buggy, highly inefficient, redundant code, but it works at a minimum level.
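The expanded per-coordinate dataframe can be avoided entirely. As a hedged sketch (example data assumed, not the real contigs): sort the starts once, build a prefix sum of the counts, and answer each window's sum with two binary searches instead of materialising one row per coordinate.

```python
import numpy as np

# Assumed example data, aligned element-for-element.
starts = np.array([49076, 50296, 53291, 56213, 86489, 86489])
counts = np.array([1, 1, 1, 3, 13, 7])

# sort once, then prefix-sum the counts: csum[i] = sum of counts[:i]
order = np.argsort(starts)
starts, counts = starts[order], counts[order]
csum = np.concatenate(([0], np.cumsum(counts)))

def window_sum(w0, w1):
    """Sum of Count for rows whose Start falls in [w0, w1)."""
    lo, hi = np.searchsorted(starts, [w0, w1])
    return int(csum[hi] - csum[lo])

print(window_sum(49076, 59076))
print(window_sum(77076, 87076))
```

Each window query is O(log n) over the number of segments, independent of contig length, which removes both the memory blow-up and the per-coordinate traversal.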
Upvotes: 0