ataraxis
ataraxis

Reputation: 1582

how to split a series in pandas based on another

I have two series in python pandas.

One from values within a file called values.csv. It looks like this:

time, value
0, 10312435
9, 45924523
11, 43423434
20, 42343552
...

Another one is called breaks.csv and it looks like this:

time
5
18
...

Problem: I would like to split the values.csv into separate frames based on the values in breaks.csv.

In the example above the first breakpoint would be 5, resulting in a file or set containing all entries within time \in [0, 5], therefore only the value 0, 10312435. The second breakpoint is 18, therefore the second batch of values should be within (5, 18] i.e. 9, 45924523 and 11, 43423434 and so forth.

Is something like this possible in pandas (or maybe some other easy-to-use python package)?

Upvotes: 2

Views: 834

Answers (2)

ataraxis
ataraxis

Reputation: 1582

I came up with the following, based on Pandas split DataFrame by column value

sim_dist_right = pandas.read_csv('sim/dist_right.csv', comment='#')
sim_round_indicator = pandas.read_csv('sim/round_indicator.csv', comment='#')

round_list = []
for index, row in sim_round_indicator.iterrows():
    print("splitting at " + str(row['time']))
    df_sep = sim_dist_right[sim_dist_right['time'] < row['time']]
    
    round_list.append(df_sep)
    print("separated a batch of " + str(len(df_sep)) + " elements")
    
    df_over = sim_dist_right[sim_dist_right['time'] >= row['time']]
    print(str(len(df_over)) + " elements over")
    
    sim_dist_right = df_over
    
print("splitted values into " + str(len(round_list)) + " batches")

Upvotes: 0

Mustafa Aydın
Mustafa Aydın

Reputation: 18306

You can first form bins from breaks.time and then assign categories to values.time with these bins using pd.cut:

import numpy as np

# intervals to fall into
bins = [-np.inf, *breaks.time, +np.inf]

# distinct labels of 0..N-1
labels = np.arange(len(bins) - 1)

# form a new column in `values` with assigned categories
values["cats"] = pd.cut(values.time, bins=bins, labels=labels)

At this point values looks like:

>>> values

   time     value cats
0     0  10312435    0
1     9  45924523    1
2    11  43423434    1
3    20  42343552    2

Now we can group by cats and, for example, form a list of dataframes:

# no need for `cats` column anymore, so we drop it when putting in
frames_list = [frame.drop(columns="cats")
               for _, frame in values.groupby("cats")[["time", "value"]]]

We can access the frames as

>>> frames_list[0]

   time     value
0     0  10312435


>>> frames_list[1]

   time     value
1     9  45924523
2    11  43423434

>>> frames_list[2]

   time     value
3    20  42343552

Upvotes: 2

Related Questions