Reputation:
I have a Dataframe looks like the following..
import pandas as pd
import numpy as np
# Create data set.
present = 12
died = 20
dataSet = {'id': ['A', 'A', 'A','A','B','B','B','C'],
'id_2': [1, 2, 3, 1, 1,2,3,1],
'start' : [9,13,12,11,9,20,22,13],
'end' : [14,22,21,19,10,30,24,18]}
# Create dataframe with data set and named columns.
df = pd.DataFrame(dataSet, columns= ['id', 'id_2', 'start','end'])
id id_2 start end
0 A 1 9 14
1 A 2 13 22
2 A 3 12 21
3 A 1 11 19
4 B 1 9 10
5 B 2 20 30
6 B 3 22 24
7 C 1 13 18
we have present = 12, and died = 20 and i want to filter the dataframe in a following diagram.
where pink box represents df_begin, yellow for df_between purple for df_end.
I want to combine this, but I had to do it separately as following. (present and died inclusive)
df_start = df.loc[(df['start'] <= present) & (df['end'] >=present)]
df_between = df.loc[(df['start'] >= present) & (df['end'] <= died)]
df_end = df.loc[(df['start'] <= died) & (df['end'] >= died)]
concat these three dataframes and drop duplicate will give me 3 colored box combined which is I want, but is there way to do in a simple/better/fancy way? (ie. imagine this dataframe is more than 1million - performance also matters..)
Hence, desired output would be..
id id_2 start end
0 A 1 9 14
1 A 2 13 22
2 A 3 12 21
3 A 1 11 19
4 B 2 20 30
5 C 1 13 18
Thank you!
Upvotes: 1
Views: 127
Reputation: 153500
IIUC, you can do it with two conditions and a negation with OR.
m1 = (df['end'] < present) #Find all ranges that end before present
m2 = (df['start'] > died) #Find all ranges that start after died
df[~(m1|m2)] #Negate to find all ranges that intercept and overlap present to died
Output:
id id_2 start end
0 A 1 9 14
1 A 2 13 22
2 A 3 12 21
3 A 1 11 19
5 B 2 20 30
7 C 1 13 18
Upvotes: 1
Reputation: 962
If I understand this correctly, you are looking to have three separates dataframes according to the logic specified, and at the same time be able to concat them into a single dataframe with no duplicates rapidly.
You can save your conditions as masks:
df_start_mask = (df['start'] <= present) & (df['end'] >=present)
df_between_mask = (df['start'] >= present) & (df['end'] <= died)
df_end_mask = (df['start'] <= died) & (df['end'] >= died)
Creating the three separate dataframes is similar to before:
df_start = df.loc[df_start_mask]
df_between = df.loc[df_between_mask]
df_end = df.loc[df_end_mask]
However creating the combined dataframe is much faster, because instead of having to concat, you can directly index from your original dataframe:
combined_df = df.loc[df_start_mask | df_between_mask | df_end_mask ]
Which returns the intended result:
>>> print(combined_df)
id id_2 start end
0 A 1 9 14
1 A 2 13 22
2 A 3 12 21
3 A 1 11 19
5 B 2 20 30
7 C 1 13 18
Upvotes: 1