Reputation: 103
Working with pandas, I have df1 indexed by time samples:
data = '''\
time flags input
8228835.0 53153.0 32768.0
8228837.0 53153.0 32768.0
8228839.0 53153.0 32768.0
8228841.0 53153.0 32768.0
8228843.0 61345.0 32768.0'''
fileobj = pd.compat.StringIO(data)
df1 = pd.read_csv(fileobj, sep='\s+', index_col='time')
df2 indicates time ranges with start and end to define ranges where the state of 'check' is True:
data = '''\
check start end
20536 True 8228837 8228993
20576 True 8232747 8232869
20554 True 8230621 8230761
20520 True 8227351 8227507
20480 True 8223549 8223669
20471 True 8221391 8221553'''
fileobj = pd.compat.StringIO(data)
df2 = pd.read_csv(fileobj, sep='\s+')
What I need to do is add a column for 'check' to df1 and fill out the actual time ranges defined in df2 with the value of True. All others should be False. An example result would be:
flags input check
time
8228835.0 53153.0 32768.0 False
8228837.0 53153.0 32768.0 True
8228839.0 53153.0 32768.0 True
8228841.0 53153.0 32768.0 True
8228843.0 61345.0 32768.0 True
....
8228994.0. 12424.0. 32768.0. False
Upvotes: 3
Views: 5007
Reputation: 453
A little late, but I was looking for a more general solution, that allowed to copy any columns from the interval array over to the "normal" indexed array (and not just set a bool flag).
For this I modified the IntervalIndex approach, which seems to work really nicely. I have not timed the results, though!
Note, that you can play around with the closed
argument for the IntervalIndex
depending on your requirements
import pandas as pd
def merge_from_interval(data, interval_array, start_col, end_col, colnames):
interval_array_with_index = interval_array.copy()
interval_array_with_index.index = pd.IntervalIndex.from_arrays(interval_array[start_col], interval_array[end_col], closed="left")
data_merged = data.merge(interval_array_with_index.reindex(data.index)[colnames], left_index=True, right_index=True)
return data_merged
intervals = pd.DataFrame({"start": [3, 6], "end": [5, 15], "value": [1, 2]})
data = pd.DataFrame(list(range(20)))
merge_from_interval(data, intervals, "start", "end", ["value"])
Upvotes: 0
Reputation: 164843
You can make a list or ranges, and then use pd.Index.isin
with itertools.chain
:
from itertools import chain
df2 = df2[df2['check']]
ranges = map(range, df2['start'], df2['end'])
df1['check'] = df1.index.isin(chain.from_iterable(ranges))
print(df1)
flags input check
time
8228835.0 53153.0 32768.0 False
8228837.0 53153.0 32768.0 True
8228839.0 53153.0 32768.0 True
8228841.0 53153.0 32768.0 True
8228843.0 61345.0 32768.0 True
Upvotes: 2
Reputation: 18924
A list comprehension using any()
. No clue about the actual performance though, would be nice if you could run the %timings for us!
df1['check'] = [any(start <= i <= end for start,end in
zip(df2['start'], df2['end'])) for i in df1.index]
print(df1)
Returns:
flags input check
time
8228835.0 53153.0 32768.0 False
8228837.0 53153.0 32768.0 True
8228839.0 53153.0 32768.0 True
8228841.0 53153.0 32768.0 True
8228843.0 61345.0 32768.0 True
Upvotes: 0
Reputation: 323396
I think you can using IntervalIndex
with loc
df2.index=pd.IntervalIndex.from_arrays(df2.start,df2.end,'both')
df2.loc[df.index]
Out[174]:
check start end
[1, 2] True 1 2
[4, 5] True 4 5
[7, 8] True 7 8
df['newcol']=df2.loc[df.index].check.values.tolist()
df
Out[176]:
flags input newcol
flags
2 2 32768.0 True
4 4 32768.0 True
7 7 32768.0 True
Upvotes: 2