Reputation: 103

Add/fill pandas column based on range in rows from another dataframe

Working with pandas, I have df1 indexed by time samples:

data = '''\
time       flags    input                  
8228835.0  53153.0  32768.0
8228837.0  53153.0  32768.0
8228839.0  53153.0  32768.0
8228841.0  53153.0  32768.0
8228843.0  61345.0  32768.0'''

fileobj = pd.compat.StringIO(data)
df1 = pd.read_csv(fileobj, sep='\s+', index_col='time')

df2 indicates time ranges with start and end to define ranges where the state of 'check' is True:

data = '''\
        check     start       end
20536   True   8228837   8228993
20576   True   8232747   8232869
20554   True   8230621   8230761
20520   True   8227351   8227507
20480   True   8223549   8223669
20471   True   8221391   8221553'''

fileobj = pd.compat.StringIO(data)
df2 = pd.read_csv(fileobj, sep='\s+')

What I need to do is add a column for 'check' to df1 and fill out the actual time ranges defined in df2 with the value of True. All others should be False. An example result would be:

             flags    input    check
time                       
8228835.0  53153.0  32768.0    False
8228837.0  53153.0  32768.0    True
8228839.0  53153.0  32768.0    True
8228841.0  53153.0  32768.0    True
8228843.0  61345.0  32768.0    True
....
8228994.0. 12424.0. 32768.0.   False

Upvotes: 3

Answers (4)

arthaigo

Reputation: 453

A little late, but I was looking for a more general solution, that allowed to copy any columns from the interval array over to the "normal" indexed array (and not just set a bool flag).

For this I modified the IntervalIndex approach, which seems to work really nicely. I have not timed the results, though!

Note, that you can play around with the closed argument for the IntervalIndex depending on your requirements

import pandas as pd


def merge_from_interval(data, interval_array, start_col, end_col, colnames):
    interval_array_with_index = interval_array.copy()

    interval_array_with_index.index = pd.IntervalIndex.from_arrays(interval_array[start_col], interval_array[end_col], closed="left")

    data_merged = data.merge(interval_array_with_index.reindex(data.index)[colnames], left_index=True, right_index=True)
    return data_merged


intervals = pd.DataFrame({"start": [3, 6], "end": [5, 15], "value": [1, 2]})

data = pd.DataFrame(list(range(20)))

merge_from_interval(data, intervals, "start", "end", ["value"])

Upvotes: 0

jpp

Reputation: 164843

You can make a list or ranges, and then use pd.Index.isin with itertools.chain:

from itertools import chain

df2 = df2[df2['check']]

ranges = map(range, df2['start'], df2['end'])

df1['check'] = df1.index.isin(chain.from_iterable(ranges))

print(df1)

             flags    input  check
time                              
8228835.0  53153.0  32768.0  False
8228837.0  53153.0  32768.0   True
8228839.0  53153.0  32768.0   True
8228841.0  53153.0  32768.0   True
8228843.0  61345.0  32768.0   True

Upvotes: 2

Anton vBR

Reputation: 18924

A list comprehension using any(). No clue about the actual performance though, would be nice if you could run the %timings for us!

df1['check'] = [any(start <= i <= end for start,end in 
                    zip(df2['start'], df2['end'])) for i in df1.index]

print(df1)

Returns:

             flags    input  check
time                              
8228835.0  53153.0  32768.0  False
8228837.0  53153.0  32768.0   True
8228839.0  53153.0  32768.0   True
8228841.0  53153.0  32768.0   True
8228843.0  61345.0  32768.0   True

Upvotes: 0

BENY

Reputation: 323396

I think you can using IntervalIndex with loc

df2.index=pd.IntervalIndex.from_arrays(df2.start,df2.end,'both')
df2.loc[df.index]
Out[174]: 
        check  start  end
[1, 2]   True      1    2
[4, 5]   True      4    5
[7, 8]   True      7    8
df['newcol']=df2.loc[df.index].check.values.tolist()
df
Out[176]: 
       flags    input  newcol
flags                        
2          2  32768.0    True
4          4  32768.0    True
7          7  32768.0    True

Upvotes: 2

Add/fill pandas column based on range in rows from another dataframe

Answers (4)

Related Questions