Reputation: 475

Pandas retrieve values based on columns in df1 that represents intervals from df2

I've got this structured pandas dataframe DF1:

import pandas as pd
df = pd.DataFrame({'id': [1, 2], 'name': ['A', 'B'], 'ex': ['A1', 'B1'], 'init': ['1,3,5,7,', '10,12,15,17,20,'], 'end':['2,4,6,8,', '15,18,21,24,32']}, columns=['id', 'name', 'ex', 'init', 'end'])
print df

id  name  ex   init               end
1    A    A1   1,3,5,7,           2,4,6,8,
2    B    B1   10,12,15,17,20,    15,18,21,24,32
...

init and end columns represents intervals (init-end) so: A -> (1,2),(3,4),(5,6),(7,8), B-> (10,15),(12,18),(15,21),(17,24), (20,32),... By the other hand, there is another dataframe DF2 that also represents intervals:

import pandas as pd
df2 = pd.DataFrame({'id': [1, 2], 'init': ['3', '16'], 'end':['6', '22']}, columns=['id', 'init', 'end'])
print df2

id   init   end
1     3      6
2     16     22
...

Is there any smart and efficient way to check what ranges of DF1 contains the range in DF2? In the example, output will be a new DF3:

id  name  ex    ranges
1    A    A1  (3,4), (5,6)
2    B    B1  (12,18),(15,21),(17,24)

I'm a bit blocked with this and help would be appreciated. Thanks in advance!!

Upvotes: 1

Answers (2)

BENY

Reputation: 323326

This is what I will do

df['init'],df['end']=df.init.str.split(','),df.end.str.split(',')
df
Out[1915]:
   id name  ex                    init                   end
0   1    A  A1          [1, 3, 5, 7]        [2, 4, 6, 8]
1   2    B  B1  [10, 12, 15, 17, 20]  [15, 18, 21, 24, 32]
s=df.init.str.len()
s
Out[1917]:
0    4
1    5
Name: init, dtype: int64


NewDF=pd.DataFrame({'id':df.id.repeat(s),'name':df.name.repeat(s),'ex':df.ex.repeat(s),'init':df.init.sum(),'end':df.end.sum()})
s2=NewDF.merge(df2.set_index('id').stack().reset_index(),on='id').loc[lambda x : (x['end']>=x[0])&(x[0]>=x['init'])]
s2['New']=tuple(zip(s2.init,s2.end))
s2.groupby(['ex','id']).New.apply(list)
Out[1943]: 
ex  id
A1  1                             [(3, 4), (5, 6)]
B1  2     [(12, 18), (15, 21), (17, 24), (20, 32)]
Name: New, dtype: object

Upvotes: 0

javidcf

Reputation: 59731

Here is a possible solution:

import pandas as pd
df1 = pd.DataFrame({'id': [1, 2], 'name': ['A', 'B'], 'ex': ['A1', 'B1'], 'init': ['1,3,5,7,', '10,12,15,17,20,'], 'end':['2,4,6,8,', '15,18,21,24,32']}, columns=['id', 'name', 'ex', 'init', 'end'])
df2 = pd.DataFrame({'id': [1, 2], 'init': ['3', '16'], 'end':['6', '22']}, columns=['id', 'init', 'end'])

df1.set_index('id', inplace=True)
df2.set_index('id', inplace=True)

df = df1.copy()
df[['init2', 'end2']] = df2

# Per-row function
def proc_row(row):
    # Make list of int pairs
    intervs = zip(map(int, filter(None, row.init.split(','))), map(int, filter(None, row.end.split(','))))
    # Range object to check numbers are in interval - use xrange in Python 2
    r = range(int(row.init2), int(row.end2) + 1)
    # Discard pairs out of interval
    return list(filter(lambda interv: interv[0] in r or interv[1] in r, intervs))

# Process rows
df['ranges'] = df.apply(proc_row, axis=1)
# Drop unnecesary columns
for k in ['init', 'end', 'init2', 'end2']: del df[k]

print(df)

Output:

   name  ex                                    ranges
id
1     A  A1                          [(3, 4), (5, 6)]
2     B  B1  [(12, 18), (15, 21), (17, 24), (20, 32)]

Upvotes: 2

Pandas retrieve values based on columns in df1 that represents intervals from df2

Answers (2)

Related Questions