Reputation: 3994
I have a dataset datA
that looks as
ID LINE_ID Fromdate Todate UniqueID
1 1 1/1/2015 2/3/2015 11
1 2 3/7/2015 3/9/2015 12
2 1 2/2/2015 2/8/2015 21
3 1 1/3/2013 1/8/2013 31
And a datB
that looks like
ID LINE_ID Date UniqueID
1 1 1/1/2015 11
1 2 1/3/2015 12
1 3 2/2/2015 13
1 4 2/8/2015 14
1 5 3/8/2015 15
2 1 2/2/2015 21
2 2 2/3/2015 22
2 3 2/7/2015 23
2 4 2/8/2015 24
3 1 1/3/2013 31
3 2 1/7/2013 32
3 3 1/8/2013 33
3 4 1/9/2013 34
What I want to do is find some way to either combine both datasets or find some way to tag which UniqueID
in datA
belongs to what line in datB
. For example, using the two datasets above, I would like to have datC
look like datB
with an added column for which UniqueID
in datA
corresponds.
ID LINE_ID Date UniqueID UniqueID.A
1 1 1/1/2015 11 11
1 2 1/3/2015 12 11
1 3 2/2/2015 13 11
1 4 2/8/2015 14 NA
1 5 3/8/2015 15 12
2 1 2/2/2015 21 21
2 2 2/3/2015 22 21
2 3 2/7/2015 23 21
2 4 2/8/2015 24 21
3 1 1/3/2013 31 31
3 2 1/7/2013 32 31
3 3 1/8/2013 33 31
3 4 1/9/2013 34 NA
As one can see, the new column added to datB
is which UniqueID
in datA
does the date range in datB
fall within.
Does anyone have any guidance on how to do this in R?
Upvotes: 0
Views: 78
Reputation: 109
I wrote a function that you could use to merge pandas dataframes based on ranges: https://github.com/Chrisebell24/util/blob/master/custom_merge
import numpy as np
import pandas as pd
def merge_range(left, right, left_val, right_low, right_high):
'''
Parameters
----------
left : pandas dataframe
Left DataFrame that you are joining onto
right : pandas dataframe
Right DataFrame that has the ranged values
left_val : list
list of columns from left that contains numeric values
right_low : list
list of columns that you are matching to left_val
right_high : list
list of columns that you are matching to left_val
Returns
-------
Pandas dataframe
Example
-------
import numpy as np
np.random.seed(69)
df1, df2 = pd.DataFrame(), pd.DataFrame()
df1['LETTERS'] = ['a', 'b','c','d','e','f', 'g']
df1['RAND'] = np.random.randint(1,len(df1)*10, len(df1))
df2['COLOR'] = ['red', 'orange','yellow','green','blue','purple','brown']
df2['LOW'] = [i*10 for i in range(0, len(df2))]
df2['HIGH'] = [i*10 for i in range(1, 1+len(df2))]
merge_range(df1, df2, left_val=['RAND'], right_low=['LOW'], right_high= ['HIGH'])
LETTERS RAND COLOR LOW HIGH
0 a 55 purple 50 60
1 b 56 purple 50 60
2 c 21 yellow 20 30
3 d 50 purple 50 60
4 e 23 yellow 20 30
5 f 10 orange 10 20
6 g 57 purple 50 60
'''
true_args = True
for lv, rl, rh in zip(left_val, right_low, right_high):
a = left[lv].values
bl = right[rl].values
bh = right[rh].values
true_args *= (a[:, None] >= bl) & (a[:, None] < bh)
i, j = np.where(true_args)
df = pd.DataFrame(
np.column_stack([left.values[i], right.values[j]]),
columns=left.columns.append(right.columns)
)
if len(df.columns) - len(set(df.columns)) > 0:
print('Duplicate columns')
return df
Upvotes: 0
Reputation: 28441
To help with the implementation. Be sure that columns are the correct types and keys are set:
#Create a start and end columns formatted to dates
setDT(datB)[, ':='(c("start", "end"), as.IDate(Date, format="%m/%d/%Y"))]
#Format columns to dates
setDT(datA)[, ':='(c("Fromdate", "Todate"),
lapply(.SD, as.IDate, format="%m/%d/%Y")),
.SDcols=c("Fromdate", "Todate")]
#Set keys for matching intervals on
setkey(datA, Fromdate, Todate)
setkey(datB, start, end)
#Match on intervals
foverlaps(datB, datA, type="within")
Upvotes: 1