Reputation: 623
I wish to merge two data sets. The first data set contains information CO2 concentrations in an instrument at specific times. The second data set contains information on when a measurement was started and ended.
I would like to combine the two dataframes so that times from data set 1 that are within the time interval (start_date_time to end_date_time) in data set 2 end up on the same row. That is not very well explained, so purhaps it is easier to look at the data below.
I'm new to R so, any help is much appreciated.
The data looks similar to what is shown below.
Data set 1
co2 date_time
0.2 2012-05-01 10:23:20
0.2 2012-05-01 10:23:23
0.1 2012-05-01 10:23:26
0.6 2012-05-01 10:23:29
0.1 2012-05-01 10:23:32
0.1 2012-05-01 10:23:35
0.6 2012-05-01 10:23:38
0.6 2012-05-01 10:23:41
0.2 2012-05-01 10:23:44
0.2 2012-05-01 10:23:47
0.3 2012-05-01 10:23:50
0.3 2012-05-01 10:23:53
0.4 2012-05-01 10:23:56
0.4 2012-05-01 10:23:59
0.3 2012-05-01 10:24:02
0.3 2012-05-01 10:24:05
0.4 2012-05-01 10:24:08
0.4 2012-05-01 10:24:11
0.6 2012-05-01 10:24:14
0.6 2012-05-01 10:24:17
0.7 2012-05-01 10:24:20
0.7 2012-05-01 10:24:23
0.3 2012-05-01 10:24:26
0.7 2012-05-01 10:24:29
0.3 2012-05-01 10:24:32
0.4 2012-05-01 10:24:35
0.2 2012-05-01 10:24:38
0.3 2012-05-01 10:24:41
0.3 2012-05-01 10:24:44
0.3 2012-05-01 10:24:47
0.4 2012-05-01 10:24:50
0.1 2012-05-01 10:24:53
0.1 2012-05-01 10:24:56
Data set 2
jar start_date_time end_date_time
1 2012-05-01 10:23:18 2012-05-01 10:23:30
2 2012-05-01 10:23:31 2012-05-01 10:23:40
3 2012-05-01 10:23:41 2012-05-01 10:23:52
4 2012-05-01 10:23:53 2012-05-01 10:24:01
5 2012-05-01 10:24:02 2012-05-01 10:24:12
6 2012-05-01 10:24:13 2012-05-01 10:24:20
7 2012-05-01 10:24:30 2012-05-01 10:23:40
8 2012-05-01 10:24:41 2012-05-01 10:24:52
Expected data set
jar start_date_time end_date_time co2 date_time
1 05-01-12 10:23:18 05-01-12 10:23:30 0.2 05-01-12 10:23:20
1 05-01-12 10:23:18 05-01-12 10:23:30 0.2 05-01-12 10:23:23
1 05-01-12 10:23:18 05-01-12 10:23:30 0.1 05-01-12 10:23:26
1 05-01-12 10:23:18 05-01-12 10:23:30 0.6 05-01-12 10:23:29
2 05-01-12 10:23:31 05-01-12 10:23:40 0.1 05-01-12 10:23:32
2 05-01-12 10:23:31 05-01-12 10:23:40 0.1 05-01-12 10:23:35
2 05-01-12 10:23:31 05-01-12 10:23:40 0.6 05-01-12 10:23:38
3 05-01-12 10:23:41 05-01-12 10:23:52 0.6 05-01-12 10:23:41
3 05-01-12 10:23:41 05-01-12 10:23:52 0.2 05-01-12 10:23:44
3 05-01-12 10:23:41 05-01-12 10:23:52 0.2 05-01-12 10:23:47
3 05-01-12 10:23:41 05-01-12 10:23:52 0.3 05-01-12 10:23:50
4 05-01-12 10:23:53 05-01-12 10:24:01 0.3 05-01-12 10:23:53
4 05-01-12 10:23:53 05-01-12 10:24:01 0.4 05-01-12 10:23:56
4 05-01-12 10:23:53 05-01-12 10:24:01 0.4 05-01-12 10:23:59
5 05-01-12 10:24:02 05-01-12 10:24:12 0.3 05-01-12 10:24:02
5 05-01-12 10:24:02 05-01-12 10:24:12 0.3 05-01-12 10:24:05
5 05-01-12 10:24:02 05-01-12 10:24:12 0.4 05-01-12 10:24:08
5 05-01-12 10:24:02 05-01-12 10:24:12 0.4 05-01-12 10:24:11
6 05-01-12 10:24:13 05-01-12 10:24:20 0.6 05-01-12 10:24:14
6 05-01-12 10:24:13 05-01-12 10:24:20 0.6 05-01-12 10:24:17
6 05-01-12 10:24:13 05-01-12 10:24:20 0.7 05-01-12 10:24:20
7 05-01-12 10:24:30 05-01-12 10:24:40 0.3 05-01-12 10:24:32
7 05-01-12 10:24:30 05-01-12 10:24:40 0.4 05-01-12 10:24:35
7 05-01-12 10:24:30 05-01-12 10:24:40 0.2 05-01-12 10:24:38
8 05-01-12 10:24:41 05-01-12 10:24:50 0.3 05-01-12 10:24:41
8 05-01-12 10:24:41 05-01-12 10:24:50 0.3 05-01-12 10:24:44
8 05-01-12 10:24:41 05-01-12 10:24:50 0.3 05-01-12 10:24:47
8 05-01-12 10:24:41 05-01-12 10:24:50 0.4 05-01-12 10:24:50
So far I have tried the following:
sqldf("select * from df1 inner join df2 on (df1.date_time between df2.start_date_time and df2.end_date_time)")
R message: Error in match.fun(asfn) : 'c("as.hms", "as.difftime")' is not a function, character or symbol
and
sqldf("select * from df1 date_time inner join df2 start_date_time on (df1.date_time > df2.start_date_time and d1.date_time <= df2.end_date_time)")
R message: Error in result_create(conn@ptr, statement) : no such column: ch4.new_date_time
and
do.call(rbind, apply(df1, 1, function(x) {
if(length(idx <- which(x["date_time"] >= df2$start_date_time & x["date_time"] <= df2$end_date_time)) > 0) {
cbind(rbind(x), df2[idx,])
}
}))
and as suggested by AEF
crossing(df1, df2) %>%
filter(date_time >= start_date_time, date_time < end_date_time)
R message:
[1] campaign date_time ch4_ppm co2_ppm jar_camp
[6] jar campaign1 year month day
[11] date bvoc start date_time bvoc end date_time bvoc start time bvoc end time
[16] start_date_time end_date_time ch4 start time ch4 end time o2 start date_time
[21] o2 end date_time o2 start time o2 end time comments
<0 rows> (or 0-lengt row.names)
Upvotes: 0
Views: 513
Reputation: 5650
I think you can do the following:
Read your data:
library(tidyverse)
library(lubridate)
ds1 <-
str_replace_all(" co2_conc. date_time
0.2 2012-05-01 10:23:20
0.2 2012-05-01 10:23:23
0.1 2012-05-01 10:23:26
0.6 2012-05-01 10:23:29
0.1 2012-05-01 10:23:32
0.1 2012-05-01 10:23:35
0.6 2012-05-01 10:23:38
0.6 2012-05-01 10:23:41
0.2 2012-05-01 10:23:44
0.2 2012-05-01 10:23:47
0.3 2012-05-01 10:23:50
0.3 2012-05-01 10:23:53
0.4 2012-05-01 10:23:56
0.4 2012-05-01 10:23:59
0.3 2012-05-01 10:24:02
0.3 2012-05-01 10:24:05
0.4 2012-05-01 10:24:08
0.4 2012-05-01 10:24:11
0.6 2012-05-01 10:24:14
0.6 2012-05-01 10:24:17
0.7 2012-05-01 10:24:20
0.7 2012-05-01 10:24:23
0.3 2012-05-01 10:24:26
0.7 2012-05-01 10:24:29
0.3 2012-05-01 10:24:32
0.4 2012-05-01 10:24:35
0.2 2012-05-01 10:24:38
0.3 2012-05-01 10:24:41
0.3 2012-05-01 10:24:44
0.3 2012-05-01 10:24:47
0.4 2012-05-01 10:24:50
0.1 2012-05-01 10:24:53
0.1 2012-05-01 10:24:56", " {2,}", ";") %>%
read_delim(delim = ";", trim_ws = TRUE)
ds2 <-
str_replace_all("jar_no start_date_time end_date_time
1 2012-05-01 10:23:18 2012-05-01 10:23:30
2 2012-05-01 10:23:31 2012-05-01 10:23:40
3 2012-05-01 10:23:41 2012-05-01 10:23:52
4 2012-05-01 10:23:53 2012-05-01 10:24:01
5 2012-05-01 10:24:02 2012-05-01 10:24:12
6 2012-05-01 10:24:13 2012-05-01 10:24:20
7 2012-05-01 10:24:30 2012-05-01 10:23:40
8 2012-05-01 10:24:41 2012-05-01 10:24:52", " {2,}", ";") %>%
read_delim(delim = ";", trim_ws = TRUE)
The actual computation: Compute the cartesian product of the rows and filter
crossing(ds1, ds2) %>%
filter(date_time >= start_date_time,
date_time < end_date_time)
However, the result is not exactly the same you posted as "expected result". I'm not sure if I misunderstood or if your example is incorrect. E.g. I cannot understand where the 4th line of the expected result comes from. There seem to be no corresponding start/end dates in dataset 2.
Upvotes: 2