marcinus
marcinus

Reputation: 145

R: Subset data frame based on multiple values for multiple variables

I need to pull records from a first data set (called df1 here) based on a combination of specific dates, ID#s, event start time, and event end time that match with a second data set (df2). Everything works fine when there is just 1 date, ID, and event start and end time, but some of the matching records between the data sets contain multiple IDs, dates, or times, and I can't get the records from df1 to subset properly in those cases. I ultimately want to put this in a FOR loop or independent function since I have a rather large dataset. Here's what I've got so far:

I started just by matching the dates between the two data sets as follows:

match_dates <- as.character(intersect(df1$Date, df2$Date))

Then I selected the records in df2 based on the first matching date, also keeping the other columns so I have the other ID and time information I need:

records <- df2[which(df2$Date == match_dates[1]), ]

The date, ID, start, and end time from records are then:

[1] "01-04-2009" "599091"     "12:00"      "17:21" 

Finally I subset df1 for before and after the event based on the date, ID, and times in records and combined them into a new data frame called final to get at the data contained in df1 that I ultimately need.

before <- subset(df1, NUM==records$ID & Date==records$Date & Time<records$Start)
after <- subset(df1, NUM==records$ID & Date==records$Date & Time>records$End)
final <- rbind(before, after)

Here's the real problem - some of the matching dates have more than 1 corresponding row in df2, and return multiple IDs or times. Here is what an example of multiple records looks like:

records <- df2[which(df2$Date == match_dates[25]), ]

> records$ID
[1] 507646 680845 680845
> records$Date
[1] "04-02-2009" "04-02-2009" "04-02-2009"
> records$Start
[1] "09:43" "05:37" "11:59"
> records$End
[1] "05:19" "11:29" "16:47"

When I try to subset df1 based on this I get an error:

before <- subset(df1, NUM==records$ID & Date==records$Date & Time<records$Start)
Warning messages:
1: In NUM == records$ID :
  longer object length is not a multiple of shorter object length
2: In Date == records$Date :
  longer object length is not a multiple of shorter object length
3: In Time < records$Start :
  longer object length is not a multiple of shorter object length

Trying to do it manually for each ID-date-time combination would be way to tedious. I have 9 years worth of data, all with multiple matching dates for a given year between the data sets, so ideally I would like to set this up as a FOR loop, or a function with a FOR loop in it, but I can't get past this. Thanks in advance for any tips!

Upvotes: 4

Views: 5394

Answers (1)

NGaffney
NGaffney

Reputation: 1532

If you're asking what I think you are the filter() function from the dplyr package combined with the match function does what you're looking for.

> df1 <- data.frame(A = c(rep(1,4),rep(2,4),rep(3,4)), B = c(rep(1:4,3)))
> df1
   A B
1  1 1
2  1 2
3  1 3
4  1 4
5  2 1
6  2 2
7  2 3
8  2 4
9  3 1
10 3 2
11 3 3
12 3 4
> df2 <- data.frame(A = c(1,2), B = c(3,4))
> df2
  A B
1 1 3
2 2 4
> filter(df1, A %in% df2$A, B %in% df2$B)
  A B
1 1 3
2 1 4
3 2 3
4 2 4

Upvotes: 1

Related Questions