Reputation: 5861
I have two dataframes that I would like to relate.
The first dataframe (LOC) has about two hundred animals studied over 5 years, so about 100000 point locations. I need to know what pack each animal was in during the study duration, but unfortunately, animals change packs every so often. I need each point location to have a pack name associated with it since I will base further analyses on the pack, not the individual.
The second dataframe (PACK) has pack affiliations for these three hundred animals, and each row represents an entry day and exit day into a pack. My real data is a table with about 700 rows, and each row represents how long that animal stayed in the pack. For example in the PACK dataframe below, animal "W1" was in SunnyLake in the first day of year "A", then on day 200 left and went to RainyLake and stayed there until the end of year "B" (a leap year).
Since I have so many animals and so many switches among years, I want to know a way to tell R which days the animal belonged to SunnyLake and RainyLake in the LOC table.
So far my approach has been the following, but it is slow. I'm pretty new to R, so I thought there must be a faster, more elegant way to do this. If you see a better solution, please let me know (and keep in mind I just started working in R a few months ago)!
My example data:
Location table
LOC <- data.frame(matrix(NA, nrow = 8, ncol = 4))
colnames(LOC) <- c("ID", "Yr", "Dy", "Pack")
LOC$ID <- "W1"
LOC$Yr <- rep(c("A", "B"), each = 4)
LOC$Dy <- c(4, 200, 300, 335, 3, 100, 150, 350)
LOC
ID Yr Dy Pack
1 W1 A 4 NA
2 W1 A 200 NA
3 W1 A 300 NA
4 W1 A 335 NA
5 W1 B 3 NA
6 W1 B 100 NA
7 W1 B 150 NA
8 W1 B 350 NA
Pack Table
PACK <- data.frame(matrix(nrow = 3, ncol = 5))
colnames(PACK) <- c("ID", "Pack", "EnterDay", "ExitDay", "Yr")
PACK$ID <- "W1"
PACK$Pack <- c("SunnyLake", "RainyLake", "RainyLake")
PACK$EnterDay <- c(1, 201, 1)
PACK$ExitDay <- c(200, 365, 366)
PACK$Yr <- c("A", "A", "B")
PACK
ID Pack EnterDay ExitDay Yr
1 W1 SunnyLake 1 200 A
2 W1 RainyLake 201 365 A
3 W1 RainyLake 1 366 B
My way of steamrolling through each row in PACK and filling the "Pack" column in LOC
for (i in 1:nrow(PACK)){
cat("LOC$Pack[LOC$ID == \"", as.character(PACK$ID[i]),"\" & LOC$Yr == \"", as.character(PACK$Yr[i]),"\" & LOC$Dy >= ", PACK$EnterDay[i], " & LOC$Dy <= ", PACK$ExitDay[i],"] <- \"", as.character(PACK$Pack[i]),"\"\n", sep="")
}
That command prints the following, which I paste back into the console and run.
LOC$Pack[LOC$ID == "W1" & LOC$Yr == "A" & LOC$Dy >= 1 & LOC$Dy <= 200] <- "SunnyLake"
LOC$Pack[LOC$ID == "W1" & LOC$Yr == "A" & LOC$Dy >= 201 & LOC$Dy <= 365] <- "RainyLake"
LOC$Pack[LOC$ID == "W1" & LOC$Yr == "B" & LOC$Dy >= 1 & LOC$Dy <= 366] <- "RainyLake"
The resulting LOC table looks like this:
LOC
ID Yr Dy Pack
1 W1 A 4 SunnyLake
2 W1 A 200 SunnyLake
3 W1 A 300 RainyLake
4 W1 A 335 RainyLake
5 W1 B 3 RainyLake
6 W1 B 100 RainyLake
7 W1 B 150 RainyLake
8 W1 B 350 RainyLake
Upvotes: 2
Views: 732
Reputation: 269644
1) sqldf Try this:
library(sqldf)
sqldf("select L.ID, L.Yr, L.Dy, P.Pack from LOC L left join PACK P
on L.Yr = P.Yr and L.ID = P.ID and L.Dy between P.EnterDay and P.ExitDay")
giving:
ID Yr Dy Pack
1 W1 A 4 SunnyLake
2 W1 A 200 SunnyLake
3 W1 A 300 RainyLake
4 W1 A 335 RainyLake
5 W1 B 3 RainyLake
6 W1 B 100 RainyLake
7 W1 B 150 RainyLake
8 W1 B 350 RainyLake
2) dplyr or
library(dplyr)
left_join(LOC, PACK, by = c("ID", "Yr")) %.%
filter((Dy >= EnterDay & Dy <= ExitDay) | is.na(Pack.y)) %.%
select(ID:Dy, Pack.y)
giving
ID Yr Dy Pack.y
1 W1 A 4 SunnyLake
2 W1 A 200 SunnyLake
3 W1 A 300 RainyLake
4 W1 A 335 RainyLake
5 W1 B 3 RainyLake
6 W1 B 100 RainyLake
7 W1 B 150 RainyLake
8 W1 B 350 RainyLake
ADDED: Added second solution and improved both. Fixed dplyr solution to not drop LOC
rows witih no PACK
row.
Upvotes: 3
Reputation: 59355
First, don't create the Pack
column in LOC
at the beginning; it's not necessary.
LOC <- data.frame(matrix(NA, nrow = 8, ncol = 3))
colnames(LOC) <- c("ID", "Yr", "Dy") # NOTE: No Pack column
LOC$ID <- "W1"
LOC$Yr <- rep(c("A", "B"), each = 4)
LOC$Dy <- c(4, 200, 300, 335, 3, 100, 150, 350)
Here's a method that uses data tables, which is likely to be much faster with large datasets.
library(data.table)
LOC <- data.table(LOC, key="ID,Yr")
PACK <- data.table(PACK, key="ID,Yr")
LOC$Pack <-LOC[PACK,all=T][Dy>=EnterDay & Dy<=ExitDay,Pack]
LOC
# ID Yr Dy Pack
# 1 W1 A 4 SunnyLake
# 2 W1 A 200 SunnyLake
# 3 W1 A 300 RainyLake
# 4 W1 A 335 RainyLake
# 5 W1 B 3 RainyLake
# 6 W1 B 100 RainyLake
# 7 W1 B 150 RainyLake
# 8 W1 B 350 RainyLake
And here's a method that uses data frames (and does not require sqldf
).
M <- merge(LOC,PACK,by=c("ID","Yr"))
is.between <- function(x,low,hi)return(x>=low & x<=hi)
LOC$Pack <- with(M,M[is.between(Dy,EnterDay, ExitDay),]$Pack)
Upvotes: 2