Reputation: 19
I have two dataframes, looking at houses (n=6) and certain dates (n=22).
ORIGINAL is the original dataset. It contains 38 observations on 5 variables. Not all houses have all the dates listed, and vice versa, leading to errors in calculations with different length variables.
SAMPLE is a new empty dataset. It contains 132 (6 x 22) observations on the same 5 variables. Now there is an observation for every household for every date.
House Day Mongoose Fruit Elephant
A 1 40 7 0.6
A 6 32 12 4.2
B 2 50 3 4.0
B 4 51 4 8.6
B 6 8 7 12.1
C 2 12 8 13.0
I am trying to fill in the rest of SAMPLE by asking R to compare HouseID and Date between the two dataframes; if they match, the rest of the variables (mongoose, fruit, elephant) should be copied over for that observation.
I tried this to no avail...
for(i in 1:nrow(original))
{
if ((sample$Day == original$Day) && (sample$House == original$House))
{
sample$Mongoose[i] <- original$Mongoose[i]
sample$Fruit[i] <- original$Fruit[i]
sample$Elephant[i] <- original$Elephant[i]
}
}
The following results:
I get the following 3 errors in sequence
In sample$Day == test$Day : longer object length is not a multiple of shorter object length
In is.na(e1) | is.na(e2) :longer object length is not a multiple of
shorter object length
==.default
(sample$House, test$House) :longer object length isThe data DOES copy over, but incorrectly. All the values get transferred to the A house and sequential date, rather than the appropriate house and date.
I.e., it looks like this
House Day Mongoose Fruit Elephant
A 1 40 7 0.6
A 2 50 3 4.0
A 3 51 4 8.6
A 4 8 7 12.1
A 5 12 8 13.0
A 6 32 12 4.2
B 1
B 2
B 3 [...]
When it should (in essence) look like this:
House Day Mongoose Fruit Elephant
A 1 40 7 0.6
A 2
A 3
A 4
A 5
A 6 32 12 4.2 [rest of A houses have no data]
B 1
B 2 50 3 4.0
B 3
B 4 51 4 8.6
B 5
B 6 8 7 12.1 [rest of B houses have no data]
C 1
C 2 12 8 13.0
Please advise; I will eventually have to extend this technique to look at a sample dataset with 198K entries, and a test dataset with 115K.
Thanks!
Upvotes: 0
Views: 2598
Reputation: 2046
It could be a small tweak, look at this this line of your original code:
if ((sample$Day == original$Day) && (sample$House == original$House))
See if you can change it to this:
if ((sample$Day[i] == original$Day[i]) && (sample$House[i] == original$House[i]))
Because:
for
loop with an i
variable, sample$Mongoose[i] <- original$Mongoose[i]
if
statement is not actually making use of the i
variable i
so it will be comparing specifically that observation/rows's sample$Day
with that observation/rows's original$Day
, and the same for sample$House
vs original$House
Upvotes: 0
Reputation: 20483
Sounds to me like this should work:
merge(sample, original, by = c("House", "Day"), all.x = TRUE)
But hard to tell without a reproducible example. You may also want to look into dplyr::left_join()
. That is, assuming your data looks like the following:
sample <- data.frame(House = rep(c("A", "B", "C"), each = 6),
Day = rep(1:6, 3))
# head(sample)
# House Day
# 1 A 1
# 2 A 2
# 3 A 3
# 4 A 4
# 5 A 5
# 6 A 6
original <- data.frame(House = c("A", "A", "B", "B", "C"),
Day = c(1, 6, 2, 4, 2),
Mongoose = c(40, 32, 50, 51, 8),
Fruit = c(7, 12, 3, 4, 8),
Elephant = c(0.6, 4.2, 4.0, 8.6, 12.1))
# head(original)
# House Day Mongoose Fruit Elephant
# 1 A 1 40 7 0.6
# 2 A 6 32 12 4.2
# 3 B 2 50 3 4.0
# 4 B 4 51 4 8.6
# 5 C 2 8 8 12.1
We obtain:
# head(merge(sample, original, by = c("House", "Day"), all.x = TRUE))
# House Day Mongoose Fruit Elephant
# 1 A 1 40 7 0.6
# 2 A 2 NA NA NA
# 3 A 3 NA NA NA
# 4 A 4 NA NA NA
# 5 A 5 NA NA NA
# 6 A 6 32 12 4.2
Upvotes: 1