Reputation: 79
I'm working with lists of spatial data for 20+ different sites (difficult to reproduce here; sorry in advance). I have three data frames associated with each site; each has a 'sample_ID' column and some other shared columns names.
What I'm trying to do seems very simple: if the 'sample_ID' values match for two data frames and the column names match, replace the value in DF 1 with that of DF 2 and DF 3 three. Example:
# DF 1:
SAMPLE_ID CLASS_ID CLASS VALUE
1 0 0 5
2 0 0 5
3 0 0 3
4 0 0 6
5 0 0 6
6 0 0 3
# DF 2
SAMPLE_ID REF_VAL CLASS_ID CLASS
1 33 2 cloud
2 45 3 water
3 NA 3 water
4 NA 4 forest
# DF 3
SAMPLE_ID CLASS_ID CLASS STRATA
5 3 NA 20
6 3 water 19
Desired output:
# DF 1:
SAMPLE_ID CLASS_ID CLASS VALUE
1 2 cloud 5
2 3 water 5
3 3 water 3
4 4 forest 6
5 3 NA 6
6 3 water 3
All I can think to do is some sort of match
indexing, like:
List1$CLASS_ID <- List2$CLASS_ID[match(List1$SAMPLE_ID, List2$SAMPLE_ID)
List1$CLASS_ID <- List3$CLASS_ID[match(List1$SAMPLE_ID, List3$SAMPLE_ID)
But this doesn't work; for one, it produces NAs in the nomatch
values (attempted a nested match
within the nomatch =
but that didn't work either), but more importantly I really need to streamline this by referencing all the matching column names rather than going one at a time since the actual data has 10+ columns that need replacement. Also important, I need the blank NA values to transfer over as well.
Any thoughts?
Upvotes: 0
Views: 3941
Reputation: 47536
With base R you can do:
vars <- c("SAMPLE_ID", "CLASS_ID", "CLASS")
dt23 <- rbind(dt2[, vars], dt3[, vars])
m <- merge(dt1[, c("SAMPLE_ID","VALUE")], dt23, by="SAMPLE_ID", all.x=TRUE)
Upvotes: 1
Reputation: 1648
You have a couple of options, depending on the rest of your application.
You could select ahead of time the columns you'll be replacing, remove them from the original dataset, and dplyr::left_join
the new data on:
df1 %>% select(-CLASS_ID, -CLASS) %>%
left_join(df2, by = "SAMPLE_ID") %>%
left_join(df3, by = "SAMPLE_ID")
But if you want to keep values from the original CLASS
and CLASS_ID
, you can use left_join
without removing them, and then use dplyr::coalesce
to update the new columns based on the old columns. You might have to use mutate_at
or mutate_if
, which you can see descriptions here: http://dplyr.tidyverse.org/reference/summarise_all.html.
The bit you were missing is that you want to select the matched IDs on both sides of the assignment. Also %in%
usually works well in these cases:
df1[df1$SAMPLE_ID %in% df2$SAMPLE_ID, c("CLASS_ID", "CLASS")] <- df2[df1$SAMPLE_ID %in% df2$SAMPLE_ID, c("CLASS_ID", "CLASS")]
Upvotes: 0
Reputation: 3412
I would bind DT2
and DT3
then execute a join:
library(dplyr)
dt1 <- read.table(text = "
SAMPLE_ID CLASS_ID CLASS VALUE
1 0 0 5
2 0 0 5
3 0 0 3
4 0 0 6
5 0 0 6
6 0 0 3
", header = TRUE, stringsAsFactors = FALSE)
dt2 <- read.table(text = "
SAMPLE_ID REF_VAL CLASS_ID CLASS
1 33 2 cloud
2 45 3 water
3 NA 3 water
4 NA 4 forest
", header = TRUE, stringsAsFactors = FALSE)
dt3 <- read.table(text = "
SAMPLE_ID CLASS_ID CLASS STRATA
5 3 NA 20
6 3 water 19
", header = TRUE, stringsAsFactors = FALSE)
dt <- dt1[,c("SAMPLE_ID", "VALUE")]
dt <- left_join(dt, dplyr::bind_rows(dt2, dt3))
dt <- select(dt, SAMPLE_ID, CLASS_ID, CLASS, VALUE)
SAMPLE_ID CLASS_ID CLASS VALUE
1 1 2 cloud 5
2 2 3 water 5
3 3 3 water 3
4 4 4 forest 6
5 5 3 <NA> 6
6 6 3 water 3
Upvotes: 0