GeoCat333
GeoCat333

Reputation: 79

R - replace values in dataframe based on two matching conditions

I'm working with lists of spatial data for 20+ different sites (difficult to reproduce here; sorry in advance). I have three data frames associated with each site; each has a 'sample_ID' column and some other shared columns names.

What I'm trying to do seems very simple: if the 'sample_ID' values match for two data frames and the column names match, replace the value in DF 1 with that of DF 2 and DF 3 three. Example:

# DF 1:
SAMPLE_ID  CLASS_ID  CLASS  VALUE
    1         0        0      5
    2         0        0      5
    3         0        0      3
    4         0        0      6
    5         0        0      6
    6         0        0      3

# DF 2
SAMPLE_ID  REF_VAL  CLASS_ID  CLASS
    1        33        2      cloud
    2        45        3      water
    3        NA        3      water
    4        NA        4      forest

# DF 3
SAMPLE_ID  CLASS_ID  CLASS  STRATA
    5         3       NA      20
    6         3      water    19

Desired output:

# DF 1:
SAMPLE_ID  CLASS_ID  CLASS  VALUE
    1         2      cloud    5
    2         3      water    5
    3         3      water    3
    4         4      forest   6
    5         3       NA      6
    6         3      water    3

All I can think to do is some sort of match indexing, like:

List1$CLASS_ID <- List2$CLASS_ID[match(List1$SAMPLE_ID, List2$SAMPLE_ID)
List1$CLASS_ID <- List3$CLASS_ID[match(List1$SAMPLE_ID, List3$SAMPLE_ID)

But this doesn't work; for one, it produces NAs in the nomatch values (attempted a nested match within the nomatch = but that didn't work either), but more importantly I really need to streamline this by referencing all the matching column names rather than going one at a time since the actual data has 10+ columns that need replacement. Also important, I need the blank NA values to transfer over as well.

Any thoughts?

Upvotes: 0

Views: 3941

Answers (3)

Robert Hijmans
Robert Hijmans

Reputation: 47536

With base R you can do:

vars <- c("SAMPLE_ID", "CLASS_ID", "CLASS")
dt23 <- rbind(dt2[, vars], dt3[, vars])
m <- merge(dt1[, c("SAMPLE_ID","VALUE")], dt23, by="SAMPLE_ID", all.x=TRUE)

Upvotes: 1

twedl
twedl

Reputation: 1648

You have a couple of options, depending on the rest of your application.

Join

You could select ahead of time the columns you'll be replacing, remove them from the original dataset, and dplyr::left_join the new data on:

df1 %>% select(-CLASS_ID, -CLASS) %>% 
        left_join(df2, by = "SAMPLE_ID") %>%
        left_join(df3, by = "SAMPLE_ID")

But if you want to keep values from the original CLASS and CLASS_ID, you can use left_join without removing them, and then use dplyr::coalesce to update the new columns based on the old columns. You might have to use mutate_at or mutate_if, which you can see descriptions here: http://dplyr.tidyverse.org/reference/summarise_all.html.

Or, your original idea:

The bit you were missing is that you want to select the matched IDs on both sides of the assignment. Also %in% usually works well in these cases:

df1[df1$SAMPLE_ID %in% df2$SAMPLE_ID, c("CLASS_ID", "CLASS")] <- df2[df1$SAMPLE_ID %in% df2$SAMPLE_ID, c("CLASS_ID", "CLASS")]

Upvotes: 0

jsta
jsta

Reputation: 3412

I would bind DT2 and DT3 then execute a join:

library(dplyr)

dt1 <- read.table(text = "
SAMPLE_ID  CLASS_ID  CLASS  VALUE
1         0        0      5
2         0        0      5
3         0        0      3
4         0        0      6
5         0        0      6
6         0        0      3
", header = TRUE, stringsAsFactors = FALSE)

dt2 <- read.table(text = "
SAMPLE_ID  REF_VAL  CLASS_ID  CLASS
1        33        2      cloud
2        45        3      water
3        NA        3      water
4        NA        4      forest
", header = TRUE, stringsAsFactors = FALSE)

dt3 <- read.table(text = "
SAMPLE_ID  CLASS_ID  CLASS  STRATA
5         3       NA      20
6         3      water    19
", header = TRUE, stringsAsFactors = FALSE)

dt <- dt1[,c("SAMPLE_ID", "VALUE")]
dt <- left_join(dt, dplyr::bind_rows(dt2, dt3))
dt <- select(dt, SAMPLE_ID, CLASS_ID, CLASS, VALUE)

  SAMPLE_ID CLASS_ID  CLASS VALUE
1         1        2  cloud     5
2         2        3  water     5
3         3        3  water     3
4         4        4 forest     6
5         5        3   <NA>     6
6         6        3  water     3

Upvotes: 0

Related Questions