Sarah Hays
Sarah Hays

Reputation: 33

How to take data from one dataframe and copy it into existing columns in another dataframe based on the shared ID of a third column

So what I have is two different data frames: the one I've been working on (df1) and the one with all the new data I need to put in the first one (df2). Df1 has several columns of zeroes, waiting for the data to be added in. Df2 has the data I need, and several more rows and columns that I don't care about beyond that data. Here is a small subset of the type of data I'm working with.

This is my first time posting my data so I hope I'm doing it right. Let me know if you need a different format.

df1:

structure(list(season = c(" FA15", " FA15", " FA15", " FA15", 
" FA15", " FA15", " FA15", " FA15", " FA15", " FA15"), year = c("2015", 
"2015", "2015", "2015", "2015", "2015", "2015", "2015", "2015", 
"2015"), territory.name = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), plot = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), color.band = c("APGBY", 
"APGGU", "APGPW", "APGPW", "APGR", "APGUO", "APGUO", "APGUO", 
"APGUO", "APGYR")), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))

df2:

structure(list(bandnum = c(157328052, 160379101, 157328094, 151313455, 
170364680, 160379104, 151373458, 157328066, 160379103, 160379105
), project = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L), .Label = c("*ISSJ", "ISSJ"), class = "factor"), color.band = c("PAWR", 
"WYWAR", "APGP", "APGO", "ABYG", "URYAR", "APBW", "WABG", "OBWAR", 
"GBGAR"), sex = structure(c(3L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 
2L), .Label = c("?", "F", "M"), class = "factor"), age = structure(c(2L, 
1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L), .Label = c("AHY", "ASY", 
"HY", "N", "SY"), class = "factor")), row.names = c(NA, 10L), class = "data.frame")

I've been chewing on this problem for a few days, trying different things and reading so many answers on stack overflow, but I'm failing to come up with a clear answer on how to take data from one dataframe and copy it into existing columns in another dataframe based on the shared ID of a third column.

Pretty much, I want r to see that both data frames have in the color.band column a listing for the band ABCDEF, and then take the value from df2$bandnum in the same row as ABCDEF and copy it to df1$bandnum in the ABCDEF row there.

I don't want to copy rows that are in df2 but not df1 into df1. I want to mark entries that exist in df1 but not df2 as N/A in the bandnum column.

Column names and data format for color band and band number have been standardized between the two data frames so everything should line up. What I have so far with code is this:

> practicedf <- left_join(x=df1, y=df2, by = "color.band", all.x = TRUE) 
%>% mutate(y = ifelse(is.na(df1$color.band), df1$bandnum, df1$color.band)) %>% select(df2$bandnum)

left_join seems to be the right one because it keeps all rows in the left (df1) data frame and only matching rows from the right (df2) data frame. I get this error though:

Error in `[[<-.data.frame`(`*tmp*`, col, value = c("APGBY", "APGGU", "APGPW", : 
replacement has 1261 rows, data has 2559

color.band is a character vector while bandnum is numerical, is that a problem? What could be the problem here?


Edit: I had an error with having the column bandnum in both dataframes so I changed df2$bandnum to bandnum.y. My code is now

df1_test <- left_join(x=df1, y=df2, by = "color.band") %>% mutate(y =
 ifelse(is.na(color.band), bandnum, color.band)) %>% select(bandnum.y) 

but when I view(df1_test) it only shows me the column bandnum.y and it's not the same number of entries as my original df1

Here's a subset of df1_test (not the whole thing because it's 2600 entries)

Any way I can make it show the rest of my data as well?

structure(list(bandnum.y = c("171324972", "171324972", "171324972", 
"178324697", "178324697", "178324697", "178324697", "178324697", 
"178324697", "178324697", "170364505", "170364505", "170364505", 
"170364505", "170364505", "170364505", NA, "178324692", "178324692", 
"178324692")), row.names = c(NA, -20L), class = c("tbl_df", "tbl", 
"data.frame"))

Upvotes: 2

Views: 1162

Answers (3)

Billy Shaw
Billy Shaw

Reputation: 39

If I don't get you wrong, you want to update an old df (df1) with information from a new df (df2). In data.table, you can try this:

libraty(data.table)
setDT(df1)
setDT(df2)
update.vars = intersect(names(df1), names(df2))  # update only common variables

df1[df2, c(update.vars) := df2[,update.vars, with=FALSE], on= 'color.band'] 

Generally this should work. But in the given data the 'merge' ids (color.band column) are not unique, which may affect the results.

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389325

left_join does not have all.x = TRUE that is part of base R merge.

You could do the following in base R :

df1_test <- transform(merge(df1, df2, by = "color.band", all.x = TRUE), 
                      y = ifelse(is.na(color.band), bandnum, color.band))

Upvotes: 1

akrun
akrun

Reputation: 887951

We cannot use the original dataset 'df1' columns after the join becuase it is a left_join. In tidyverse, we specify the unquoted column names. There is no all.x argument in left_join. It should be from merge

library(dplyr)
left_join(x=df1, y=df2, by = "color.band") %>% 
     mutate(y = ifelse(is.na(color.band), bandnum, color.band)) 

Upvotes: 1

Related Questions