Reputation: 59
I'm trying to figure out how to replace rows in one dataframe with another by matching the values of one of the columns. Both dataframes have the same column names.
Ex:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"))
df2 <- data.frame(x = c(1,2), y = c("f", "g"))
Is there a way to replace the rows of df1 with the same row in df2 where they share the same x variable? It would look like this.
data.frame(x = c(1,2,3,4), y = c("f","g","c","d")
I've been working on this for a while and this is the closest I've gotten -
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
But it just replaces the values with NA.
Does anyone know how to do this?
Upvotes: 5
Views: 1757
Reputation: 5766
First off, well done in producing a nice reproducible example that's directly copy-pastable. That always helps, specially with an example of expected output. Nice one!
You have several options, but lets look at why your solution doesn't quite work:
First of all, I tried copy-pasting your last line into a new session and got the dreaded factor-error:
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = 1:2) :
invalid factor level, NA generated
If we look at your data frames df1
and df2
with the str
function, you will see that they do not contain text but factors. These are not text - in short they represent categorical data (male vs. female, scores A, B, C, D, and F, etc.) and are really integers that have a text as label. So that could be your issue.
Running your code gives a warning because you are trying to import new factors (labels) into df1
that don't exist. And R doesn't know what to do with them, so it just inserts NA
-values.
As r2evens answered, he used the stringsAsFactors
to disable using strings as Factors - you can even go as far as disabling it on a session-wide basis using options(stringsAsFactors=FALSE)
(and I've heard it will be disabled as default in forthcoming R4.0 - yay!).
After disabling stringsAsFactors
, your code works - or does it? Try this on for size:
df2 <- df2[c(2,1),]
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
What's in df1
now? Not quite right anymore.
In the first line, I swapped the two rows in df2
and lo and behold, the replaced values in df1
were swapped. Why is that?
Let's deconstruct your statement df2[which(df1$x %in% df2$x),]$y
Call df1$x %in% df2$x
returns a logical vector (boolean) of which elements in df1$x
are found ind df2
- i.e. the first two and not the second two. But it doesn't relate which positions in the first vector corresponds to which in the second.
Calling which(df1$x %in% df2$x)
then reduces the logical vector to which indices were TRUE
. Again, we do not now which elements correspond to which.
For solutions, I would recommend r2evans, as it doesn't rely on extra packages (although data.table
or dplyr
are two powerful packages to get to know).
In his solution, he uses merge
to perform a "full join" which matches rows based on the value, rather than - well, what you did. With transform
, he assigns new variables within the context of the data.frame returned from the merge
function called in the first argument.
Upvotes: 3
Reputation: 388982
We can use match
. :
inds <- match(df1$x, df2$x)
df1$y[!is.na(inds)] <- df2$y[na.omit(inds)]
df1
# x y
#1 1 f
#2 2 g
#3 3 c
#4 4 d
Upvotes: 3
Reputation: 887153
A join option with data.table
where we join on
the 'x' column, assign the values of 'y' in second dataset (i.y
) to the first one with :=
library(data.table)
setDT(df1)[df2, y := i.y, on = .(x)]
NOTE: It is better to use stringsAsFactors = FALSE
(in R 4.0.0
- it is by default though) or else we need to have all the levels
common in both datasets
Upvotes: 2
Reputation: 160447
I think what you need here is a "merge" or "join" operation.
(I add stringsAsFactors=FALSE
to the frames so that the merging and later work is without any issue, as factor
s can be disruptive sometimes.)
Base R:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
# df2 <- data.frame(x = c(1,2), y = c("f", "g"), stringsAsFactors = FALSE)
merge(df1, df2, by = "x", all = TRUE)
# x y.x y.y
# 1 1 a f
# 2 2 b g
# 3 3 c <NA>
# 4 4 d <NA>
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y))
# x y.x y.y y
# 1 1 a f f
# 2 2 b g g
# 3 3 c <NA> c
# 4 4 d <NA> d
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y), y.x = NULL, y.y = NULL)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
Dplyr:
library(dplyr)
full_join(df1, df2, by = "x") %>%
mutate(y = coalesce(y.y, y.x)) %>%
select(-y.x, -y.y)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
Upvotes: 2