Nate
Nate

Reputation: 59

Is there a way to replace rows in one dataframe with another in R?

I'm trying to figure out how to replace rows in one dataframe with another by matching the values of one of the columns. Both dataframes have the same column names.

Ex:

df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"))

df2 <- data.frame(x = c(1,2), y = c("f", "g"))

Is there a way to replace the rows of df1 with the same row in df2 where they share the same x variable? It would look like this.

data.frame(x = c(1,2,3,4), y = c("f","g","c","d")

I've been working on this for a while and this is the closest I've gotten -

df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y

But it just replaces the values with NA.

Does anyone know how to do this?

Upvotes: 5

Views: 1757

Answers (4)

MrGumble
MrGumble

Reputation: 5766

First off, well done in producing a nice reproducible example that's directly copy-pastable. That always helps, specially with an example of expected output. Nice one!

You have several options, but lets look at why your solution doesn't quite work:

First of all, I tried copy-pasting your last line into a new session and got the dreaded factor-error:

Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = 1:2) :
  invalid factor level, NA generated

If we look at your data frames df1 and df2 with the str function, you will see that they do not contain text but factors. These are not text - in short they represent categorical data (male vs. female, scores A, B, C, D, and F, etc.) and are really integers that have a text as label. So that could be your issue.

Running your code gives a warning because you are trying to import new factors (labels) into df1 that don't exist. And R doesn't know what to do with them, so it just inserts NA-values.

As r2evens answered, he used the stringsAsFactors to disable using strings as Factors - you can even go as far as disabling it on a session-wide basis using options(stringsAsFactors=FALSE) (and I've heard it will be disabled as default in forthcoming R4.0 - yay!).

After disabling stringsAsFactors, your code works - or does it? Try this on for size:

df2 <- df2[c(2,1),]
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y

What's in df1 now? Not quite right anymore.

In the first line, I swapped the two rows in df2 and lo and behold, the replaced values in df1 were swapped. Why is that?

Let's deconstruct your statement df2[which(df1$x %in% df2$x),]$y

Call df1$x %in% df2$x returns a logical vector (boolean) of which elements in df1$x are found ind df2 - i.e. the first two and not the second two. But it doesn't relate which positions in the first vector corresponds to which in the second.

Calling which(df1$x %in% df2$x) then reduces the logical vector to which indices were TRUE. Again, we do not now which elements correspond to which.

For solutions, I would recommend r2evans, as it doesn't rely on extra packages (although data.table or dplyr are two powerful packages to get to know).

In his solution, he uses merge to perform a "full join" which matches rows based on the value, rather than - well, what you did. With transform, he assigns new variables within the context of the data.frame returned from the merge function called in the first argument.

Upvotes: 3

Ronak Shah
Ronak Shah

Reputation: 388982

We can use match. :

inds <- match(df1$x, df2$x)
df1$y[!is.na(inds)] <- df2$y[na.omit(inds)]
df1

#  x y
#1 1 f
#2 2 g
#3 3 c
#4 4 d

Upvotes: 3

akrun
akrun

Reputation: 887153

A join option with data.table where we join on the 'x' column, assign the values of 'y' in second dataset (i.y) to the first one with :=

library(data.table)
setDT(df1)[df2, y := i.y, on = .(x)]

NOTE: It is better to use stringsAsFactors = FALSE (in R 4.0.0 - it is by default though) or else we need to have all the levels common in both datasets

Upvotes: 2

r2evans
r2evans

Reputation: 160447

I think what you need here is a "merge" or "join" operation.

(I add stringsAsFactors=FALSE to the frames so that the merging and later work is without any issue, as factors can be disruptive sometimes.)

Base R:

df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
# df2 <- data.frame(x = c(1,2), y = c("f", "g"), stringsAsFactors = FALSE)
merge(df1, df2, by = "x", all = TRUE)
#   x y.x  y.y
# 1 1   a    f
# 2 2   b    g
# 3 3   c <NA>
# 4 4   d <NA>
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y))
#   x y.x  y.y y
# 1 1   a    f f
# 2 2   b    g g
# 3 3   c <NA> c
# 4 4   d <NA> d
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y), y.x = NULL, y.y = NULL)
#   x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d

Dplyr:

library(dplyr)
full_join(df1, df2, by = "x") %>%
  mutate(y = coalesce(y.y, y.x)) %>%
  select(-y.x, -y.y)
#   x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d

Upvotes: 2

Related Questions