Reputation: 8107
Suppose I have the two following data sets:
library(tidyverse)
iris1 <- iris %>%
as_tibble() %>%
mutate(id = row_number(),
Petal.Length = ifelse(id > 75, NA, Petal.Length),
Petal.Width = ifelse(id > 75, NA, Petal.Width))
iris2 <- iris %>%
as_tibble() %>%
mutate(id = row_number()) %>%
filter(id > 75) %>%
select(id, Petal.Length, Petal.Width)
Here, I've set iris2
to be data that is missing from iris1
. I would like to join the data in iris2
back into iris1
, using id
as the key. I could use dplyr::full_join
and join by id
, but I would have duplicated the variables Petal.Length
and Petal.Width
when in fact I want the values in iris2
integrated into their respective variables in iris1
.
Is there a more elegant solution than doing the below? The example here is with 2 variables (Petal.Length
and Petal.Width
), but in my actual dataset this impacts over a hundred variables, and I'd have to think there's a better way than by manual labour.
left_join(iris1, iris2, by = "id") %>%
mutate(Petal.Length = ifelse(is.na(Petal.Length.x), Petal.Length.y, Petal.Length.x),
Petal.Width = ifelse(is.na(Petal.Width.x), Petal.Width.y, Petal.Width.x)) %>%
select(-Petal.Length.x, -Petal.Length.y, -Petal.Width.x, -Petal.Width.y)
Upvotes: 0
Views: 55
Reputation: 8107
Another potential solution, but this assumes that there are no NA cases prior to the manipulation. Basically, join the two data files, turn them to long, remove the ".x" and ".y" from the variable names, and remove NA rows, and then spread the data back to its original shape:
iris1 %>%
left_join(iris2, by = "id") %>%
gather(key = key, value = value, -id) %>%
mutate(key = str_remove(key, "\\.[xy]$")) %>%
filter(!is.na(value)) %>%
spread(key = key, value = value)
Upvotes: 0
Reputation: 11150
I reduced your data set to 10 rows. I haven't thought this through but this could work.
Here are the steps - 1) added a dataframe identifier field tablename
2) did a union using bind_rows
3) grouped by id
and got the first non-NA value for each column using summarize_all()
. This means that you need to bind tables in order of preference.
iris1 <- head(iris, 10) %>%
as_tibble() %>%
mutate(id = row_number(),
Petal.Length = ifelse(id > 7, NA, Petal.Length),
Petal.Width = ifelse(id > 7, NA, Petal.Width),
tablename = "table1"
)
# A tibble: 10 x 7
Sepal.Length Sepal.Width Petal.Length Petal.Width Species id tablename
<dbl> <dbl> <dbl> <dbl> <fct> <int> <chr>
1 5.10 3.50 1.40 0.200 setosa 1 table1
2 4.90 3.00 1.40 0.200 setosa 2 table1
3 4.70 3.20 1.30 0.200 setosa 3 table1
4 4.60 3.10 1.50 0.200 setosa 4 table1
5 5.00 3.60 1.40 0.200 setosa 5 table1
6 5.40 3.90 1.70 0.400 setosa 6 table1
7 4.60 3.40 1.40 0.300 setosa 7 table1
8 5.00 3.40 NA NA setosa 8 table1
9 4.40 2.90 NA NA setosa 9 table1
10 4.90 3.10 NA NA setosa 10 table1
iris2 <- head(iris, 10) %>%
as_tibble() %>%
mutate(id = row_number(), tablename = "table2") %>%
filter(id > 7) %>%
select(id, Petal.Length, Petal.Width, tablename)
# A tibble: 3 x 4
id Petal.Length Petal.Width tablename
<int> <dbl> <dbl> <chr>
1 8 1.50 0.200 table2
2 9 1.40 0.200 table2
3 10 1.50 0.100 table2
combined <- bind_rows(iris1, iris2) %>%
group_by(id) %>%
summarize_all(function(x) x[!is.na(x)][1])
# A tibble: 10 x 7
# Groups: id [10]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species id tablename
<dbl> <dbl> <dbl> <dbl> <fct> <int> <chr>
1 5.10 3.50 1.40 0.200 setosa 1 table1
2 4.90 3.00 1.40 0.200 setosa 2 table1
3 4.70 3.20 1.30 0.200 setosa 3 table1
4 4.60 3.10 1.50 0.200 setosa 4 table1
5 5.00 3.60 1.40 0.200 setosa 5 table1
6 5.40 3.90 1.70 0.400 setosa 6 table1
7 4.60 3.40 1.40 0.300 setosa 7 table1
8 5.00 3.40 1.50 0.200 setosa 8 table1
9 4.40 2.90 1.40 0.200 setosa 9 table1
10 4.90 3.10 1.50 0.100 setosa 10 table1
Upvotes: 1