Chuck
Chuck

Reputation: 3852

How to join 2 dataframes, fill missing row values in R?

I have two dataframes:

df1

  ua ub a b c
1 11 12 1 1 0
2 12 13 2 2 2
3 13 14 1 1 1
4 14 15 1 1 1
5 15 16 1 1 1
6 16 17 1 1 1

df2

  ua ub a b c d e f
1 11 12 1 1 0 1 1 1
2 14 15 1 1 1 1 1 1
3 16 17 1 1 1 1 1 1
4 12 13 2 2 2 2 2 2

I want to join the two dataframes on ua and ub, and add on the columns in df2 not in df1 onto df1, and "fill in" the missing rows with a 0.

This would give:

  ua ub a b c d e f
1 11 12 1 1 0 1 1 1
2 12 13 2 2 2 2 2 2
3 13 14 1 1 1 0 0 0
4 14 15 1 1 1 1 1 1
5 15 16 1 1 1 0 0 0
6 16 17 1 1 1 1 1 1

I can get the column names in df2 not in df1 by doing:

setdiff(names(df2), names(df1))

But I'm a bit stuck on how to then merge / concat the remaining results.

I looked at this q here but no luck making it work for me.

In Pandas Python, I could use merge combined with .ffill() but I'm unsure how to translate this into R. How can I do this?

Thanks for your help.

Upvotes: 2

Views: 947

Answers (2)

AntoniosK
AntoniosK

Reputation: 16121

One possible solution is to use all column names that df1 and df2 have in common:

df1 = read.table(text = "
ua ub a b c
1 11 12 1 1 0
2 12 13 2 2 2
3 13 14 1 1 1
4 14 15 1 1 1
5 15 16 1 1 1
6 16 17 1 1 1
", header=T)

df2 = read.table(text = "
  ua ub a b c d e f
1 11 12 1 1 0 1 1 1
2 14 15 1 1 1 1 1 1
3 16 17 1 1 1 1 1 1
4 12 13 2 2 2 2 2 2
", header=T)

library(dplyr)

df1 %>%
  left_join(df2, by=c("ua", "ub", "a", "b", "c")) %>%
  mutate_all(function(x) ifelse(is.na(x), 0, x))

#   ua ub a b c d e f
# 1 11 12 1 1 0 1 1 1
# 2 12 13 2 2 2 2 2 2
# 3 13 14 1 1 1 0 0 0
# 4 14 15 1 1 1 1 1 1
# 5 15 16 1 1 1 0 0 0
# 6 16 17 1 1 1 1 1 1

Another way to do the above without manually specifying the common columns names is this

common_names = intersect(names(df1), names(df2))

df1 %>%
  left_join(df2, by=common_names) %>%
  mutate_all(function(x) ifelse(is.na(x), 0, x))

Otherwise, you can remove any columns from df2 that you don't want to join and have them twice in your final dataset and use by=c("ua", "ub") when you join:

names_to_use = c("ua", "ub", setdiff(names(df2), names(df1)))
df2_upd = df2[,names_to_use]

df1 %>%
  left_join(df2_upd, by=c("ua", "ub")) %>%
  mutate_all(function(x) ifelse(is.na(x), 0, x))

Upvotes: 1

Kushdesh
Kushdesh

Reputation: 1198

Assuming you want to retain all cases of df1 and not df2. Below could be used

library(dplyr)
df3 <- left_join(df1, df2)
df3[is.na(df3)] <- 0

if all cases of df1 as well as df2 is required then replace left_join with full join in above code

Upvotes: 2

Related Questions