Prradep

Reputation: 5716

Post-processing of full_join output to remove multiplicity

I have two data frames (df1, df2) and performed a full_join on the common column of interest, col1.

df1 <- data.frame(col1=c('A','D','C','C','E','E','I'),col2=c(4,7,8,3,2,4,9))
df2 <- data.frame(col1=c('A','A','B','C','C','E','E','I'),col2=c(4,1,6,8,3,2,1,9))

library(dplyr)
df1 %>% full_join(df2, by = "col1")
#    col1 col2.x col2.y
# 1     A      4      4
# 2     A      4      1
# 3     D      7     NA
# 4     C      8      8
# 5     C      8      3
# 6     C      3      8
# 7     C      3      3
# 8     E      2      2
# 9     E      2      1
# 10    E      4      2
# 11    E      4      1
# 12    I      9      9
# 13    B     NA      6

As expected, full_join returns every combination of matching rows for each value of the joining column, and I wish to avoid this multiplicity. I would like to arrive at the following output. What kind of post-processing approaches do you suggest?

#    col1 col2.x col2.y
# 1     A      4      4
# 2     A     NA      1
# 3     D      7     NA
# 4     C      8      8
# 5     C      3      3
# 6     E      2      2
# 7     E      4      1
# 8     I      9      9
# 9     B     NA      6

More information:

Case 1: When both input objects hold the same two values for a key, I do not need four rows in the output:

# 4     C      8      8
# 5     C      8      3
# 6     C      3      8
# 7     C      3      3

instead, I want only two as:

# 4     C      8      8
# 5     C      3      3

Case 2: Similarly, when the values differ, I want them paired on the same row rather than cross-combined:

# 8     E      2      2
# 9     E      2      1
# 10    E      4      2
# 11    E      4      1

instead, I want only two rows as below:

# 8     E      2      2
# 9     E      4      1

Upvotes: 1

Views: 85

Answers (1)

Jaap

Reputation: 83245

A possible solution in 2 steps using the data.table package:

0) load the package and convert to data.tables

library(data.table)
setDT(df1)
setDT(df2)

1) define helper function

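unlistSD <- function(x) {
  # x is .SD for one group: a list of list-columns
  l <- length(x)
  # number of values stored in each list-column for this group
  ls <- sapply(x, lengths)
  m <- max(ls)
  newSD <- vector(mode = "list", length = l)
  for (i in seq_len(l)) {
    u <- unlist(x[[i]])
    lu <- length(u)
    # pad shorter columns with NA so that all columns have length m
    if (lu < m) {
      u <- c(u, rep(NA_real_, m - lu))
    }
    newSD[[i]] <- u
  }
  return(setNames(as.list(newSD), names(x)))
}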
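
To see what the helper does for a single group, the padding step can be sketched in isolation with base R only. The input below is a hand-built stand-in for what .SD might look like for col1 == "A" after the merge (one value on the df1 side, two on the df2 side); the name pad_to is hypothetical and not part of the answer's code.

```r
# Hypothetical helper: extend a vector with NA up to length m
pad_to <- function(v, m) c(v, rep(NA_real_, m - length(v)))

# Hand-built stand-in for one group: each column holds one vector in a list
x <- list(col2.x = list(4), col2.y = list(c(4, 1)))

# Longest vector in this group decides the number of output rows
m <- max(sapply(x, lengths))

# Unlist each column and pad the shorter one with NA
padded <- lapply(x, function(col) pad_to(unlist(col), m))
padded
```

Here the df1 side gets one NA appended, so both columns end up with two entries and can be bound into two rows for "A".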

2) merge and apply helper function

merge(df1[, .(col2 = list(col2)), by = col1],
      df2[, .(col2 = list(col2)), by = col1],
      by = "col1", all = TRUE
      )[, unlistSD(.SD), by = col1]

which gives the following result:

   col1 col2.x col2.y
1:    A      4      4
2:    A     NA      1
3:    C      8      8
4:    C      3      3
5:    D      7     NA
6:    E      2      2
7:    E      4      1
8:    I      9      9
9:    B     NA      6

Another possibility with base R:

unlistDF <- function(d, groupcols) {
  ds <- split(d[, setdiff(names(d), groupcols)], d[,groupcols])
  ls <- lapply(ds, function(x) max(sapply(x, lengths)))
  dl <- lapply(ds, function(x) lapply(as.list(x), unlist))
  du <- Map(function(x, y) {
    lapply(x, function(i) {
      if(length(i) < y) {
        c(i, rep(NA_real_, y - length(i)))
      } else i
    })
    
  }, x = dl, y = ls)
  ld <- lapply(du, as.data.frame)
  cbind(d[rep(1:nrow(d), ls), groupcols, drop = FALSE],
        do.call(rbind.data.frame, c(ld, make.row.names = FALSE)),
        row.names = NULL)
}

Now you can use this function as follows in combination with merge:

df <- merge(aggregate(col2 ~ col1, df1, as.list),
            aggregate(col2 ~ col1, df2, as.list),
            by = "col1", all = TRUE)
unlistDF(df, "col1")
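
Since the desired output pairs values by their position within each col1 group, a different route that may also work is to number the rows within each group and merge on both the key and that number. This is a minimal sketch of that alternative, not the method used above; the helper name add_id and the temporary column id are assumptions introduced for illustration.

```r
df1 <- data.frame(col1 = c('A','D','C','C','E','E','I'), col2 = c(4,7,8,3,2,4,9))
df2 <- data.frame(col1 = c('A','A','B','C','C','E','E','I'), col2 = c(4,1,6,8,3,2,1,9))

# Hypothetical helper: number the rows within each col1 group (1, 2, ...)
add_id <- function(d) {
  d$id <- ave(seq_along(d$col1), d$col1, FUN = seq_along)
  d
}

# Full outer merge on key + within-group position, then drop the helper column
res <- merge(add_id(df1), add_id(df2), by = c("col1", "id"), all = TRUE)
res$id <- NULL
res
```

Because the k-th row of a group in df1 can only match the k-th row of the same group in df2, no cross-combinations are produced, and unmatched positions are filled with NA by all = TRUE. Row order may differ from the output shown in the question.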

Upvotes: 2
