eregon

Reputation: 1516

How to merge factors when binding two dataframes together?

Here is a fairly minimal reproducible example. The real dataset is larger and has many factors, so listing the factors manually is not practical. There are also more interesting transformations on the data, for which I want to keep using dplyr.

library(dplyr)
a = data.frame(f=factor(c("a", "b")), g=c("a", "a"))
b = data.frame(f=factor(c("a", "c")), g=c("a", "a"))
a = a %>% group_by(g) %>% mutate(n=1)
b = b %>% group_by(g) %>% mutate(n=2)
rbind(a,b)

This produces:

# A tibble: 4 x 3
# Groups:   g [1]
      f      g     n
  <chr> <fctr> <dbl>
1     a      a     1
2     b      a     1
3     a      a     2
4     c      a     2
Warning messages:
1: In bind_rows_(x, .id) : Unequal factor levels: coercing to character
2: In bind_rows_(x, .id) :
  binding character and factor vector, coercing into character vector
3: In bind_rows_(x, .id) :
  binding character and factor vector, coercing into character vector

These warnings are annoying, and would actually disappear if I did not use the group_by:

> a = data.frame(f=factor(c("a", "b")), g=c("a", "a"))
> b = data.frame(f=factor(c("a", "c")), g=c("a", "a"))
> a = a %>% mutate(n=1)
> b = b %>% mutate(n=2)
> rbind(a,b)
  f g n
1 a a 1
2 b a 1
3 a a 2
4 c a 2

Explicitly converting to data.frame just before rbind also works:

> rbind(data.frame(a),data.frame(b))
  f g n
1 a a 1
2 b a 1
3 a a 2
4 c a 2

Is there an easy way with base R or dplyr rbind/bind_rows to automatically merge those factors and their levels instead of converting them to character (which makes little sense to me), while still using dplyr for data transformations?

I found https://stackoverflow.com/a/30468468/388803 which proposes a solution to merge the factors manually, but this is very verbose.

My actual use case is loading two .csv files with read.table, doing some data transformations, and then merging the data, as they are complementary. My current workaround is to call data.frame(data) at the end of the data transformations. I wonder why dplyr/tibble does not automatically merge factors, as it seems safe in such a situation. Is this something that could be improved?

Upvotes: 2

Views: 3340

Answers (3)

camille

Reputation: 16832

I came across this question while figuring out a similar task. Using forcats::lvls_union, you can get a character vector of all the levels in a list of factors—in this case, a$f and b$f. Then you can use forcats::fct_expand to expand each data frame's f to have that union of factor levels.

library(tidyverse)

a <- data.frame(f = factor(c("a", "b")), g = c("a")) %>%
  mutate(n = 1) %>%
  group_by(g)

b <- data.frame(f = factor(c("a", "c")), g = c("a")) %>%
  mutate(n = 2) %>%
  group_by(g)

all_lvls <- lvls_union(list(a$f, b$f))

After getting the union of levels, you can mutate both data frames and call bind_rows:

bind_rows(
  a %>% mutate(f = fct_expand(f, all_lvls)),
  b %>% mutate(f = fct_expand(f, all_lvls))
)
#> # A tibble: 4 x 3
#> # Groups:   g [1]
#>   f     g         n
#>   <fct> <fct> <dbl>
#> 1 a     a         1
#> 2 b     a         1
#> 3 a     a         2
#> 4 c     a         2

Or, to get the same result, you can map over a list of the two data frames to expand each f. Using map_dfr is a shorthand, like calling map, then piping into bind_rows.

map_dfr(list(a, b), ~mutate(., f = fct_expand(f, all_lvls)))
#> # A tibble: 4 x 3
#> # Groups:   g [1]
#>   f     g         n
#>   <fct> <fct> <dbl>
#> 1 a     a         1
#> 2 b     a         1
#> 3 a     a         2
#> 4 c     a         2

Created on 2018-07-17 by the reprex package (v0.2.0).
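
Since the question mentions many factor columns, the same idea can be wrapped in a loop over columns. The following is only a sketch, not part of dplyr or forcats: `unify_factor_levels` is a hypothetical helper name, and it assumes that only columns that are factors in every data frame should be touched. It is shown on ungrouped data frames.

```r
library(dplyr)
library(forcats)

# Hypothetical helper (not part of dplyr or forcats): for every column that
# is a factor in all data frames, expand it to the union of its levels.
unify_factor_levels <- function(dfs) {
  shared <- Reduce(intersect, lapply(dfs, names))
  is_factor_everywhere <- function(col) {
    all(vapply(dfs, function(d) is.factor(d[[col]]), logical(1)))
  }
  for (col in Filter(is_factor_everywhere, shared)) {
    # union of the levels of this column across all data frames
    lvls <- lvls_union(lapply(dfs, function(d) d[[col]]))
    dfs <- lapply(dfs, function(d) {
      d[[col]] <- fct_expand(d[[col]], lvls)
      d
    })
  }
  dfs
}

# Two factor columns (f and h) whose levels differ between the data frames
x <- data.frame(f = factor(c("a", "b")), h = factor(c("u", "v")), n = 1)
y <- data.frame(f = factor(c("a", "c")), h = factor(c("v", "w")), n = 2)

combined <- bind_rows(unify_factor_levels(list(x, y)))
levels(combined$f)
levels(combined$h)
```

As far as I know, dplyr 1.0.0 and later combine differing factor levels in bind_rows on their own, so a helper like this should mainly matter on older versions.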

Upvotes: 4

Thomas Wutzler

Reputation: 255

If the factors are just an efficient way to store strings, you can convert them to strings before binding and convert them back to factors afterwards:

bind_rowsFactors <- function(
  ### bind_rows on two data.frames, merging the levels of factor columns
  a      ##<< first data.frame to bind
  , b    ##<< second data.frame to bind
  , ...  ##<< further arguments to \code{bind_rows}
){
  # flag columns present in both inputs that are a factor in at least one
  # of them and whose levels differ
  isInconsistentFactor <- vapply(names(a), function(col){
    col %in% names(b) &&
      (is.factor(a[[col]]) || is.factor(b[[col]])) &&
      !identical(levels(a[[col]]), levels(b[[col]]))
  }, logical(1))
  inconsistentCols <- names(a)[isInconsistentFactor]
  if (length(inconsistentCols)) warning(
    "releveling factors ", paste(inconsistentCols, collapse = ","))
  # convert the flagged columns to character so bind_rows has nothing to coerce
  for (col in inconsistentCols) {
    a <- mutate(ungroup(a), !!col := as.character(!!rlang::sym(col)))
    b <- mutate(ungroup(b), !!col := as.character(!!rlang::sym(col)))
  }
  ans <- bind_rows(a, b, ...)
  # convert the former factors from string back to factor
  for (col in inconsistentCols) {
    ans <- mutate(ungroup(ans), !!col := factor(!!rlang::sym(col)))
  }
  ##value<< result of \code{bind_rows} with inconsistent factor columns still factors
  ans
}

library(dplyr)
a = data.frame(f = factor(c("a", "b")), g = c("a", "a"))
b = data.frame(f = factor(c("a", "c")), g = c("a", "a"))
a = a %>% group_by(g) %>% mutate(n = 1)
b = b %>% group_by(g) %>% mutate(n = 2)
#bind_rows(a,b)
bind_rowsFactors(a,b)

The strange !!rlang::sym(col) notation is a workaround for dplyr's non-standard evaluation: it turns the column name stored in the string col into a symbol that mutate can evaluate.

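The above code issues a warning about releveling the factor f, but otherwise returns the bound data.frame with column f still being a factor. In this example the result is ungrouped, because the function calls ungroup() before converting the column.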

# A tibble: 4 x 3
  f     g         n
  <fct> <fct> <dbl>
1 a     a        1.
2 b     a        1.
3 a     a        2.
4 c     a        2.
Warning message:
In bind_rowsFactors(a, b) : releveling factors f
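
If the `!!rlang::sym(col)` idiom looks too cryptic, there are other ways to refer to a column whose name is stored in a string. Below is a small sketch comparing three variants on a toy data frame; `df` and `col` are made up for illustration, and the `across()` variant assumes dplyr 1.0.0 or later.

```r
library(dplyr)

df <- data.frame(f = factor(c("a", "b")), n = 1:2)
col <- "f"  # column name held in a string, as in the loops of bind_rowsFactors

# Variant used in the function: turn the string into a symbol and unquote it
via_sym <- mutate(df, !!col := as.character(!!rlang::sym(col)))

# Variant with the .data pronoun: look the column up by name
via_pronoun <- mutate(df, !!col := as.character(.data[[col]]))

# Variant with across(): no unquoting needed (assumes dplyr >= 1.0.0)
via_across <- mutate(df, across(all_of(col), as.character))

# All three should yield the same character column
identical(via_sym$f, via_pronoun$f)
identical(via_sym$f, via_across$f)
```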

Upvotes: 2

pogibas

Reputation: 28329

Solution using data.table.
Convert your data.frame into a data.table with setDT (which converts in place) and add n using := (no need for dplyr).

a <- data.frame(f=factor(c("a", "b")), g=c("a", "a"))
b <- data.frame(f=factor(c("a", "c")), g=c("a", "a"))
library(data.table)
rbind(setDT(a)[, n := 1], 
      setDT(b)[, n := 2])
   f g n
1: a a 1
2: b a 1
3: a a 2
4: c a 2
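
To check that f really stays a factor with the merged levels, you can inspect the result. This is a small sketch on the same data; it uses as.data.table() instead of setDT() so that the original data.frames are copied rather than converted in place, and `res` is just a name chosen here.

```r
library(data.table)

a <- data.frame(f = factor(c("a", "b")), g = c("a", "a"))
b <- data.frame(f = factor(c("a", "c")), g = c("a", "a"))

# as.data.table() copies, so a and b remain plain data.frames
res <- rbind(as.data.table(a)[, n := 1],
             as.data.table(b)[, n := 2])

is.factor(res$f)  # is f still a factor?
levels(res$f)     # levels after binding
```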

Upvotes: 3
