Dale Kube
Dale Kube

Reputation: 1460

Performing a dplyr full_join without a common variable to blend data frames

Using the dplyr full_join() operation, I am trying to perform the equivalent of a basic merge() operation in which no common variables exist (unable to satisfy the "by=" argument). This will blend two data frames and return all possible combinations.

However, the current full_join() function requires a common variable. I am unable to locate another dplyr function that can help with this. How can I perform this operation using functions specific to the dplyr library?

df_a = data.frame(department=c(1,2,3,4))
df_b = data.frame(period=c(2014,2015,2016,2017))

#This works as desired
big_df = merge(df_a,df_b)

#I'd like to perform the following in a much bigger operation:
big_df = dplyr::full_join(df_a,df_b)

#Error: No common variables. Please specify `by` param.

Upvotes: 15

Views: 10758

Answers (2)

dipetkov
dipetkov

Reputation: 3700

If there are duplicate rows, crossing doesn't give the same result as merge.

Instead use full_join with by = character() to perform a cross-join which generates all combinations of df_a and df_b.

library("tidyverse") # version 1.3.2

# Add duplicate rows for illustration.
df_a <- tibble(department = c(1, 2, 3, 3))
df_b <- tibble(period = c(2014, 2015, 2016, 2017))

merge doesn't de-duplicate.

df_a_merge_b <- merge(df_a, df_b)
df_a_merge_b
#>    department period
#> 1           1   2014
#> 2           2   2014
#> 3           3   2014
#> 4           3   2014
#> 5           1   2015
#> 6           2   2015
#> 7           3   2015
#> 8           3   2015
#> 9           1   2016
#> 10          2   2016
#> 11          3   2016
#> 12          3   2016
#> 13          1   2017
#> 14          2   2017
#> 15          3   2017
#> 16          3   2017

crossing drops duplicate rows.

df_a_crossing_b <- crossing(df_a, df_b)
df_a_crossing_b
#> # A tibble: 12 × 2
#>    department period
#>         <dbl>  <dbl>
#>  1          1   2014
#>  2          1   2015
#>  3          1   2016
#>  4          1   2017
#>  5          2   2014
#>  6          2   2015
#>  7          2   2016
#>  8          2   2017
#>  9          3   2014
#> 10          3   2015
#> 11          3   2016
#> 12          3   2017

full_join doesn't remove duplicates either.

df_a_full_join_b <- full_join(df_a, df_b, by = character())
df_a_full_join_b
#> # A tibble: 16 × 2
#>    department period
#>         <dbl>  <dbl>
#>  1          1   2014
#>  2          1   2015
#>  3          1   2016
#>  4          1   2017
#>  5          2   2014
#>  6          2   2015
#>  7          2   2016
#>  8          2   2017
#>  9          3   2014
#> 10          3   2015
#> 11          3   2016
#> 12          3   2017
#> 13          3   2014
#> 14          3   2015
#> 15          3   2016
#> 16          3   2017

packageVersion("tidyverse")
#> [1] '1.3.2'

Created on 2023-01-13 with reprex v2.0.2

Upvotes: 0

Pierre Lapointe
Pierre Lapointe

Reputation: 16287

You can use crossing from tidyr:

crossing(df_a,df_b)

   department period
1           1   2014
2           1   2015
3           1   2016
4           1   2017
5           2   2014
6           2   2015
7           2   2016
8           2   2017
9           3   2014
10          3   2015
11          3   2016
12          3   2017
13          4   2014
14          4   2015
15          4   2016
16          4   2017

Upvotes: 40

Related Questions