Seb
Seb

Reputation: 199

How to create new columns based on pairwise combinations of all existing columns in R?

I have a dataframe that consists of ~6000 columns. Each column contains a country name that represents a node in a graph. Countries within one row are connected via an edge.

This looks like this:

df <- data.frame(ID = c(1,2,3), 
                  Country_1 = c("Germany", "Russia", "Germany"),
                  Country_2 = c(NA, "Germany", NA),
                  Country_n = c("China", "China", "China"))
ID Country_1 Country_2 ... Country_n 
1  Germany   NA        ... China     
2  Russia    Germany   ... China     
3  Germany   NA        ... China     
.
.
.

Aspired Outcome

I want to create new columns that contain the interactions between countries. NAs should be ignored.

ID  Ctr_Int_1       Ctr_Int_2    ...   Ctr_Int_n 
1  Germany-China        NA       ...      NA
2  Russia-Germany   Russia-China ...  Germany-China   
3  Germany-China        NA       ...      NA  
.
.
.

Code

I can do this step by step [1] but NA are not ignored and with increasing numbers of columns it's not really feasible anymore.

library(tidyr)
library(dplyr)

# step by step
df <- df %>% unite(CountryInt_1, Country_1, Country_2, sep = "-", remove = FALSE)
df <- df %>% unite(CountryInt_2, Country_1, Country_n, sep = "-", remove = FALSE)
df <- df %>% unite(CountryInt_3, Country_2, Country_n, sep = "-", remove = FALSE)

# remove additional columns
country_names <- paste0("Country_", 1:3)

`%ni%` <- Negate(`%in%`)
df <- subset(df,select = names(df) %ni% country_names)
ID  Ctr_Int_1       Ctr_Int_2     ...    Ctr_Int_n 
1  Germany-China    Germany-NA    ...    China-NA
2  Russia-Germany   Russia-China  ...    Germany-China   
3  Germany-NA       Germany-China ...    China-NA  
.
.
.

I assume there has to be a solution to this issue or a similar issue as it shouldn't really be unheard of but I wasn't able to find it. I guess there should be an approach using base::apply and/or something similar to what has been done here [2 and here 3] - but I'm not that familiar with data.table and wasn't able to implement it.

If someone could point me in the right direct that would certainly help.

Edit: Thanks to @NotThatKindOdr the issue with the NAs is resolved, however, the more pressing issue remains as it's not feasible to create the Country-to-Country combinations manually each time.

Upvotes: 1

Views: 365

Answers (2)

chinsoon12
chinsoon12

Reputation: 25225

An option using data.table:

library(data.table)
dcast(
    melt(setDT(df), id.vars="ID", na.rm=TRUE)[, 
        combn(value, 2L, function(x) paste(x, collapse="-")), ID][,
            ri := paste0("Ctr_Int_", rowid(ID))],
    ID ~ ri, value.var="V1")

output:

   ID      Ctr_Int_1    Ctr_Int_2     Ctr_Int_3
1:  1  Germany-China         <NA>          <NA>
2:  2 Russia-Germany Russia-China Germany-China
3:  3  Germany-China         <NA>          <NA>

Upvotes: 4

NotThatKindODr
NotThatKindODr

Reputation: 719

It won't ignore the NA but it will replace any combined NA with NA itself

df %>% mutate_all(~ifelse(str_detect(., "NA"), NA, .))

Upvotes: 0

Related Questions