Reputation: 199
I have a dataframe that consists of ~6000 columns. Each column contains a country name that represents a node in a graph. Countries within one row are connected via an edge.
This looks like this:
df <- data.frame(ID = c(1,2,3),
Country_1 = c("Germany", "Russia", "Germany"),
Country_2 = c(NA, "Germany", NA),
Country_n = c("China", "China", "China"))
ID Country_1 Country_2 ... Country_n
1 Germany NA ... China
2 Russia Germany ... China
3 Germany NA ... China
.
.
.
I want to create new columns that contain the interactions between countries. NAs should be ignored.
ID Ctr_Int_1 Ctr_Int_2 ... Ctr_Int_n
1 Germany-China NA ... NA
2 Russia-Germany Russia-China ... Germany-China
3 Germany-China NA ... NA
.
.
.
I can do this step by step [1] but NA are not ignored and with increasing numbers of columns it's not really feasible anymore.
library(tidyr)
library(dplyr)
# step by step
df <- df %>% unite(CountryInt_1, Country_1, Country_2, sep = "-", remove = FALSE)
df <- df %>% unite(CountryInt_2, Country_1, Country_n, sep = "-", remove = FALSE)
df <- df %>% unite(CountryInt_3, Country_2, Country_n, sep = "-", remove = FALSE)
# remove additional columns
country_names <- paste0("Country_", 1:3)
`%ni%` <- Negate(`%in%`)
df <- subset(df,select = names(df) %ni% country_names)
ID Ctr_Int_1 Ctr_Int_2 ... Ctr_Int_n
1 Germany-China Germany-NA ... China-NA
2 Russia-Germany Russia-China ... Germany-China
3 Germany-NA Germany-China ... China-NA
.
.
.
I assume there has to be a solution to this issue or a similar issue as it shouldn't really be unheard of but I wasn't able to find it. I guess there should be an approach using base::apply
and/or something similar to what has been done here [2 and here 3] - but I'm not that familiar with data.table
and wasn't able to implement it.
If someone could point me in the right direct that would certainly help.
Edit: Thanks to @NotThatKindOdr the issue with the NAs is resolved, however, the more pressing issue remains as it's not feasible to create the Country-to-Country combinations manually each time.
Upvotes: 1
Views: 365
Reputation: 25225
An option using data.table
:
library(data.table)
dcast(
melt(setDT(df), id.vars="ID", na.rm=TRUE)[,
combn(value, 2L, function(x) paste(x, collapse="-")), ID][,
ri := paste0("Ctr_Int_", rowid(ID))],
ID ~ ri, value.var="V1")
output:
ID Ctr_Int_1 Ctr_Int_2 Ctr_Int_3
1: 1 Germany-China <NA> <NA>
2: 2 Russia-Germany Russia-China Germany-China
3: 3 Germany-China <NA> <NA>
Upvotes: 4
Reputation: 719
It won't ignore the NA but it will replace any combined NA with NA itself
df %>% mutate_all(~ifelse(str_detect(., "NA"), NA, .))
Upvotes: 0