Tom
Tom

Reputation: 2341

Coalesce columns with capitalised and non capitalised versions of variables names without specifying the variable names

I have a (large) data frame as follows:

library(data.table)
DT <- fread(
"ID country year A b B a
4   NLD   2002  NA   1   NA   0
5   NLD   2002  NA   0   NA   1
6   NLD   2006  NA   1   NA   1
7   NLD   2006  NA   0   NA   0
8   NLD   2006  0   NA   0   NA
9   GBR   2002  0   NA   0   NA
10  GBR   2002  0   NA   0   NA
11  GBR   2002  0   NA   0   NA
12  GBR   2006  1   NA   1   NA
13  GBR   2006  1   NA   0   NA",
header = TRUE)

I would simply like to merge variables A and a, and, B and b.

EDIT: The problem is that I have to do this for more than 1000 variables, so I would like to avoid specifying either the column names which need not to be checked or the ones that do.

I was hoping for a solution that first splits the columns into a group for which there is no non-capitalised alternative, and a group for which there is.

As far as I understand the solution here:

Coalesce columns based on pattern in R

It still requires to provide the variables names for which the case needs to be checked. If I misunderstand this solution, which is very much possible, please let me know. In any case, as explained, I need a solution without specifically specifying the variables.

I found a good start here.

That solution has however a slightly different approach than the one I need.

How do I make such a variable merge conditional on something like tolower(varname) == varname ?

Desired output:

DT <- fread(
"ID country year A B
4   NLD   2002  0  1
5   NLD   2002  1  0
6   NLD   2006  1  1
7   NLD   2006  0  0 
8   NLD   2006  0  0 
9   GBR   2002  0  0 
10  GBR   2002  0  0
11  GBR   2002  0  0
12  GBR   2006  1  1 
13  GBR   2006  1  0 ",
header = TRUE)

Upvotes: 3

Views: 337

Answers (4)

Uwe
Uwe

Reputation: 42544

The OP is using data.table, so the question deserves a data.table answer.

The approach below is similar to sindri_baldur's answer in general but differs in important details. In particular,

  • it will also coalesce multiple columns like "CC", "cc", "cC" covering the different ways of writing variable names, e.g., upper case, lower case, as well as lower and upper camel case.
  • it will return a description of the columns which have been coalesced.
library(data.table)
library(magrittr) # piping is used to improve readability
names(DT) %>% 
  data.table(orig = ., lc = tolower(.)) %>% 
  .[, {
    if (.N > 1L) {
      new <- toupper(.BY)
      old <- setdiff(orig, new)
      DT[, (new) := fcoalesce(.SD), .SDcols = orig]
      DT[, (old) := NULL]
      sprintf("Coalesced %s onto %s", toString(old), new)
    }
  }, by = lc]
DT[]
   lc                 V1
1:  a Coalesced a onto A
2:  b Coalesced b onto B
DT[]
    ID country year A B
 1:  4     NLD 2002 0 1
 2:  5     NLD 2002 1 0
 3:  6     NLD 2006 1 1
 4:  7     NLD 2006 0 0
 5:  8     NLD 2006 0 0
 6:  9     GBR 2002 0 0
 7: 10     GBR 2002 0 0
 8: 11     GBR 2002 0 0
 9: 12     GBR 2006 1 1
10: 13     GBR 2006 1 0

For another use case

DT2 <- fread(
  "ID country year A b B a CC cc cC
4   NLD   2002  NA   1   NA   0   1  NA  NA    
5   NLD   2002  NA   0   NA   1  NA   2  NA
6   NLD   2006  NA   1   NA   1  NA  NA   3
7   NLD   2006  NA   0   NA   0  NA  NA  NA  
8   NLD   2006  0   NA   0   NA   1  NA  NA
9   GBR   2002  0   NA   0   NA  NA   2  NA
10  GBR   2002  0   NA   0   NA  NA  NA   3
11  GBR   2002  0   NA   0   NA   1  NA  NA
12  GBR   2006  1   NA   1   NA  NA   2  NA
13  GBR   2006  1   NA   0   NA  NA  NA   3",
  header = TRUE)
DT <- copy(DT2)

above code returns

   lc                       V1
1:  a       Coalesced a onto A
2:  b       Coalesced b onto B
3: cc Coalesced cc, cC onto CC
DT[]
    ID country year A B CC
 1:  4     NLD 2002 0 1  1
 2:  5     NLD 2002 1 0  2
 3:  6     NLD 2006 1 1  3
 4:  7     NLD 2006 0 0 NA
 5:  8     NLD 2006 0 0  1
 6:  9     GBR 2002 0 0  2
 7: 10     GBR 2002 0 0  3
 8: 11     GBR 2002 0 0  1
 9: 12     GBR 2006 1 1  2
10: 13     GBR 2006 1 0  3

Explanation

  1. The column names are turned into a data.table with a additional column lc of the lower case versions of the column names.
  2. Instead of a for loop we use grouping by = and data.table's feature to evaluate any expression, even with side effects. So, DT is updated by reference for each distinct value of lc within the scope of the data.table which was created on-the-fly in step 1 but only if there is more than one column in the group.

Future extensions

This approach can be extended to coalesce columns which use underscore, dots, or blanks "_", ".", " "in its column names, e.g., "var_1", "VAR.1", "Var 1".

Upvotes: 1

s_baldur
s_baldur

Reputation: 33488

A data.table-only solution - using a simple loop instead of reshaping data:

all_cols <- names(DT)
cols <- grep("[A-Z]", all_cols, value = TRUE)
for (col in cols) {
  snc <- all_cols[all_cols == tolower(col)]
  if (length(snc)) {
    DT[, (col) := fcoalesce(.SD), .SDcols = c(snc, col)]
    DT[, (setdiff(snc, col)) := NULL]
  }
}


> DT[]
    ID country year A B
 1:  4     NLD 2002 0 1
 2:  5     NLD 2002 1 0
 3:  6     NLD 2006 1 1
 4:  7     NLD 2006 0 0
 5:  8     NLD 2006 0 0
 6:  9     GBR 2002 0 0
 7: 10     GBR 2002 0 0
 8: 11     GBR 2002 0 0
 9: 12     GBR 2006 1 1
10: 13     GBR 2006 1 0

Upvotes: 2

John J.
John J.

Reputation: 1716

I can offer a solution using tidyverse functions. This is essentially the same as the solution offered by AntoniosK, but pivot_longer and pivot_wider are the preferred alternatives to spread and gather.

library(dplyr)
library(tidyr)

DT %>% 
    mutate(UNIQUEID = row_number()) %>% 
    mutate_all(as.character) %>% 
    pivot_longer(cols = -UNIQUEID) %>% 
    mutate(name = stringr::str_to_upper(name)) %>% 
    filter(!is.na(value)) %>% 
    pivot_wider(names_from = name, values_from = value) %>% 
    type.convert(as.is=TRUE) %>% select(-UNIQUEID)

h/t @dario for the great suggestions.

Upvotes: 3

AntoniosK
AntoniosK

Reputation: 16121

Assuming that your example dataset represents your general case, this should work:

library(data.table)
library(tidyverse)

DT <- fread(
  "ID country year A b B a
4   NLD   2002  NA   1   NA   0
5   NLD   2002  NA   0   NA   1
6   NLD   2006  NA   1   NA   1
7   NLD   2006  NA   0   NA   0
8   NLD   2006  0   NA   0   NA
9   GBR   2002  0   NA   0   NA
10  GBR   2002  0   NA   0   NA
11  GBR   2002  0   NA   0   NA
12  GBR   2006  1   NA   1   NA
13  GBR   2006  1   NA   0   NA",
  header = TRUE)

# spot the column names to keep as they are
data.frame(x = names(DT), stringsAsFactors = F) %>%  # get actual column names of the dataset
  mutate(y = toupper(x)) %>%                         # get the upper values
  group_by(y) %>%                                    # for each upper value
  filter(n() == 1) %>%                               # count them and keep only the unique columns
  pull(x) -> fix_cols                                # store unique column names

DT %>%
  gather(col_name, value, -fix_cols) %>%             # reshape dataset
  mutate(col_name = toupper(col_name)) %>%           # change column names to upper case
  na.omit() %>%                                      # remove NA rows
  spread(col_name, value)                            # reshape again

#    ID country year A B
# 1   4     NLD 2002 0 1
# 2   5     NLD 2002 1 0
# 3   6     NLD 2006 1 1
# 4   7     NLD 2006 0 0
# 5   8     NLD 2006 0 0
# 6   9     GBR 2002 0 0
# 7  10     GBR 2002 0 0
# 8  11     GBR 2002 0 0
# 9  12     GBR 2006 1 1
# 10 13     GBR 2006 1 0

Upvotes: 1

Related Questions