Reputation: 2341
I have a (large) data frame as follows:
library(data.table)
DT <- fread(
"ID country year A b B a
4 NLD 2002 NA 1 NA 0
5 NLD 2002 NA 0 NA 1
6 NLD 2006 NA 1 NA 1
7 NLD 2006 NA 0 NA 0
8 NLD 2006 0 NA 0 NA
9 GBR 2002 0 NA 0 NA
10 GBR 2002 0 NA 0 NA
11 GBR 2002 0 NA 0 NA
12 GBR 2006 1 NA 1 NA
13 GBR 2006 1 NA 0 NA",
header = TRUE)
I would simply like to merge variables A
and a
, and, B
and b
.
EDIT: The problem is that I have to do this for more than 1000 variables, so I would like to avoid specifying either the column names which need not to be checked or the ones that do.
I was hoping for a solution that first splits the columns into a group for which there is no non-capitalised alternative, and a group for which there is.
As far as I understand the solution here:
Coalesce columns based on pattern in R
It still requires to provide the variables names for which the case needs to be checked. If I misunderstand this solution, which is very much possible, please let me know. In any case, as explained, I need a solution without specifically specifying the variables.
I found a good start here.
That solution has however a slightly different approach than the one I need.
How do I make such a variable merge conditional on something like tolower(varname) == varname
?
Desired output:
DT <- fread(
"ID country year A B
4 NLD 2002 0 1
5 NLD 2002 1 0
6 NLD 2006 1 1
7 NLD 2006 0 0
8 NLD 2006 0 0
9 GBR 2002 0 0
10 GBR 2002 0 0
11 GBR 2002 0 0
12 GBR 2006 1 1
13 GBR 2006 1 0 ",
header = TRUE)
Upvotes: 3
Views: 337
Reputation: 42544
The OP is using data.table
, so the question deserves a data.table
answer.
The approach below is similar to sindri_baldur's answer in general but differs in important details. In particular,
"CC", "cc", "cC"
covering the different ways of writing variable names, e.g., upper case, lower case, as well as lower and upper camel case.library(data.table)
library(magrittr) # piping is used to improve readability
names(DT) %>%
data.table(orig = ., lc = tolower(.)) %>%
.[, {
if (.N > 1L) {
new <- toupper(.BY)
old <- setdiff(orig, new)
DT[, (new) := fcoalesce(.SD), .SDcols = orig]
DT[, (old) := NULL]
sprintf("Coalesced %s onto %s", toString(old), new)
}
}, by = lc]
DT[]
lc V1 1: a Coalesced a onto A 2: b Coalesced b onto B
DT[]
ID country year A B 1: 4 NLD 2002 0 1 2: 5 NLD 2002 1 0 3: 6 NLD 2006 1 1 4: 7 NLD 2006 0 0 5: 8 NLD 2006 0 0 6: 9 GBR 2002 0 0 7: 10 GBR 2002 0 0 8: 11 GBR 2002 0 0 9: 12 GBR 2006 1 1 10: 13 GBR 2006 1 0
For another use case
DT2 <- fread(
"ID country year A b B a CC cc cC
4 NLD 2002 NA 1 NA 0 1 NA NA
5 NLD 2002 NA 0 NA 1 NA 2 NA
6 NLD 2006 NA 1 NA 1 NA NA 3
7 NLD 2006 NA 0 NA 0 NA NA NA
8 NLD 2006 0 NA 0 NA 1 NA NA
9 GBR 2002 0 NA 0 NA NA 2 NA
10 GBR 2002 0 NA 0 NA NA NA 3
11 GBR 2002 0 NA 0 NA 1 NA NA
12 GBR 2006 1 NA 1 NA NA 2 NA
13 GBR 2006 1 NA 0 NA NA NA 3",
header = TRUE)
DT <- copy(DT2)
above code returns
lc V1 1: a Coalesced a onto A 2: b Coalesced b onto B 3: cc Coalesced cc, cC onto CC
DT[]
ID country year A B CC 1: 4 NLD 2002 0 1 1 2: 5 NLD 2002 1 0 2 3: 6 NLD 2006 1 1 3 4: 7 NLD 2006 0 0 NA 5: 8 NLD 2006 0 0 1 6: 9 GBR 2002 0 0 2 7: 10 GBR 2002 0 0 3 8: 11 GBR 2002 0 0 1 9: 12 GBR 2006 1 1 2 10: 13 GBR 2006 1 0 3
lc
of the lower case versions of the column names.for
loop we use grouping by =
and data.table
's feature to evaluate any expression, even with side effects. So, DT
is updated by reference for each distinct value of lc
within the scope of the data.table
which was created on-the-fly in step 1 but only if there is more than one column in the group.This approach can be extended to coalesce columns which use underscore, dots, or blanks "_", ".", " "
in its column names, e.g., "var_1", "VAR.1", "Var 1"
.
Upvotes: 1
Reputation: 33488
A data.table
-only solution - using a simple loop instead of reshaping data:
all_cols <- names(DT)
cols <- grep("[A-Z]", all_cols, value = TRUE)
for (col in cols) {
snc <- all_cols[all_cols == tolower(col)]
if (length(snc)) {
DT[, (col) := fcoalesce(.SD), .SDcols = c(snc, col)]
DT[, (setdiff(snc, col)) := NULL]
}
}
> DT[]
ID country year A B
1: 4 NLD 2002 0 1
2: 5 NLD 2002 1 0
3: 6 NLD 2006 1 1
4: 7 NLD 2006 0 0
5: 8 NLD 2006 0 0
6: 9 GBR 2002 0 0
7: 10 GBR 2002 0 0
8: 11 GBR 2002 0 0
9: 12 GBR 2006 1 1
10: 13 GBR 2006 1 0
Upvotes: 2
Reputation: 1716
I can offer a solution using tidyverse functions. This is essentially the same as the solution offered by AntoniosK, but pivot_longer
and pivot_wider
are the preferred alternatives to spread
and gather
.
library(dplyr)
library(tidyr)
DT %>%
mutate(UNIQUEID = row_number()) %>%
mutate_all(as.character) %>%
pivot_longer(cols = -UNIQUEID) %>%
mutate(name = stringr::str_to_upper(name)) %>%
filter(!is.na(value)) %>%
pivot_wider(names_from = name, values_from = value) %>%
type.convert(as.is=TRUE) %>% select(-UNIQUEID)
h/t @dario for the great suggestions.
Upvotes: 3
Reputation: 16121
Assuming that your example dataset represents your general case, this should work:
library(data.table)
library(tidyverse)
DT <- fread(
"ID country year A b B a
4 NLD 2002 NA 1 NA 0
5 NLD 2002 NA 0 NA 1
6 NLD 2006 NA 1 NA 1
7 NLD 2006 NA 0 NA 0
8 NLD 2006 0 NA 0 NA
9 GBR 2002 0 NA 0 NA
10 GBR 2002 0 NA 0 NA
11 GBR 2002 0 NA 0 NA
12 GBR 2006 1 NA 1 NA
13 GBR 2006 1 NA 0 NA",
header = TRUE)
# spot the column names to keep as they are
data.frame(x = names(DT), stringsAsFactors = F) %>% # get actual column names of the dataset
mutate(y = toupper(x)) %>% # get the upper values
group_by(y) %>% # for each upper value
filter(n() == 1) %>% # count them and keep only the unique columns
pull(x) -> fix_cols # store unique column names
DT %>%
gather(col_name, value, -fix_cols) %>% # reshape dataset
mutate(col_name = toupper(col_name)) %>% # change column names to upper case
na.omit() %>% # remove NA rows
spread(col_name, value) # reshape again
# ID country year A B
# 1 4 NLD 2002 0 1
# 2 5 NLD 2002 1 0
# 3 6 NLD 2006 1 1
# 4 7 NLD 2006 0 0
# 5 8 NLD 2006 0 0
# 6 9 GBR 2002 0 0
# 7 10 GBR 2002 0 0
# 8 11 GBR 2002 0 0
# 9 12 GBR 2006 1 1
# 10 13 GBR 2006 1 0
Upvotes: 1