melbez
melbez

Reputation: 1000

Combining columns in R based on matching beginnings of column title names

I have a dataframe that looks somewhat like the following. A1U_sweet is actually the 19th column in the real dataframe, and C1U_sweet is the 39th column in the real dataframe. There are 20 columns beginning with A## and 20 beginning with C##.

A1U_sweet  A2F_dip  A3U_bbq  C1U_sweet  C2F_dip  C3U_bbq
1          2        1        NA         NA       NA
NA         NA       NA       4          1        2
2          4        7        NA         NA       NA

I would like to make additional columns that combine the A values and the C values. The resulting dataframe would include columns looking like B1U_sweet and B2F_dip.

A1U_sweet  A2F_dip  A3U_bbq  C1U_sweet  C2F_dip  C3U_bbq  B1U_sweet  B2F_dip
1          2        1        NA         NA       NA       1          2
NA         NA       NA       4          1        2        4          1
2          4        7        NA         NA       NA       2          4

Someone proposed I try the following code. The first two lines work, but after implementing the rest, I get an error message.

types <- grep('^A([0-9]|[12][0-9])[A-Z]_[a-z]+', names(df)) ## Get all "A" 
patterns
types <- substr(types, 2, Inf) ## Remove the "A"
for (tp in types) {
  aa <- df[[paste0('A', tp)]] ## "A" column
  cc <- df[[paste0('C', tp)]] ## "C" column
  df[[paste0('B', tp)]] <- ifelse(is.na(aa), aa, cc)
}

This is the error message:

Error in `[[<-.data.frame`(`*tmp*`, paste0("B", tp), value = logical(0)) : 
  replacement has 0 rows, data has 94
In addition: Warning message:
In is.na(aa) : is.na() applied to non-(list or vector) of type 'NULL'

The data does have 94 columns, but I don't see why that might be triggering this error. I'd appreciate any helping making this code run properly!

EDIT: This is what I've been doing so far. I have to go in and manually change the column names for each set of columns I want to combine. There has to be a better way!

df$B1U_sweetnsour<-A1U_sweetnsour
df$B1U_sweetnsour[is.na(df$B1U_sweetnsour)]<- C1U_sweetnsour[is.na(A1U_sweetnsour)]

Upvotes: 1

Views: 104

Answers (3)

Ista
Ista

Reputation: 10437

The task itself is not difficult or complicated, though it appears that way because of the way the data is arranged. When you see variable names that convey more than one piece of information it is often helpful to ask yourself if the data can be arranged in simpler way. This simple claim is at the heart of the popular "tidy" approach to data manipulation in R. While I'm not a fan of everything that has been done in the name of being "tidy", this core claim is sound, and you violate it (as you've done spectacularly here) only at the risk of making your analysis much more difficult than it needs to be.

A good first step is to re-arrange the data so that data is not encoded in the column names:

df <- read.table(
    text = "A1U_sweet  A2F_dip  A3U_bbq  C1U_sweet  C2F_dip  C3U_bbq
1          2        1        NA         NA       NA
NA         NA       NA       4          1        2
2          4        7        NA         NA       NA",
header = TRUE)

library(tidyr)

df <- data.frame(id = 1:nrow(df), df)
dfl <- gather(df, key = "key", value = "value", -id)
dfl <- separate(dfl, key, into = c("key", "kind", "type"), sep = c(1, 4))
df2 <- spread(dfl, key, value)
df2
##   id kind  type  A  C
## 1  1  1U_ sweet  1 NA
## 2  1  2F_   dip  2 NA
## 3  1  3U_   bbq  1 NA
## 4  2  1U_ sweet NA  4
## 5  2  2F_   dip NA  1
## 6  2  3U_   bbq NA  2
## 7  3  1U_ sweet  2 NA
## 8  3  2F_   dip  4 NA
## 9  3  3U_   bbq  7 NA

This might seem like a lot of work, but it makes the data much easier to work with, and not only for this particular operation.

Now that the data has been converted to a sane arrangement the actual task is very simple:

df2 <- transform(df2, B = ifelse(is.na(A), C, A))
df2
##   id kind  type  A  C B
## 1  1  1U_ sweet  1 NA 1
## 2  1  2F_   dip  2 NA 2
## 3  1  3U_   bbq  1 NA 1
## 4  2  1U_ sweet NA  4 4
## 5  2  2F_   dip NA  1 1
## 6  2  3U_   bbq NA  2 2
## 7  3  1U_ sweet  2 NA 2
## 8  3  2F_   dip  4 NA 4
## 9  3  3U_   bbq  7 NA 7

I strongly encourage you to leave the data in this arrangement, as other operations are likely to be much easy when the data is represented this way as well. If you must put it back (e.g., for display purposes) you can do so:

df <- gather(df2, key = "key", value = "value", A, B, C)
df <- unite(df, "key", key, kind, type, sep = "")
df <- spread(df, key, value)
df
##   id A1U_sweet A2F_dip A3U_bbq B1U_sweet B2F_dip B3U_bbq C1U_sweet C2F_dip
## 1  1         1       2       1         1       2       1        NA      NA
## 2  2        NA      NA      NA         4       1       2         4       1
## 3  3         2       4       7         2       4       7        NA      NA
##   C3U_bbq
## 1      NA
## 2       2
## 3      NA

While this approach is obviously more verbose than some alternatives, it has the virtue of addressing the root cause of the difficulty rather than showing how to muddle through and survive the consequences of sub-optimal initial choices.

Upvotes: 1

Parfait
Parfait

Reputation: 107767

Consider mapply to compare A columns and C columns elementwise and assign all B columns at once. And use sub which unlike gsub, sub only replaces first occurrence in case there are A's elsewhere in column header.

new_B_cols <- sub("A", "B", names(df)[grep("^A", names(df))])

replace_na <- function(aa, cc) {
     aa[is.na(aa)] <- cc[is.na(aa)]
     return(aa) 
}

df[new_B_cols] <- mapply(replace_na, df[grep("^A", names(df))], df[grep("^C", names(df))])

df[order(names(df))]
#   A1U_sweet A2F_dip A3U_bbq B1U_sweet B2F_dip B3U_bbq C1U_sweet C2F_dip C3U_bbq
# 1         1       2       1         1       2       1        NA      NA      NA
# 2        NA      NA      NA         4       1       2         4       1       2
# 3         2       4       7         2       4       7        NA      NA      NA

Upvotes: 1

Riley J. Graham
Riley J. Graham

Reputation: 76

Try using head(types) to see if your types object has the information you intended it to. If not, adding value=TRUE to your grep command may be the solution you're looking for.

types <- grep('^A([0-9]|[12][0-9])[A-Z]_[a-z]+', names(df), value=TRUE) 
types <- substr(types, 2, Inf) ## Remove the "A"
    for (tp in types) {
      aa <- df[[paste0('A', tp)]] ## "A" column
      cc <- df[[paste0('C', tp)]] ## "C" column
      df[[paste0('B', tp)]] <- ifelse(is.na(aa), aa, cc)
      }

Upvotes: 0

Related Questions