Roccer

Reputation: 919

How to repetitively replace substrings in variables in R

I've got the following task:

Treatment$V010 <- as.numeric(substr(Treatment$V010,1,2))
Treatment$V020 <- as.numeric(substr(Treatment$V020,1,2))
[...]
Treatment$V1000 <- as.numeric(substr(Treatment$V1000,1,2))

I have 100 variables, from $V010, $V020, $V030, ... to $V1000. They hold numbers of different lengths. I want to extract just the first two digits of each number and replace the old value with that two-digit number.

My data frame "Treatment" has 80 more variables that I did not mention here, so the operation should be applied only to the 100 variables listed above.

How can I do that? I could write that command 100 times but I am sure there is a better solution.

Upvotes: 0

Views: 620

Answers (2)

Henrik

Reputation: 67828

Here is a solution that selects the relevant columns by matching their names against a regular expression.

Regex explanation:
^: Start of string
V: Literal V
\\d{2}: Two digits (the pattern is not anchored at the end, so names with more digits, such as V010 or V1000, also match)
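
A quick way to check what the pattern selects is to run it against a few names. The names below are made up for illustration, and the stricter pattern ^V\\d{3,4}$ is an assumption about the naming scheme (V followed by three or four digits and nothing else), not something stated in the question:

```r
# Hypothetical column names, chosen only to illustrate the matching
nms <- c("V010", "V1000", "x010", "xV1000", "V5", "V10b")

# Pattern from the answer: starts with V, followed by at least two digits
loose <- nms[grepl("^V\\d{2}", nms)]

# Assumed stricter variant: V plus three or four digits, nothing after
strict <- nms[grepl("^V\\d{3,4}$", nms)]

loose
strict
```

If the data frame could contain names like V10b that should be left alone, the anchored variant is the safer choice.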

Treatment <- data.frame(V010 = c(120, 130), x010 = c(120, 130), xV1000 = c(111, 222), V1000 = c(111, 222))
Treatment
#   V010 x010 xV1000 V1000
# 1  120  120    111   111
# 2  130  130    222   222

# columns with a name that matches the pattern (logical vector)
idx <- grepl(x = names(Treatment), pattern = "^V\\d{2}")

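# keep the first two characters of each relevant column and convert back to numeric
# (lapply returns a list of columns, so this also works when only one column matches)
Treatment[idx] <- lapply(Treatment[idx], FUN = function(x) {
  as.numeric(substr(x, 1, 2))
})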

Treatment
#   V010 x010 xV1000 V1000
# 1   12  120    111    11
# 2   13  130    222    22
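
One caveat worth checking on real data: substr() works on the character form of a number, and as.character() may render some large round values in scientific notation (100000 can become "1e+05"), in which case the first two characters are no longer digits. A minimal sketch that sidesteps this by formatting explicitly, assuming non-negative whole numbers; first_two is a hypothetical helper name and the sample values are invented:

```r
# Hypothetical helper: first two digits of non-negative whole numbers
first_two <- function(x) {
  # scientific = FALSE forces plain digits; trim = TRUE suppresses padding
  as.numeric(substr(format(x, scientific = FALSE, trim = TRUE), 1, 2))
}

res <- first_two(c(120, 100000, 4567))
res
```

It can be dropped in place of the anonymous function in the call above if such values occur.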

Upvotes: 1

Jealie

Reputation: 6277

Alright, let's do it. First things first: since you want to work on specific columns of your data frame, you need to build their names in order to access them:

cnames = paste0('V',formatC(seq(10,1000,by=10), width = 3, format = "d", flag = "0"))

(cnames is a vector containing c('V010','V020', ..., 'V1000'))
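
It can be worth confirming that the generated names are what you expect before using them. The sketch below rebuilds the vector and compares it with an alternative construction via sprintf(); the sprintf() form is my own suggestion, not part of the original answer:

```r
# Names as built in the answer
cnames <- paste0("V", formatC(seq(10, 1000, by = 10), width = 3, format = "d", flag = "0"))

# Alternative construction: zero-pad to at least three digits
cnames_alt <- sprintf("V%03d", seq(10, 1000, by = 10))

length(cnames)                  # number of generated names
cnames[c(1, 2, 99, 100)]        # first two and last two names
identical(cnames, cnames_alt)   # do both constructions agree?
```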

Next, we will get their indexes:

coli=unlist(sapply(cnames, function (x) which(colnames(Treatment)==x)))

(coli is a vector containing the indexes in Treatment of the relevant columns)

Finally, we will apply your function over these columns:

Treatment[coli] = mapply(function (x) as.numeric(substr(x, 1, 2)), Treatment[coli])

Does it work?

PS: if anyone has a better/more concise way to do it, please tell me :)

EDIT:

The intermediate step is not necessary, as you can use the column names in cnames directly to select the relevant columns, i.e.

Treatment[cnames] = mapply(function (x) as.numeric(substr(x, 1, 2)), Treatment[cnames])

(The only advantage of converting column names to column indexes is when some of the columns are missing from the data frame: in that case, Treatment['non existing column'] fails with the error "undefined columns selected".)
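
If some of the columns might be missing, one way to keep the name-based version is to intersect the generated names with the names that actually exist. This is a sketch on a small made-up data frame, not the asker's data; the variable name present is my own:

```r
# Toy data frame: only two of the 100 expected columns exist
Treatment <- data.frame(V010 = c(120, 130), V020 = c(345, 456), other = c(1, 2))

cnames <- paste0("V", formatC(seq(10, 1000, by = 10), width = 3, format = "d", flag = "0"))

# Keep only the names that are really columns of Treatment
present <- intersect(cnames, names(Treatment))

# Apply the conversion to those columns; all others are left untouched
Treatment[present] <- lapply(Treatment[present], function(x) as.numeric(substr(x, 1, 2)))

Treatment
```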

Upvotes: 3
