Marwah Al-kaabi

Reputation: 405

Using gsub function

I have a factor with many levels and I want to insert a colon between the digits in all my data, but I do not know how to use the gsub function for this. For example, the data I have looks like this:

ABCD*0801
ABCD*0701
ABCD*0902
ABCD*0311
ABCD*2001

and what I want is this:

ABCD*08:01
ABCD*07:01
ABCD*09:02
ABCD*03:11
ABCD*20:01

I used the code below, but I don't understand it:

gsub("(.{4})(.*)$", "\\1:\\2",hladata$DRB1_1)

Can you please help me?

Upvotes: 1

Views: 5819

Answers (2)

TimTeaFan

Reputation: 18541

This should work:

x <- as.factor(c("ABCD*0801",
                 "ABCD*0701",
                 "ABCD*0902",
                 "ABCD*0311",
                 "ABCD*2001"))

as.factor(gsub("(\\d{2}$)",":\\1", x))
#> [1] ABCD*08:01 ABCD*07:01 ABCD*09:02 ABCD*03:11 ABCD*20:01
#> Levels: ABCD*03:11 ABCD*07:01 ABCD*08:01 ABCD*09:02 ABCD*20:01

Created on 2021-07-30 by the reprex package (v0.3.0)

As @Roland points out in the comments, it’s more efficient to use sub on the factor's levels(), because the replacement then runs once per level rather than once per element:

x <- as.factor(c("ABCD*0801",
                 "ABCD*0701",
                 "ABCD*0902",
                 "ABCD*0311",
                 "ABCD*2001"))
 
levels(x) <- sub("(\\d{2}$)",":\\1", levels(x))
             
x
#> [1] ABCD*08:01 ABCD*07:01 ABCD*09:02 ABCD*03:11 ABCD*20:01
#> Levels: ABCD*03:11 ABCD*07:01 ABCD*08:01 ABCD*09:02 ABCD*20:01
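
To illustrate why working on levels() can pay off, here is a small sketch with repeated values. The vector y is made up for illustration and is not part of the original data; the point is only that the pattern is applied to the two distinct levels, not to every element.

```r
# Hypothetical factor with a repeated value (not from the original data)
y <- factor(c("ABCD*0801", "ABCD*0801", "ABCD*0701"))

# The regex only touches the distinct levels (two here, not three elements)
levels(y) <- sub("(\\d{2})$", ":\\1", levels(y))

# y is still a factor and every element picks up its renamed level
is.factor(y)
as.character(y)
levels(y)
```

On a long factor with few levels, this should do far less regex work than calling gsub on the whole vector and converting back with as.factor.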

Upvotes: 1

Chris Ruehlemann

Reputation: 21400

Here are two options with sub (since you have just one match per string, gsub, which is for multiple matches per string, is not necessary):

sub("(\\d{2})", "\\1:", x)

This works via backreference: the first two digits are matched and captured by the parentheses in the first argument, \\1 repeats them in the replacement argument, and a : is added after them.
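
As a quick way to see what the backreference refers to, the sketch below first extracts the first two-digit match and then performs the replacement. It assumes a plain character vector x2 (a made-up name) holding two of the example values; note that the pattern needs the capturing parentheses for \\1 to refer to anything.

```r
# Two of the example values as a plain character vector (x2 is illustrative)
x2 <- c("ABCD*0801", "ABCD*2001")

# The first run of two digits in each string: this is what \\1 will hold
first_two <- regmatches(x2, regexpr("\\d{2}", x2))

# Capture those two digits and put them back followed by a colon
with_colon <- sub("(\\d{2})", "\\1:", x2)

first_two
with_colon
```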

sub("(?<=\\d{2})(?=\\d{2})", ":", x, perl = TRUE)

This more complex solution works with lookaround: the lookbehind (?<=\\d{2}) looks for two digits on the left of the match while the lookahead (?=\\d{2}) looks for two digits on the right. At the position where both lookarounds match, a : is inserted.
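
A minimal sketch of the lookaround version on a plain character vector (x3 is a made-up name). The match itself is zero-width, so nothing is consumed and the colon is simply inserted at that position. Lookarounds need perl = TRUE, since R's default regex engine does not support them.

```r
# Two of the example values (x3 is illustrative)
x3 <- c("ABCD*0801", "ABCD*2001")

# Zero-width match: two digits behind, two digits ahead; insert ":" there
inserted <- sub("(?<=\\d{2})(?=\\d{2})", ":", x3, perl = TRUE)

inserted
```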

The code you used does not work because of the quantifier: {4} captures only the first four characters, so the colon ends up after ABCD. You need to change it to {7}, as there are seven characters before the point where you want to insert the :. It works in a similar way to the first option above, namely via backreference: \\1 repeats the first seven characters, captured in the first capturing group (.{7}), while the second backreference \\2 repeats the rest of the string, captured in the second group (.*); between them a : is added.

gsub("(.{7})(.*)$", "\\1:\\2", x)
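
To make the difference concrete, this sketch runs the original {4} pattern next to the corrected {7} pattern on two of the example values (x4 is a made-up name). It assumes every string has exactly seven characters before the insertion point, as in the question's data.

```r
# Two of the example values (x4 is illustrative)
x4 <- c("ABCD*0801", "ABCD*2001")

# Original attempt: the first group holds only "ABCD", so ":" lands too early
four <- gsub("(.{4})(.*)$", "\\1:\\2", x4)

# Corrected: the first group holds "ABCD*08" / "ABCD*20" (seven characters)
seven <- gsub("(.{7})(.*)$", "\\1:\\2", x4)

four   # "ABCD:*0801" "ABCD:*2001"
seven  # "ABCD*08:01" "ABCD*20:01"
```

Because this approach counts characters from the start of the string, it would break if the prefix length varied; the digit-based patterns above do not have that limitation.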

Data:

x <- as.factor(c("ABCD*0801",
                 "ABCD*0701",
                 "ABCD*0902",
                 "ABCD*0311",
                 "ABCD*2001"))

Upvotes: 3
