Usha Kota

Reputation: 43

gsubfn() for variants of pattern string does not give expected output

I'm trying to match a partial pattern of the variable names in my data set and replace them all with another pattern using gsubfn().

I'm using R version 4.0.3 (2020-10-10).

The code below shows a sample of the variable names in the data set and how I tried to replace them:

replace_str = c("Race..American.India", "Race.White")
gsubd_str = gsubfn(pattern = "Race..| Race.", "R_", x = replace_str)

With the pattern above, my output is:

> gsubd_str
[1] "R_American.India" "R_hite"

However, if I swap the order of the alternatives in the pattern:

gsubd_str = gsubfn(pattern = "Race.| Race..", "R_", x = replace_str)

then my output is:

> gsubd_str
[1] "R_.American.India" "R_White"

In both cases, gsubfn() does not seem to behave as I expected. In the second case, at least, gsubfn() replaced the match as soon as the left-hand side of "|" matched. In the first case, however, the match in "Race.White" also consumed the "W", one character more than I expected.

I'm not sure whether I have understood gsubfn() correctly.

Upvotes: 0

Views: 90

Answers (1)

Benjamin Christoffersen

Reputation: 4841

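It is the space you added after "|". The space is part of the regular expression, so the second alternative only matches when "Race" is preceded by a literal space, which never happens in your data. The behavior of gsubfn is exactly like gsub, as the documentation states: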

# with the space
x <-  c("Race..American.India", "Race.White")
gsub("Race..| Race.", "R_", x)
#R> [1] "R_American.India" "R_hite"  
gsub("Race.| Race..", "R_", x)
#R> [1] "R_.American.India" "R_White" 

# without the space
gsub("Race..|Race.", "R_", x)
#R> [1] "R_American.India" "R_hite"  
gsub("Race.|Race..", "R_", x)
#R> [1] "R_American.India" "R_hite"  
gsubfn("Race..|Race.", "R_", x)
#R> [1] "R_American.India" "R_hite"  
gsubfn("Race..|Race.", "R_", x)
#R> [1] "R_American.India" "R_hite"  
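To see why the order of the alternatives stops mattering once the space is removed, it can help to look at what each pattern actually matches. The sketch below is an illustration using base R's regmatches() and regexpr(). The perl = TRUE comparison is an addition, on the understanding that R's default engine prefers the longest alternative while PCRE takes the first alternative that matches:

```r
x <- c("Race..American.India", "Race.White")

# The second alternative starts with a literal space, so it never matches
# here; only "Race." can match.
m_space <- regmatches(x, regexpr("Race.| Race..", x))

# Without the space, the default engine picks the longer alternative,
# regardless of the order in which the alternatives are written.
m_default <- regmatches(x, regexpr("Race.|Race..", x))

# PCRE tries the alternatives left to right and stops at the first match.
m_perl <- regmatches(x, regexpr("Race.|Race..", x, perl = TRUE))

m_space    # "Race."  "Race."
m_default  # "Race.." "Race.W"
m_perl     # "Race."  "Race."
```

So with the default engine, "Race.|Race.." and "Race..|Race." match the same text, while the pattern with the space effectively reduces to its first alternative.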

Though, you can just do:

gsub("Race..?", "R_", x)
#R> [1] "R_American.India" "R_hite"

You might also want to escape the dots as \\. so that they match only a literal dot. Otherwise, you may end up with strange results like:

gsub("Race..?", "R_", c("Racehorses", "Racecourse", "Racerunner"))
#R> [1] "R_rses" "R_urse" "R_nner"
gsub("Race\\.\\.?", "R_", c("Racehorses", "Racecourse", "Racerunner"))
#R> [1] "Racehorses" "Racecourse" "Racerunner"

# still works
gsub("Race\\.\\.?", "R_", x)
#R> [1] "R_American.India" "R_White"
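Since the goal is to rename variables, a slightly stricter pattern may be safer. The following is a minimal sketch and not part of the code above: it assumes every name to be renamed starts with "Race" followed by one or more literal dots, and it uses sub() with a ^ anchor so that only that prefix is touched. The vector vars is a made-up example.

```r
# Hypothetical variable names; the last one should be left alone.
vars <- c("Race..American.India", "Race.White", "Racehorses")

# ^ anchors the match at the start of the name,
# \\.+ matches one or more literal dots.
renamed <- sub("^Race\\.+", "R_", vars)

renamed  # "R_American.India", "R_White", "Racehorses"
```

Using sub() instead of gsub() replaces at most one match per name, which is enough here because the pattern is anchored at the start.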

Original answer (superseded by the explanation above)

In both the cases, my thoughts are that gsubfn() is not behaving as expected. ...

Yes, this seems like an issue with gsubfn. It works with gsub as shown below. A workaround is to change the regular expression to "Race..?":

# works fine w/ gsub
x <-  c("Race..American.India", "Race.White")
gsub("Race..| Race.", "R_", x)
#R> [1] "R_American.India" "R_hite"  
gsub("Race.|Race..", "R_", x)
#R> [1] "R_American.India" "R_hite" 

# does not work with gsubfn
library(gsubfn)
gsubfn("Race..| Race.", "R_", x)
#R> [1] "R_American.India" "R_hite"
gsubfn("Race.| Race..", "R_", x)
#R> [1] "R_.American.India" "R_White" 

# you can do
gsubfn("Race..?", "R_", x)
#R> [1] "R_American.India" "R_hite" 

It is clearly stated in the manual page of gsubfn that:

If replacement is a string then it acts like gsub.

Thus, this must be a bug, or maybe this note from the documentation is the catch:

Note that if the "R" engine is used and if backref is non-negative then internally the pattern will be parenthesized.

Upvotes: 2
