Reputation: 2507

R regex: how to remove "*" only in between a group of variables

I have a group of variables, var:

> var
[1] "a1" "a2" "a3" "a4"

Here is what I want to achieve: use a regex to change strings such as this:

 3*a1 + a1*a2 + 4*a3*a4 + a1*a3

to

 3a1 + a1*a2 + 4a3*a4 + a1*a3

Basically, I want to remove every "*" that is not between two values in var. Thank you in advance.

Upvotes: 4

Answers (4)

BlueFeet

Reputation: 2507

Thanks to @alistaire for offering a solution with a non-capturing group. However, that solution relies on there being a space between the coefficient and the "+" in front of it. Here's my modified version based on his suggestion:

> ss <- "3*a1 + a1*a2+4*a3*a4 +2*a1*a3+ 4*a2*a3"
# my modified version
> gsub('((?:^|\\s|\\+|\\-)\\d)\\*(\\w)', '\\1\\2', ss) 
[1] "3a1 + a1*a2+4a3*a4 +2a1*a3+ 4a2*a3"

# alistaire's
> gsub('((?:^| )\\d)\\*(\\w)', '\\1\\2', ss)
[1] "3a1 + a1*a2+4*a3*a4 +2*a1*a3+ 4a2*a3"

Upvotes: 0

user557597

Reputation:

You can find (?<![\da-z])(\d+)\* and replace it with $1:

 (?<! [\da-z] )
 ( \d+ )                       # (1)
 \*

Or use ((?:[^\da-z]|^)\d+)\* for engines that do not support lookbehind assertions:

 (                             # (1 start)
      (?: [^\da-z] | ^ )
      \d+ 
 )                             # (1 end)
 \*

Leading assertions tend to be slow anyway.
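
Since the question is about R, here is a minimal sketch of how the two patterns above might be applied with gsub. The variable names (s, res_lookbehind, res_plain) are made up for illustration. The sketch assumes the PCRE engine (perl = TRUE), which the lookbehind needs, and writes the replacement as \\1 because R does not use the $1 form.

```r
# Sketch only: the same two patterns, written as R strings (backslashes doubled).
s <- "3*a1 + a1*a2 + 4*a3*a4 + a1*a3"

# Lookbehind version: a run of digits not preceded by a digit or a lowercase
# letter, followed by "*". Lookarounds require perl = TRUE in R.
res_lookbehind <- gsub("(?<![\\da-z])(\\d+)\\*", "\\1", s, perl = TRUE)

# Assertion-free version: the preceding character (or the start of the string)
# is captured together with the digits and put back by \\1.
# perl = TRUE is kept here as well so both patterns run on the same engine.
res_plain <- gsub("((?:[^\\da-z]|^)\\d+)\\*", "\\1", s, perl = TRUE)

print(res_lookbehind)
# [1] "3a1 + a1*a2 + 4a3*a4 + a1*a3"
print(res_plain)
# [1] "3a1 + a1*a2 + 4a3*a4 + a1*a3"
```

Both patterns only look at lowercase letters, so variable names starting with an uppercase letter would need the character class extended (for example to [\\dA-Za-z]).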

Benchmark

Regex1:   (?<![\da-z])(\d+)\*
Options:  < none >
Completed iterations:   100  /  100     ( x 1000 )
Matches found per iteration:   2
Elapsed Time:    1.09 s,   1087.84 ms,   1087844 µs


Regex2:   ((?:[^\da-z]|^)\d+)\*
Options:  < none >
Completed iterations:   100  /  100     ( x 1000 )
Matches found per iteration:   2
Elapsed Time:    0.77 s,   767.04 ms,   767042 µs

Upvotes: 3

alistaire

Reputation: 43344

Taking the equation as a string, one option is

gsub('((?:^| )\\d)\\*(\\w)', '\\1\\2', '3*a1 + a1*a2 + 4*a3*a4 + a1*a3')
# [1] "3a1 + a1*a2 + 4a3*a4 + a1*a3"

which looks for

a captured group of characters, ( ... )
- containing a non-capturing group, (?: ... )
  - containing the beginning of the line ^
  - or, |
  - a space (or \\s)
- followed by a digit 0-9, \\d.
The capturing group is followed by an asterisk, \\*,
followed by another capturing group ( ... )
- containing a word character (a letter, digit, or underscore), \\w.

It replaces the above with

the first captured group, \\1,
followed by the second captured group, \\2.

Adjust as necessary.
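
As one possible adjustment, the pattern as written expects a single digit right after the start of the string or a space, so a coefficient such as 12 or a term written as +4*a3 without a space is left alone. The sketch below loosens both points. The function name strip_coef_star is hypothetical and not part of the answer, and it assumes that a coefficient is a plain run of digits that is not glued to a preceding letter, digit, or underscore.

```r
# Hypothetical helper (an illustration, not the answer's own code).
strip_coef_star <- function(x) {
  # (^|[^[:alnum:]_])  start of the string, or one non-word character
  # ([0-9]+)           the whole run of digits, i.e. the coefficient
  # \\*                the asterisk to drop
  # The replacement puts back both captured groups and omits the asterisk.
  gsub("(^|[^[:alnum:]_])([0-9]+)\\*", "\\1\\2", x)
}

strip_coef_star("3*a1 + a1*a2 + 4*a3*a4 + a1*a3")
# [1] "3a1 + a1*a2 + 4a3*a4 + a1*a3"
strip_coef_star("12*a1 + a1*a2+4*a3*a4")
# [1] "12a1 + a1*a2+4a3*a4"
```

Because the digit in a1 is preceded by a letter, the asterisk in a1*a2 is never touched, without having to list the variable names.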

Upvotes: 1

Wiktor Stribiżew

Reputation: 626920

You can build a dynamic regex from var that matches and captures each * sitting between two of your variables, puts it back with a backreference in gsub, and removes all other asterisks:

var <- c("a1","a2","a3","a4")
s = "3*a1 + a1*a2 + 4*a3*a4 + a1*a3"
block = paste(var, collapse="|")
pat = paste0("\\b((?:", block, ")\\*)(?=\\b(?:", block, ")\\b)|\\*")
gsub(pat, "\\1", s, perl=TRUE)
## "3a1 + a1*a2 + 4a3*a4 + a1*a3"

See the IDEONE demo

Here is the regex:

\b((?:a1|a2|a3|a4)\*)(?=\b(?:a1|a2|a3|a4)\b)|\*

Details:

\b - leading word boundary
((?:a1|a2|a3|a4)\*) - Group 1 matching
- (?:a1|a2|a3|a4) - either one of your variables
- \* - asterisk
(?=\b(?:a1|a2|a3|a4)\b) - a lookahead (outside Group 1) that requires one of your variables to follow; otherwise this branch fails and the * is matched by the second branch of the alternation
| - or
\* - a "wild" literal asterisk to be removed.
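
The pieces described above can be wrapped into a small reusable function. This is only a sketch: keep_star_between is a hypothetical name, and it assumes the variable names consist of letters, digits, and underscores only, since they are pasted into the pattern without escaping.

```r
# Hypothetical wrapper around the pattern above (assumes plain variable names
# with no regex metacharacters in them).
keep_star_between <- function(x, vars) {
  block <- paste(vars, collapse = "|")
  # Branch 1 captures "<var>*" when another variable follows;
  # branch 2 matches any other asterisk, leaving Group 1 empty.
  pat <- paste0("\\b((?:", block, ")\\*)(?=\\b(?:", block, ")\\b)|\\*")
  # Replacing with \\1 restores captured asterisks and deletes the rest.
  gsub(pat, "\\1", x, perl = TRUE)
}

vars <- c("a1", "a2", "a3", "a4")
out <- keep_star_between(
  c("3*a1 + a1*a2 + 4*a3*a4 + a1*a3", "2*a2*a3 + a1*a4"),
  vars
)
print(out)
# [1] "3a1 + a1*a2 + 4a3*a4 + a1*a3" "2a2*a3 + a1*a4"
```

gsub is vectorized over x, so the same call handles a whole character vector of expressions.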

Upvotes: 2
