BlueFeet

Reputation: 2507

R regex: how to remove "*" only in between a group of variables

I have a character vector of variable names, var:

> var
[1] "a1" "a2" "a3" "a4"

Here is what I want to achieve: use a regex to change strings such as this:

 3*a1 + a1*a2 + 4*a3*a4 + a1*a3

to

 3a1 + a1*a2 + 4a3*a4 + a1*a3

Basically, I want to remove every "*" that does not sit between two of the values in var. Thank you in advance.

Upvotes: 4

Views: 212

Answers (4)

BlueFeet

Reputation: 2507

Thanks to @alistaire for offering a solution with a non-capturing group. However, that solution relies on there being a space between the coefficient and the "+" in front of it. Here's my modified solution based on his suggestion:

> ss <- "3*a1 + a1*a2+4*a3*a4 +2*a1*a3+ 4*a2*a3"
# my modified version
> gsub('((?:^|\\s|\\+|\\-)\\d)\\*(\\w)', '\\1\\2', ss) 
[1] "3a1 + a1*a2+4a3*a4 +2a1*a3+ 4a2*a3"

# alistaire's
> gsub('((?:^| )\\d)\\*(\\w)', '\\1\\2', ss)
[1] "3a1 + a1*a2+4*a3*a4 +2*a1*a3+ 4a2*a3"

Upvotes: 0

user557597

Reputation:

You can find (?<![\da-z])(\d+)\* and replace it with $1:

 (?<! [\da-z] )
 ( \d+ )                       # (1)
 \*

Or use ((?:[^\da-z]|^)\d+)\* for engines that do not support lookbehind assertions:

 (                             # (1 start)
      (?: [^\da-z] | ^ )
      \d+ 
 )                             # (1 end)
 \*

Leading assertions are bad for performance anyway.

Benchmark

Regex1:   (?<![\da-z])(\d+)\*
Options:  < none >
Completed iterations:   100  /  100     ( x 1000 )
Matches found per iteration:   2
Elapsed Time:    1.09 s,   1087.84 ms,   1087844 µs


Regex2:   ((?:[^\da-z]|^)\d+)\*
Options:  < none >
Completed iterations:   100  /  100     ( x 1000 )
Matches found per iteration:   2
Elapsed Time:    0.77 s,   767.04 ms,   767042 µs
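
Since the question is about R, here is a minimal sketch of how both patterns might be passed to gsub. It assumes perl = TRUE (needed for the lookbehind, and used for the second pattern too so that \d inside the bracket is read as a digit class), uses \\1 rather than $1 as the replacement, and borrows the test string from another answer in this thread. It also assumes variable names are lowercase, as the [\da-z] class implies.

```r
# Test string with and without spaces around "+", taken from this thread.
s <- "3*a1 + a1*a2+4*a3*a4 +2*a1*a3+ 4*a2*a3"

# Lookbehind version: digits not preceded by a digit or a lowercase letter.
res_lookbehind <- gsub("(?<![\\da-z])(\\d+)\\*", "\\1", s, perl = TRUE)

# Lookbehind-free version: the preceding character (or line start) is
# captured together with the digits and written back via \\1.
res_group <- gsub("((?:[^\\da-z]|^)\\d+)\\*", "\\1", s, perl = TRUE)

print(res_lookbehind)
# [1] "3a1 + a1*a2+4a3*a4 +2a1*a3+ 4a2*a3"
print(identical(res_lookbehind, res_group))
# [1] TRUE
```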

Upvotes: 3

alistaire

Reputation: 43344

Taking the equation as a string, one option is

gsub('((?:^| )\\d)\\*(\\w)', '\\1\\2', '3*a1 + a1*a2 + 4*a3*a4 + a1*a3')
# [1] "3a1 + a1*a2 + 4a3*a4 + a1*a3"

which looks for

  • a captured group of characters, ( ... )
    • containing a non-capturing group, (?: ... )
      • containing the beginning of the line ^
      • or, |
      • a space (or \\s)
    • followed by a single digit 0-9, \\d.
  • The capturing group is followed by an asterisk, \\*,
  • followed by another capturing group ( ... )
    • containing a word character (alphanumeric or underscore), \\w.

It replaces the above with

  • the first captured group, \\1,
  • followed by the second captured group, \\2.

Adjust as necessary.
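
As one possible adjustment, the sketch below swaps \\d for \\d+ so that multi-digit coefficients are handled, and uses \\s in place of the literal space. The function name strip_coef_star is hypothetical, and perl = TRUE is an assumption made here for predictable escape handling; the pattern above was shown without it.

```r
# Hypothetical helper: drop the "*" that directly follows a numeric
# coefficient at the start of the string or after whitespace.
strip_coef_star <- function(s) {
  # \\d+ (instead of \\d) allows coefficients with several digits.
  gsub("((?:^|\\s)\\d+)\\*(\\w)", "\\1\\2", s, perl = TRUE)
}

strip_coef_star("12*a1 + a1*a2 + 4*a3*a4")
# [1] "12a1 + a1*a2 + 4a3*a4"
```

Like the original pattern, this still expects whitespace (or the start of the string) before each coefficient.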

Upvotes: 1

Wiktor Stribiżew

Reputation: 626920

You can build a dynamic regex from var that matches and captures the *s sitting between two of your variables, reinserts them with a backreference in gsub, and removes all other asterisks:

var <- c("a1", "a2", "a3", "a4")
s <- "3*a1 + a1*a2 + 4*a3*a4 + a1*a3"
# Join the variable names into one alternation: a1|a2|a3|a4
block <- paste(var, collapse = "|")
pat <- paste0("\\b((?:", block, ")\\*)(?=\\b(?:", block, ")\\b)|\\*")
gsub(pat, "\\1", s, perl = TRUE)
## [1] "3a1 + a1*a2 + 4a3*a4 + a1*a3"


Here is the regex:

\b((?:a1|a2|a3|a4)\*)(?=\b(?:a1|a2|a3|a4)\b)|\*

Details:

  • \b - leading word boundary
  • ((?:a1|a2|a3|a4)\*) - Group 1, capturing
    • (?:a1|a2|a3|a4) - any one of your variables
    • \* - an asterisk
  • (?=\b(?:a1|a2|a3|a4)\b) - a positive lookahead requiring that one of your variables follows (otherwise this branch fails and the * is matched by the second branch of the alternation)
  • | - or
  • \* - a "wild" literal asterisk to be removed.
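
The steps above can be wrapped in a small function so that the pattern is rebuilt for any set of names. This is only a sketch: keep_var_stars is a hypothetical name, and it assumes every variable name consists of letters, digits, or underscores, so no regex escaping is attempted.

```r
# Hypothetical helper: keep "*" only between two names listed in `vars`.
# Assumes the names contain no regex metacharacters.
keep_var_stars <- function(s, vars) {
  block <- paste(vars, collapse = "|")
  # Branch 1 captures "<var>*" when another variable follows;
  # branch 2 matches any other "*", for which \\1 is empty.
  pat <- paste0("\\b((?:", block, ")\\*)(?=\\b(?:", block, ")\\b)|\\*")
  gsub(pat, "\\1", s, perl = TRUE)
}

keep_var_stars("3*a1 + a1*a2 + 4*a3*a4 + a1*a3", c("a1", "a2", "a3", "a4"))
# [1] "3a1 + a1*a2 + 4a3*a4 + a1*a3"
keep_var_stars("2*x*y + 5*y*z", c("x", "y"))
# [1] "2x*y + 5yz"
```

In the second call z is not in the list, so the asterisk in y*z is treated as a "wild" one and removed.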

Upvotes: 2
