bob

Reputation: 127

R: Why does \s{2} return "\"," in regex match? And new line not matching with \\n or \\r and other variations

I'm having trouble with (1) using a dynamic variable in a regex pattern and (2) matching "\" or new line. I'd really appreciate any help!


Example: Ultimately, by whatever means possible, I'd like to match the word Administrator in the text below. The data's class is character (it was originally a list and was coerced to character using as.character()). Here's the text snippet:

[1] "c(\"Silk Road Forums\", \"\", \"*\", \"Welcome, Guest. Please login or register.\", \"[          ] [          ] [Forever] [Login]\", \"Login with username, password and session length\", \"[                    ]  [Search] \", \"\", \"  â\\200¢ Home\", \"  â\\200¢ Search\", \"  â\\200¢ Login\", \"  â\\200¢ Register\", \"\", \"\", \"  â\\200¢ Silk Road Forums »\", \"  â\\200¢ Profile of Dread Pirate Roberts »\", \"  â\\200¢ Summary\", \"\", \"  â\\200¢ Profile Info\", \"      â–¡ Summary\", \"      â–¡ Show Stats\", \"      â–¡ Show Posts...\", \"          â\\230† Messages\", \n\"          â\\230† Topics\", \"          â\\230† Attachments\", \"\", \"[profile_sm]Summary\", \"\", \"Dread Pirate Roberts Administrator\", \"\", \"[index]\", \"      â–¡ SMF | SMF © 2013, Simple Machines\"\n)"

Attempts / Problems

  1. Tried to Match New Line: In that messy text (see above), I was able to match [profile_sm]Summary\. I tried to match what comes next in that text by using:

    • \\n -- failed
    • \\n\\r -- failed
    • \\n|\\r -- failed
    • \\r\\n -- failed
    • \\r|\\n -- failed

    It seems like there's no new line after it, so I tried to match the literal ", (a quotation mark followed by a comma) that comes after those characters in the text. I also tried these two and they both failed: \\and \\"\".

  2. Tried to Use Variable: I tried to use a variable X that holds Dread Pirate Roberts, taken from a previous regex match and turned into a vector. I tried to just put X into the regex pattern, but it obviously didn't work. Is there a way to create a pattern using X? For example: match one of the values found in X.


I would need to know how to solve both of these problems / methods for other parts of my current project and would really love pointers and guidance. Thank you!


Edit Note: Saw that folks had trouble understanding this post so I edited to make it more legible. Thanks and shout-out to @Wiktor Stribiżew for reading through the original post despite the difficult wording and providing the answer! :)

Upvotes: 0

Views: 67

Answers (1)

Wiktor Stribiżew

Reputation: 626870

Your text contains only two newlines. You can easily check this using cat(text), which prints three lines:

c("Silk Road Forums", "", "*", "Welcome, Guest. Please login or register.", "[ ] [ ] [Forever] [Login]", "Login with username, password and session length", "[ ] [Search] ", "", " � Home", " � Search", " � Login", " � Register", "", "", " � Silk Road Forums »", " � Profile of Dread Pirate Roberts »", " � Summary", "", " � Profile Info", " □ Summary", " □ Show Stats", " □ Show Posts...", " � Messages", 
" � Topics", " � Attachments", "", "[profile_sm]Summary", "", "Dread Pirate Roberts Administrator", "", "[index]", " □ SMF | SMF © 2013, Simple Machines"
)
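The same check can also be done programmatically. Here is a minimal sketch that counts the newline characters and tests whether one directly follows [profile_sm]Summary; the short string is a stand-in for the full text (an assumption for brevity), and n_newlines and newline_follows are illustrative names.

```r
# Shortened stand-in for the question's string (an assumption, not the full data).
text <- "c(\"[profile_sm]Summary\", \"\", \n\"Dread Pirate Roberts Administrator\"\n)"

# Count literal newline characters; fixed = TRUE treats "\n" as plain text.
n_newlines <- lengths(regmatches(text, gregexpr("\n", text, fixed = TRUE)))
print(n_newlines)

# Is there a newline directly after "[profile_sm]Summary"?
newline_follows <- grepl("[profile_sm]Summary\n", text, fixed = TRUE)
print(newline_follows)
```

In this stand-in the count is 2 and the second check is FALSE, which mirrors the situation in the full text: the character right after [profile_sm]Summary is a quotation mark, not a line break.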

So, as you can see, there is no newline after [profile_sm]Summary. Note that to match [ in a regex pattern you need to escape it. What follows is a mix of spaces, " characters and commas. You may match these chars using a [,"\s]+ pattern. The X variable will hold Dread Pirate Roberts, so, to extract Administrator, you may use

\[profile_sm]Summary[",\s]*Dread Pirate Roberts\s+\K[^"]+

See the regex demo.

Details

  • \[profile_sm]Summary - [profile_sm]Summary string
  • [",\s]* - 0+ ", , or whitespace chars
  • Dread Pirate Roberts - a literal string
  • \s+ - 1+ whitespaces
  • \K - match reset operator that discards text matched so far in the match memory buffer
  • [^"]+ - 1+ chars other than ". If you only need to match letters, digits or _, you may use \w+ instead of this pattern (written as \\w+ in the R string literal).
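As a rough illustration of the \w+ alternative from the last bullet, the pattern could be written as an R string literal like this. The short text is a stand-in for the full string (an assumption), and role is just an illustrative name.

```r
# Shortened stand-in for the relevant part of the text (an assumption).
text <- "\"[profile_sm]Summary\", \"\", \"Dread Pirate Roberts Administrator\", \"\""

# Every regex backslash is doubled inside the R string literal,
# and the double quote in the bracket expression is escaped as \".
pattern <- "\\[profile_sm]Summary[\",\\s]*Dread Pirate Roberts\\s+\\K\\w+"

# \K is a PCRE feature, so perl = TRUE is required.
role <- regmatches(text, regexpr(pattern, text, perl = TRUE))
print(role)
```

With \w+ the match stops at the first character that is not a letter, digit or underscore, so a multi-word role would be cut off after its first word; [^"]+ keeps everything up to the closing quote.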

R demo:

text <- "c(\"Silk Road Forums\", \"\", \"*\", \"Welcome, Guest. Please login or register.\", \"[ ] [ ] [Forever] [Login]\", \"Login with username, password and session length\", \"[ ] [Search] \", \"\", \" â\200¢ Home\", \" â\200¢ Search\", \" â\200¢ Login\", \" â\200¢ Register\", \"\", \"\", \" â\200¢ Silk Road Forums »\", \" â\200¢ Profile of Dread Pirate Roberts »\", \" â\200¢ Summary\", \"\", \" â\200¢ Profile Info\", \" â–¡ Summary\", \" â–¡ Show Stats\", \" â–¡ Show Posts...\", \" â\230† Messages\", \n\" â\230† Topics\", \" â\230† Attachments\", \"\", \"[profile_sm]Summary\", \"\", \"Dread Pirate Roberts Administrator\", \"\", \"[index]\", \" â–¡ SMF | SMF © 2013, Simple Machines\"\n)"
X <- "Dread Pirate Roberts"
regex <- paste0('\\[profile_sm]Summary[",\\s]*',X,'\\s+\\K[^"]+')
regmatches(text, regexpr(regex, text, perl=TRUE))
## => [1] "Administrator"
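
If X holds several values and any one of them should match, one possible approach is to join them into an alternation group. The sketch below also wraps each value in PCRE's \Q...\E so that characters such as . or ( are taken literally. The second name in X is invented for illustration, and alternation and result are hypothetical helper names.

```r
# Shortened stand-in for the relevant part of the text (an assumption).
text <- "\"[profile_sm]Summary\", \"\", \"Dread Pirate Roberts Administrator\", \"\""

# Vector of candidate names; the second entry is made up for this example.
X <- c("Dread Pirate Roberts", "Some.Other (User)")

# Quote each value with \Q...\E (PCRE only) and join the values with |
# inside a non-capturing group.
alternation <- paste0("(?:", paste0("\\Q", X, "\\E", collapse = "|"), ")")

regex <- paste0('\\[profile_sm]Summary[",\\s]*', alternation, '\\s+\\K[^"]+')
result <- regmatches(text, regexpr(regex, text, perl = TRUE))
print(result)
```

The \Q...\E quoting breaks down if a value itself contains \E, so values of that kind would need a different escaping step.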

Upvotes: 1
