Carrol

Reputation: 1285

Clean string using gsub and multiple conditions

I already saw this one, but it is not quite what I need:


Situation: Using gsub, I want to clean up strings. These are my conditions:

  1. Keep words only (no digits or "weird" symbols)
  2. Treat words joined by exactly one of ' - _ $ . as a single word. For example: don't, re-loading, come_home, something$col
  3. Keep specific names, such as package::function or package::function()

So, I have the following:

  1. [^A-Za-z]
  2. ([a-z]+)(-|'|_|$)([a-z]+)
  3. ([a-z]+(_*)[a-z]+)(::)([a-z]+(_*)[a-z]+)(\(\))*

Examples:

If I have the following:

# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't
# Needs to handle NA for desc::desc_get()
# Update href of toc anchors , use "-" instead "."
# Keep something$col or here_you::must_stay

I would like to have

Re-loading pkgdown while it's running causes weird behaviour with the context cache don't
Needs to handle NA for desc::desc_get()
Update href of toc anchors use instead
Keep something$col or here_you::must_stay

Problems: I have several:

A. The second expression is not working properly. Right now, it only works with - or '

B. How do I combine all of these in a single gsub in R? I want to do something like gsub(myPatterns, myText), but don't know how to fix and combine all of this.

Upvotes: 2

Views: 443

Answers (2)

Wiktor Stribiżew

Reputation: 627065

You can use

trimws(gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE))

See the regex demo. Or, to also collapse runs of whitespace into a single space, use

trimws(gsub("\\s{2,}", " ", gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE)))

Details

  • (?:\w+::\w+(?:\(\))?|\p{L}+(?:[-'_$]\p{L}+)*)(*SKIP)(*F): match either of the two patterns:
    • \w+::\w+(?:\(\))? - 1+ word chars, ::, 1+ word chars and an optional () substring
    • | - or
    • \p{L}+ - one or more Unicode letters
    • (?:[-'_$]\p{L}+)* - 0+ repetitions of -, ', _ or $ and then 1+ Unicode letters
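  • (*SKIP)(*F) - discards the text matched so far and makes the regex engine resume searching from the end of that text, so the protected tokens are never removed
  • | - or
  • [^\p{L}\s] - any char other than a Unicode letter or whitespace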
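
To make the effect of (*SKIP)(*F) easier to see in isolation, here is a minimal sketch that compares a plain "delete every non-letter" pattern with a version that first protects package::function() tokens. The input string and the variable names (x, naive, protected) are made up for illustration and do not come from the question.

```r
# Hypothetical input, not taken from the original post
x <- "Call desc::desc_get() (v2)!"

# Naive version: delete everything that is not a letter or whitespace
naive <- gsub("[^\\p{L}\\s]", "", x, perl = TRUE)

# Protected version: pkg::fun and pkg::fun() are matched first and skipped,
# so only the remaining non-letter, non-space characters are deleted
protected <- gsub("\\w+::\\w+(?:\\(\\))?(*SKIP)(*F)|[^\\p{L}\\s]", "", x, perl = TRUE)

naive      # "Call descdescget v"
protected  # "Call desc::desc_get() v"
```

The backtracking verbs are a PCRE feature, so perl = TRUE is required for this to work.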

See the R demo:

myText <- c("# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't",
"# Needs to handle NA for desc::desc_get()",
'# Update href of toc anchors , use "-" instead "."',
"# Keep something$col or here_you::must_stay")
trimws(gsub("\\s{2,}", " ", gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE)))

Output:

[1] "Re-loading pkgdown while it's running causes weird behaviour with the context cache don't"
[2] "Needs to handle NA for desc::desc_get()"                                                  
[3] "Update href of toc anchors use instead"                                                   
[4] "Keep something$col or here_you::must_stay"    
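
Regarding problem A in the question, at least for the $ case: an unescaped $ inside an alternation such as (-|'|_|$) is an end-of-string anchor, not a literal dollar sign. Inside a bracket expression such as [-'_$], or when escaped as \\$, it is literal. A small check using the something$col example from the question (the variable names are only illustrative):

```r
x <- "something$col"

# Unescaped $ in the alternation is an anchor, so the literal "$" is never matched
anchor_alt  <- grepl("[a-z]+(-|'|_|$)[a-z]+", x, perl = TRUE)    # FALSE

# Inside a bracket expression, $ is an ordinary character
in_brackets <- grepl("[a-z]+[-'_$][a-z]+", x, perl = TRUE)       # TRUE

# Escaping it in the alternation also works
escaped_alt <- grepl("[a-z]+(-|'|_|\\$)[a-z]+", x, perl = TRUE)  # TRUE
```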

Upvotes: 4

r2evans

Reputation: 160577

Alternatively,

txt <- c("# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't", 
         "# Needs to handle NA for desc::desc_get()",
         "# Update href of toc anchors , use \"-\" instead \".\"", 
         "# Keep something$col or here_you::must_stay")
expect <- c("Re-loading pkgdown while it's running causes weird behaviour with the context cache don't",
            "Needs to handle NA for desc::desc_get()",
            "Update href of toc anchors use instead",
            "Keep something$col or here_you::must_stay")

leadspace <- grepl("^ ", txt)
gre <- gregexpr("\\b(\\s?[[:alpha:]]*(::|[-'_$.])?[[:alpha:]]*(\\(\\))?)\\b", txt)
regmatches(txt, gre, invert = TRUE) <- ""
txt[!leadspace] <- gsub("^ ", "", txt[!leadspace])
identical(expect, txt)
# [1] TRUE

Upvotes: 0
