Reputation: 1285
I already saw a similar question, but it is not quite what I need.
Situation: Using gsub, I want to clean up strings. These are my conditions:
Words joined by any of ' - _ $ . should be kept as one. For example: don't, re-loading, come_home, something$col
Names such as package::function or package::function() should be kept as well.
So, I have the following:
[^A-Za-z]
([a-z]+)(-|'|_|$)([a-z]+)
([a-z]+(_*)[a-z]+)(::)([a-z]+(_*)[a-z]+)(\(\))*
Examples:
If I have the following:
# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't
# Needs to handle NA for desc::desc_get()
# Update href of toc anchors , use "-" instead "."
# Keep something$col or here_you::must_stay
I would like to have
Re-loading pkgdown while it's running causes weird behaviour with the context cache don't
Needs to handle NA for desc::desc_get()
Update href of toc anchors use instead
Keep something$col or here_you::must_stay
Problems: I have several:
A. The second expression is not working properly. Right now, it only works with - or '.
B. How do I combine all of these in a single gsub in R? I want to do something like gsub(myPatterns, myText), but don't know how to fix and combine all of this.
Upvotes: 2
Views: 443
Reputation: 627065
You can use
trimws(gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE))
See the regex demo. Or, to also replace multiple whitespaces with a single space, use
trimws(gsub("\\s{2,}", " ", gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE)))
Details
(?:\w+::\w+(?:\(\))?|\p{L}+(?:[-'_$]\p{L}+)*)(*SKIP)(*F) : match either of the two patterns:
    \w+::\w+(?:\(\))? - 1+ word chars, ::, 1+ word chars and an optional () substring
    | - or
    \p{L}+ - one or more Unicode letters
    (?:[-'_$]\p{L}+)* - 0+ repetitions of -, ', _ or $ and then 1+ Unicode letters
    (*SKIP)(*F) - omits and skips the match
| - or
[^\p{L}\s] - any char but a Unicode letter and whitespace
See the R demo:
myText <- c("# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't",
"# Needs to handle NA for desc::desc_get()",
'# Update href of toc anchors , use "-" instead "."',
"# Keep something$col or here_you::must_stay")
trimws(gsub("\\s{2,}", " ", gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE)))
Output:
[1] "Re-loading pkgdown while it's running causes weird behaviour with the context cache don't"
[2] "Needs to handle NA for desc::desc_get()"
[3] "Update href of toc anchors use instead"
[4] "Keep something$col or here_you::must_stay"
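To see what the (*SKIP)(*F) branch contributes, here is a minimal sketch that runs the same pattern on a shorter string and compares it with a plain "remove every non-letter" replacement. It also checks one likely reason the second expression in the question misbehaves with $: outside a character class an unescaped $ is an end-of-string anchor, while inside [-'_$] it is a literal. The input string and the names x, skip_pat, squash, naive, kept, anchor_hit and class_hit are made up for this illustration.

```r
# Illustrative input, not taken from the question.
x <- "it's 100% re-usable, see pkg::fun()!"

# Same pattern as in the answer above.
skip_pat <- "(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]"

# Helper: collapse runs of whitespace and trim both ends.
squash <- function(s) trimws(gsub("\\s{2,}", " ", s))

# Without the protected branch, every non-letter goes, including ' - and ::
naive <- squash(gsub("[^\\p{L}\\s]", "", x, perl = TRUE))

# With it, joined words and pkg::fun() are matched first and then skipped,
# so the final alternative never gets to delete their punctuation.
kept <- squash(gsub(skip_pat, "", x, perl = TRUE))

print(naive)  # "its reusable see pkgfun"
print(kept)   # "it's re-usable see pkg::fun()"

# `$` in an alternation is an anchor, so this cannot match "something$col" ...
anchor_hit <- grepl("([a-z]+)(-|'|_|$)([a-z]+)", "something$col")
# ... while `$` inside a character class is a literal dollar sign.
class_hit <- grepl("[a-z]+[-'_$][a-z]+", "something$col")

print(c(anchor_hit, class_hit))  # FALSE TRUE
```

The digits and the % sign are still removed in both cases, because neither protected sub-pattern matches them.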
Upvotes: 4
Reputation: 160577
Alternatively,
txt <- c("# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't",
"# Needs to handle NA for desc::desc_get()",
"# Update href of toc anchors , use \"-\" instead \".\"",
"# Keep something$col or here_you::must_stay")
expect <- c("Re-loading pkgdown while it's running causes weird behaviour with the context cache don't",
"Needs to handle NA for desc::desc_get()",
"Update href of toc anchors use instead",
"Keep something$col or here_you::must_stay")
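# Remember which inputs already start with a space
leadspace <- grepl("^ ", txt)
# Match: optional space, letters, optional joiner (:: or one of - ' _ $ .), letters, optional ()
gre <- gregexpr("\\b(\\s?[[:alpha:]]*(::|[-'_$.])?[[:alpha:]]*(\\(\\))?)\\b", txt)
# Blank out everything that was NOT matched
regmatches(txt, gre, invert = TRUE) <- ""
# Drop the leading space picked up by the first match, unless the input started with one
txt[!leadspace] <- gsub("^ ", "", txt[!leadspace])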
identical(expect, txt)
# [1] TRUE
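The key step in this approach is the assignment form of regmatches with invert = TRUE, which replaces the parts of the string that the pattern did not match. A minimal sketch on a toy string, assuming all that is wanted is to keep the lowercase letters (the input and the names x, m, kept and y are illustrative only):

```r
x <- "a1b22c"                 # illustrative input
m <- gregexpr("[a-z]", x)     # locate the pieces to keep

# Extracting the matches and gluing them together ...
kept <- paste(regmatches(x, m)[[1]], collapse = "")

# ... should give the same result as blanking out the unmatched text in place.
y <- x
regmatches(y, m, invert = TRUE) <- ""

print(kept)  # "abc"
print(y)     # "abc"
```

The in-place form is convenient when txt is a vector, since each element is rewritten without an explicit loop.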
Upvotes: 0