Reputation: 554
I have a character vector that I need to clean. Specifically, I want to remove the number that comes before the word "Votes." Note that the number has a comma to separate thousands, so it's easier to treat it as a string.
I know that gsub("*. Votes","", text) will remove everything, but how do I just remove the number? Also, how do I collapse the repeated spaces into just one space?
Thanks for any help you might have!
Example data:
text <- "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? 558,586 Votes"
Upvotes: 2
Views: 2766
Reputation: 4294
Easiest way is with stringr
:
> library(stringr)
> regexp <- "-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+"
> str_extract(text,regexp)
[1] "558,586 Votes"
To do the same thing but extract only the number, wrap it in gsub
:
> gsub('\\s+[[:alpha:]]+', '', str_extract(text,regexp))
[1] "558,586"
Here's a version that will strip out all numbers before the word "Votes" even if they have commas or periods in it:
> gsub('\\s+[[:alpha:]]+', '', unlist(regmatches (text,gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+",text) )) )
[1] "558,586"
If you want the label too, then just throw out the gsub
part:
> unlist(regmatches (text,gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+",text) ))
[1] "558,586 Votes"
And if you want to pull out all the numbers:
> unlist(regmatches (text,gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]*",text) ))
[1] "1" "15" "202" "558,586"
Upvotes: 0
Reputation: 626738
You may use
text <- "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? 558,586 Votes"
trimws(gsub("(\\s){2,}|\\d[0-9,]*\\s*(Votes)", "\\1\\2", text))
# => [1] "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? Votes"
See the online R demo and the online regex demo.
Details
(\\s){2,}
- matches 2 or more whitespace chars while capturing the last occurrence that will be reinserted using the \1
placeholder in the replacement pattern|
- or\\d
- a digit[0-9,]*
- 0 or more digits or commas\\s*
- 0+ whitespace chars(Votes)
- Group 2 (will be restored in the output using the \2
placeholder): a Votes
substring.Note that trimws
will remove any leading/trailing whitespace.
Upvotes: 2