ModalBro

Reputation: 554

Remove the string before a certain word with R

I have a character vector that I need to clean. Specifically, I want to remove the number that comes before the word "Votes." Note that the number has a comma to separate thousands, so it's easier to treat it as a string.

I know that gsub(".* Votes", "", text) will remove everything, but how do I remove just the number? Also, how do I collapse the repeated spaces into a single space?

Thanks for any help you might have!

Example data:

text <- "STATE QUESTION NO. 1                       Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee?                    558,586 Votes"

Upvotes: 2

Views: 2766

Answers (2)

mysteRious

Reputation: 4294

The easiest way is with stringr, which can extract the number together with its label:

> library(stringr)
> regexp <- "-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+"
> str_extract(text,regexp)
[1] "558,586 Votes"

To do the same thing but extract only the number, wrap it in gsub:

> gsub('\\s+[[:alpha:]]+', '', str_extract(text,regexp))
[1] "558,586"
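
As an alternative to extracting the label and then stripping it, a lookahead can match the number only when "Votes" follows it. This is a minimal sketch in base R, not part of the original answer; `sample_text` is a shortened, hypothetical stand-in for the string in the question, and the pattern assumes the number consists only of digits and commas.

```r
# Shortened, hypothetical sample; the full string from the question behaves the same way.
sample_text <- "Chapter 202 of the Nevada Revised Statutes          558,586 Votes"

# (?=\\s+Votes) is a lookahead: it requires whitespace plus "Votes" to follow,
# but does not include them in the match. Lookaheads need perl = TRUE in base R.
m <- regexpr("[0-9][0-9,]*(?=\\s+Votes)", sample_text, perl = TRUE)
number <- regmatches(sample_text, m)

print(number)
```

Because the lookahead consumes nothing, no second gsub call is needed to drop the label, and "202" is skipped since it is not followed by "Votes".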

Here's a base R version that pulls out any number before the word "Votes", even if it contains commas or periods:

> gsub('\\s+[[:alpha:]]+', '', unlist(regmatches(text, gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+", text))))
[1] "558,586"

If you want the label too, then just throw out the gsub part:

> unlist(regmatches(text, gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]* Votes+", text)))
[1] "558,586 Votes"

And if you want to pull out all the numbers:

> unlist(regmatches(text, gregexpr("-?[[:digit:]]+\\.*,*[[:digit:]]*\\.*,*[[:digit:]]*", text)))
[1] "1"       "15"      "202"     "558,586"

Upvotes: 0

Wiktor Stribiżew

Reputation: 626738

You may use

text <- "STATE QUESTION NO. 1                       Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee?                    558,586 Votes"
trimws(gsub("(\\s){2,}|\\d[0-9,]*\\s*(Votes)", "\\1\\2", text))
# => [1] "STATE QUESTION NO. 1 Amendment to Title 15 of the Nevada Revised Statutes Shall Chapter 202 of the Nevada Revised Statutes be amended to prohibit, except in certain circumstances, a person from selling or transferring a firearm to another person unless a federally-licensed dealer first conducts a federal background check on the potential buyer or transferee? Votes"

Details

  • (\\s){2,} - matches 2 or more whitespace chars, capturing the last one into Group 1, which is reinserted using the \1 placeholder in the replacement pattern
  • | - or
  • \\d - a digit
  • [0-9,]* - 0 or more digits or commas
  • \\s* - 0+ whitespace chars
  • (Votes) - Group 2 (will be restored in the output using the \2 placeholder): a Votes substring.

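Only one alternative matches at a time, so the other group is empty and the "\\1\\2" replacement inserts either a single whitespace char or Votes. Note that trimws will remove any leading/trailing whitespace.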
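
If the single alternation feels dense, the same cleanup can be split into two passes. The following is a minimal sketch in base R under stated assumptions: `sample_text` is a shortened, hypothetical stand-in for the question's string, and the number before "Votes" is assumed to contain only digits and commas.

```r
# Shortened, hypothetical sample in the same shape as the question's data.
sample_text <- "STATE QUESTION NO. 1          Amendment to Title 15?          558,586 Votes"

# Step 1: drop the number (digits and commas) that directly precedes "Votes",
# keeping the word itself via the capture group.
no_number <- sub("[0-9][0-9,]*\\s+(Votes)", "\\1", sample_text)

# Step 2: collapse each run of whitespace to one space, then trim both ends.
cleaned <- trimws(gsub("\\s+", " ", no_number))

print(cleaned)
```

Unlike the one-pass pattern, step 2 always inserts a plain space, so a run that ends in a tab or newline is still reduced to a space. The other numbers ("1", "15") are left alone because they are not followed by "Votes".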

Upvotes: 2
