ModalBro

Reputation: 554

Extract numbers after a pattern in vector of characters

I'm trying to extract values from a character vector of about 2,300 strings. Each string follows the pattern of the example below:

"733|Overall (-2 to 2): _________2________________|How controversial is each sentence (1-5)?|Sent. 1 (ANALYSIS BY...): ________1__________|Sent. 2 (Bail is...): ____3______________|Sent. 3 (2) A...): _______1___________|Sent. 4 (3) A...): _______1___________|Sent. 5 (Proposition 100...): _______5___________|Sent. 6 (In 2006,...): _______3___________|Sent. 7 (That legislation...): ________1__________|Types of bias (check all that apply):|Pro   Anti|X      O   Word use (bold, add alternate)|X      O   Examples (italicize)|O      O   Extra information (underline)|X      O   Any other bias (explain below)|Last sentence makes it sound like an urgent matter.|____________________________________________|NA|undocumented, without a visa|NA|NA|NA|NA|NA|NA|NA|NA|"  

What I'd like is to extract the numbers following the pattern "Sent. " and place them into a separate vector. For the example, I'd like to extract "1311531".

I'm having trouble using gsub to accomplish this.

Upvotes: 1

Views: 612

Answers (3)

Ronak Shah

Reputation: 388862

We can use str_match_all from stringr to get all the numbers that follow "Sent".

library(stringr)

# `text` holds the example string from the question
str_match_all(text, "Sent.*?_+(\\d+)")[[1]][, 2]
#[1] "1" "3" "1" "1" "5" "3" "1"
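To get one concatenated string per element for the whole vector (e.g. "1311531" for the example), the per-string matches can be collapsed with paste. This is only a sketch: the vector `x` below is a shortened, made-up stand-in for the real data, and `ratings` is a hypothetical name.

```r
library(stringr)

# Hypothetical, shortened input in the same layout as the question's strings
x <- c(
  "733|Overall (-2 to 2): ____2____|Sent. 1 (ANALYSIS BY...): ___1___|Sent. 2 (Bail is...): __3____|NA|",
  "734|Overall (-2 to 2): ____0____|Sent. 1 (Some text...): ___4___|Sent. 2 (More...): __2____|Sent. 3 (End...): _5_|NA|"
)

# str_match_all() returns one matrix per input string;
# column 2 holds the captured digits for each "Sent" field
matches <- str_match_all(x, "Sent.*?_+(\\d+)")

# Collapse the captured digits of each string into a single value
ratings <- vapply(matches, function(m) paste(m[, 2], collapse = ""), character(1))
ratings
```

One thing to watch: because `.*?` can run across `|`, a "Sent" field with no number filled in would pick up the next number it finds in a later field.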

Upvotes: 1

Orlando Sabogal

Reputation: 1630

library(stringr)

Data <- c("PASTE YOUR WHOLE STRING")

# Locate each "Sent. ..." field up to the last underscore before its rating.
# Matching only "Sent. " would stop right before the sentence index (1, 2, 3, ...),
# so the pattern has to run through the leading underscores.
Reference <- as.data.frame(str_locate_all(Data, "Sent\\.[^|]*?: _+")[[1]])
names(Reference) #Returns [1] "start" "end"

# The rating is the single character right after each match
Positions <- Reference$end + 1
YourNumbers <- paste(substring(Data, Positions, Positions), collapse = "")

YourNumbers #Returns "1311531" for the example string in the question

Upvotes: 2

Maurits Evers

Reputation: 50668

A base R option using strsplit and sub:

lapply(strsplit(ss, "\\|"), function(x)
    sub("Sent.+: _+(\\d+)_+", "\\1", x[grepl("^Sent", x)]))
#[[1]]
#[1] "1" "3" "1" "1" "5" "3" "1"

Sample data

ss <- "733|Overall (-2 to 2): _________2________________|How controversial is each sentence (1-5)?|Sent. 1 (ANALYSIS BY...): ________1__________|Sent. 2 (Bail is...): ____3______________|Sent. 3 (2) A...): _______1___________|Sent. 4 (3) A...): _______1___________|Sent. 5 (Proposition 100...): _______5___________|Sent. 6 (In 2006,...): _______3___________|Sent. 7 (That legislation...): ________1__________|Types of bias (check all that apply):|Pro   Anti|X      O   Word use (bold, add alternate)|X      O   Examples (italicize)|O      O   Extra information (underline)|X      O   Any other bias (explain below)|Last sentence makes it sound like an urgent matter.|____________________________________________|NA|undocumented, without a visa|NA|NA|NA|NA|NA|NA|NA|NA|"
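If a single string per element is wanted rather than a list of digits, the same strsplit/sub idea can be wrapped in a small helper and applied with vapply. This is a rough sketch, not tested on the full data: `extract_ratings`, `ratings` and the shortened input strings are made up for illustration.

```r
# Hypothetical, shortened input; the last element has no "Sent." fields at all
ss <- c(
  "733|Overall (-2 to 2): ____2____|Sent. 1 (ANALYSIS BY...): ___1___|Sent. 2 (Bail is...): __3____|NA|",
  "734|Overall (-2 to 2): ____0____|Sent. 1 (Some text...): ___4___|Sent. 2 (More...): __2____|Sent. 3 (End...): _5_|NA|",
  "no ratings here|NA|"
)

# Hypothetical helper: split one string on "|", keep the "Sent" fields,
# pull the digits out of each, and paste them together
extract_ratings <- function(s) {
  fields <- strsplit(s, "|", fixed = TRUE)[[1]]
  sent <- fields[grepl("^Sent", fields)]
  paste(sub("Sent.+: _+(\\d+)_+", "\\1", sent), collapse = "")
}

# One concatenated string per input element; "" when nothing matched
ratings <- vapply(ss, extract_ratings, character(1), USE.NAMES = FALSE)
ratings
```

Using vapply with character(1) keeps the result a plain character vector of the same length as the input, which makes it easy to add as a column next to the original strings.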

Upvotes: 1
