Reputation: 383
I'm not sure what category my question falls into. I have a text that follows a pattern like the one below.
1 MERRILL LYNCH 33 2 LEHMAN BROTHERS HLDGS. 82 3 SALOMON 149 4 PAINE WEBBER GROUP 248 5 BEAR STEARNS 328 6 CHARLES SCHWAB 621 7 A.G. EDWARDS & SONS 823
The pattern is a sequence number (counting up from 1), a company name (made up of characters or digits), and a number (at most 1000), repeated.
I want to build a function that turns this text into a vector:
c("1 MERRILL LYNCH 33", "2 LEHMAN BROTHERS HLDGS. 82", "3 SALOMON 149",
"4 PAINE WEBBER GROUP 248", "5 BEAR STEARNS 328", "6 CHARLES SCHWAB 621",
"7 A.G. EDWARDS & SONS 823")
Would this be possible? There's no regularity in the company name or the number that follows. There's always a space after the leading sequence number and a space after the company name. I can provide more information if necessary.
Upvotes: 1
Views: 70
Reputation: 887173
Another option is strsplit from base R:
strsplit(txt, "(?<=[0-9])\\s+(?=[0-9])", perl = TRUE)[[1]]
#[1] "1 MERRILL LYNCH 33" "2 LEHMAN BROTHERS HLDGS. 82" "3 SALOMON 149"
#[4] "4 PAINE WEBBER GROUP 248" "5 BEAR STEARNS 328" "6 CHARLES SCHWAB 621"
#[7] "7 A.G. EDWARDS & SONS 823"
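To make the split above reproducible, here is a minimal self-contained sketch. It assumes txt holds the sample string from the question (the answer itself does not show how txt was defined). The pattern matches whitespace only when a digit sits directly before it (lookbehind) and directly after it (lookahead), so the digits are kept in the pieces. One caveat: a company name that itself starts with a digit would also trigger a split.

```r
# Assumption: txt is the sample text from the question.
txt <- "1 MERRILL LYNCH 33 2 LEHMAN BROTHERS HLDGS. 82 3 SALOMON 149 4 PAINE WEBBER GROUP 248 5 BEAR STEARNS 328 6 CHARLES SCHWAB 621 7 A.G. EDWARDS & SONS 823"

# (?<=[0-9])  lookbehind: previous character is a digit (not consumed)
# \\s+        the whitespace that is actually removed by the split
# (?=[0-9])   lookahead: next character is a digit (not consumed)
parts <- strsplit(txt, "(?<=[0-9])\\s+(?=[0-9])", perl = TRUE)[[1]]

length(parts)  # 7
parts[3]       # "3 SALOMON 149"
```

perl = TRUE is needed because lookbehind and lookahead are not supported by R's default regex engine.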
Another base R option uses gsub and scan:
scan(text = gsub("(\\d+) (\\d+)", "\\1,\\2", txt), what = "", sep=",", quiet = TRUE)
#[1] "1 MERRILL LYNCH 33" "2 LEHMAN BROTHERS HLDGS. 82" "3 SALOMON 149"
#[4] "4 PAINE WEBBER GROUP 248" "5 BEAR STEARNS 328"
#[6] "6 CHARLES SCHWAB 621" "7 A.G. EDWARDS & SONS 823"
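The one-liner above does two things, which may be easier to follow when written out as separate steps. This is only a sketch: txt is assumed to be the sample string from the question, and the approach assumes no company name contains a comma or a quote character, since scan would treat those specially. The strsplit variant at the end is an alternative that avoids scan's quote handling.

```r
# Assumption: txt is the sample text from the question.
txt <- "1 MERRILL LYNCH 33 2 LEHMAN BROTHERS HLDGS. 82 3 SALOMON 149 4 PAINE WEBBER GROUP 248 5 BEAR STEARNS 328 6 CHARLES SCHWAB 621 7 A.G. EDWARDS & SONS 823"

# Step 1: wherever a number is followed by a space and another number,
# replace that space with a comma ("33 2" becomes "33,2").
marked <- gsub("(\\d+) (\\d+)", "\\1,\\2", txt)

# Step 2: read the marked string as comma-separated character fields.
parts <- scan(text = marked, what = "", sep = ",", quiet = TRUE)

# Variant: split on the literal comma instead of using scan().
parts_alt <- strsplit(marked, ",", fixed = TRUE)[[1]]

length(parts)  # 7
```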
Upvotes: 1
Reputation: 18681
Analogous to @Remeko Duursma's answer, here is the base R version:
regmatches(txt, gregexpr("[0-9]+[^0-9]+[0-9]+", txt))[[1]]
Results:
[1] "1 MERRILL LYNCH 33" "2 LEHMAN BROTHERS HLDGS. 82"
[3] "3 SALOMON 149" "4 PAINE WEBBER GROUP 248"
[5] "5 BEAR STEARNS 328" "6 CHARLES SCHWAB 621"
[7] "7 A.G. EDWARDS & SONS 823"
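Since the question asks for a function, the call above can be wrapped in one. The sketch below is only illustrative: the name split_records is made up, and txt is assumed to be the sample string from the question. Like the pattern it wraps, it assumes company names contain no digits.

```r
# Hypothetical helper (name is an assumption, not from the answer):
# returns each "sequence name number" record as one vector element.
split_records <- function(x) {
  # [0-9]+   one or more digits (the sequence number)
  # [^0-9]+  one or more non-digits (spaces and the company name)
  # [0-9]+   one or more digits (the trailing number)
  regmatches(x, gregexpr("[0-9]+[^0-9]+[0-9]+", x))[[1]]
}

# Assumption: txt is the sample text from the question.
txt <- "1 MERRILL LYNCH 33 2 LEHMAN BROTHERS HLDGS. 82 3 SALOMON 149 4 PAINE WEBBER GROUP 248 5 BEAR STEARNS 328 6 CHARLES SCHWAB 621 7 A.G. EDWARDS & SONS 823"

records <- split_records(txt)
length(records)  # 7
```

If the input contains no match at all, regmatches returns an empty character vector, so the helper degrades to character(0) rather than raising an error.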
Upvotes: 2
Reputation: 2821
Using the stringr package:
library(stringr)
str_extract_all(txt, "[0-9]+\\D+[0-9]+")
The regular expression reads 'one or more digits', then 'one or more non-digit characters', then 'one or more digits'. This gives
[[1]]
[1] "1 MERRILL LYNCH 33" "2 LEHMAN BROTHERS HLDGS. 82" "3 SALOMON 149"
[4] "4 PAINE WEBBER GROUP 248" "5 BEAR STEARNS 328" "6 CHARLES SCHWAB 621"
[7] "7 A.G. EDWARDS & SONS 823"
Note that the result is a list.
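Because the question asks for a vector, the list can be unwrapped by taking its first element. A minimal sketch, assuming the stringr package is installed and txt is the sample string from the question:

```r
library(stringr)

# Assumption: txt is the sample text from the question.
txt <- "1 MERRILL LYNCH 33 2 LEHMAN BROTHERS HLDGS. 82 3 SALOMON 149 4 PAINE WEBBER GROUP 248 5 BEAR STEARNS 328 6 CHARLES SCHWAB 621 7 A.G. EDWARDS & SONS 823"

# str_extract_all() returns a list with one element per input string.
res <- str_extract_all(txt, "[0-9]+\\D+[0-9]+")

# txt is a single string, so the first list element holds all matches.
vec <- res[[1]]

is.list(res)       # TRUE
is.character(vec)  # TRUE
length(vec)        # 7
```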
Upvotes: 4