wyatt

Reputation: 383

Adding a comma and quotation marks to a specific text

I'm not sure what category my question falls into. I have a text that follows the pattern below.

1 MERRILL LYNCH 33 2 LEHMAN BROTHERS HLDGS. 82 3 SALOMON 149 4 PAINE WEBBER GROUP 248 5 BEAR STEARNS 328 6 CHARLES SCHWAB 621 7 A.G. EDWARDS & SONS 823

The pattern is a sequence number (starting from 1), a company name (which may contain characters or numbers), and a number (at most 1000), repeated.

I want to build a function that turns this text into a vector:

c("1 MERRILL LYNCH 33", "2 LEHMAN BROTHERS HLDGS. 82", "3 SALOMON 149", 
  "4 PAINE WEBBER GROUP 248", "5 BEAR STEARNS 328", "6 CHARLES SCHWAB 621", 
  "7 A.G. EDWARDS & SONS 823")

Would this be possible? There's no regularity in the company name or the number that follows. There's always a space after the sequence number and a space after the company name. I can provide more information if necessary.

Upvotes: 1

Views: 70

Answers (3)

akrun

Reputation: 887173

Another option is strsplit from base R, splitting on the whitespace that sits between two digits:

strsplit(txt, "(?<=[0-9])\\s+(?=[0-9])", perl = TRUE)[[1]]
#[1] "1 MERRILL LYNCH 33"          "2 LEHMAN BROTHERS HLDGS. 82" "3 SALOMON 149"
#[4] "4 PAINE WEBBER GROUP 248"    "5 BEAR STEARNS 328"          "6 CHARLES SCHWAB 621"
#[7] "7 A.G. EDWARDS & SONS 823"
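Since the question asks for a function, here is a minimal sketch that wraps the strsplit call and annotates the lookaround pattern. The name split_records is hypothetical, and txt is assumed to hold the sample text from the question.

```r
# Sample input copied from the question.
txt <- "1 MERRILL LYNCH 33 2 LEHMAN BROTHERS HLDGS. 82 3 SALOMON 149 4 PAINE WEBBER GROUP 248 5 BEAR STEARNS 328 6 CHARLES SCHWAB 621 7 A.G. EDWARDS & SONS 823"

# split_records() is a hypothetical helper name, not part of base R.
# (?<=[0-9]) : the separator must be preceded by a digit (end of a record)
# \\s+       : the whitespace that is consumed as the separator
# (?=[0-9])  : the separator must be followed by a digit (next sequence number)
split_records <- function(x) {
  strsplit(x, "(?<=[0-9])\\s+(?=[0-9])", perl = TRUE)[[1]]
}

records <- split_records(txt)
length(records)  # number of records found
records[3]       # third record
```

Like the one-liner, this assumes that a digit, whitespace, digit sequence only occurs at a record boundary, so a company name that starts with a digit would be split in the wrong place.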

Another base R option is gsub with scan:

scan(text = gsub("(\\d+) (\\d+)", "\\1,\\2", txt), what = "", sep=",", quiet = TRUE)
#[1] "1 MERRILL LYNCH 33"          "2 LEHMAN BROTHERS HLDGS. 82" "3 SALOMON 149"
#[4] "4 PAINE WEBBER GROUP 248"    "5 BEAR STEARNS 328"          "6 CHARLES SCHWAB 621"
#[7] "7 A.G. EDWARDS & SONS 823"
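To make the two steps of the gsub and scan approach visible, the sketch below runs them separately on a shortened copy of the sample text. The variable names with_commas and records are only illustrative.

```r
# Shortened sample (first three records from the question).
txt <- "1 MERRILL LYNCH 33 2 LEHMAN BROTHERS HLDGS. 82 3 SALOMON 149"

# Step 1: wherever two numbers are separated by a single space,
# replace that space with a comma. This marks the record boundaries.
with_commas <- gsub("(\\d+) (\\d+)", "\\1,\\2", txt)
with_commas

# Step 2: let scan() read the comma-separated fields as character strings.
records <- scan(text = with_commas, what = "", sep = ",", quiet = TRUE)
records
```

Because the comma is used as the field separator, a company name that itself contains a comma would be split into two fields; in that case a separator that cannot occur in the data would be a safer choice.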

Upvotes: 1

acylam

Reputation: 18681

Analogous to @Remko Duursma's answer, here is the base R version:

regmatches(txt, gregexpr("[0-9]+[^0-9]+[0-9]+", txt))[[1]]

Results:

[1] "1 MERRILL LYNCH 33"          "2 LEHMAN BROTHERS HLDGS. 82"
[3] "3 SALOMON 149"               "4 PAINE WEBBER GROUP 248"   
[5] "5 BEAR STEARNS 328"          "6 CHARLES SCHWAB 621"       
[7] "7 A.G. EDWARDS & SONS 823"
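The question mentions that a company name may contain numbers, which this pattern cannot handle. As a rough sketch, the extraction can be wrapped in a function that checks whether the leading sequence numbers really run 1, 2, 3, ... and warns otherwise. The name extract_records and the warning text are hypothetical.

```r
# Sample input copied from the question.
txt <- "1 MERRILL LYNCH 33 2 LEHMAN BROTHERS HLDGS. 82 3 SALOMON 149 4 PAINE WEBBER GROUP 248 5 BEAR STEARNS 328 6 CHARLES SCHWAB 621 7 A.G. EDWARDS & SONS 823"

# extract_records() is a hypothetical helper, not from the original answer.
extract_records <- function(x) {
  # One record = digits, then non-digits, then digits.
  records <- regmatches(x, gregexpr("[0-9]+[^0-9]+[0-9]+", x))[[1]]

  # Leading sequence number of each extracted record.
  seq_no <- as.integer(sub("^([0-9]+).*$", "\\1", records))

  # If the numbers are not 1, 2, 3, ..., the pattern probably cut a record
  # in the wrong place, e.g. because a company name contained digits.
  if (!identical(seq_no, seq_along(records))) {
    warning("sequence numbers are not 1..n; a company name may contain digits")
  }
  records
}

records <- extract_records(txt)
records
```

The check does not repair a bad split; it only flags input that violates the assumption behind the regular expression.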

Upvotes: 2

Remko Duursma

Reputation: 2821

Using the stringr package,

library(stringr)
str_extract_all(txt, "[0-9]+\\D+[0-9]+")

The regular expression reads 'one or more digits', then 'one or more non-digit characters', then 'one or more digits', so it assumes the company name itself contains no digits.

This gives

[[1]]
[1] "1 MERRILL LYNCH 33"          "2 LEHMAN BROTHERS HLDGS. 82" "3 SALOMON 149"              
[4] "4 PAINE WEBBER GROUP 248"    "5 BEAR STEARNS 328"          "6 CHARLES SCHWAB 621"       
[7] "7 A.G. EDWARDS & SONS 823"

Note that the result is a list; append [[1]] to the call to get the character vector.

Upvotes: 4
