rio

Reputation: 9

How to extract text from a string using a regular expression?

I scraped the following text from a website:

\n\tsfw=1;\n\tcaptions=[[\"2017-05\",2850],[\"2017-06\",3450],[\"2017-07\",3350],[\"2017-08\",3650],[\"2017-09\",4250],[\"2017-10\",4600],[\"2017-11\",4750],[\"2017-12\",5500],[\"2018-01\",7300],[\"2018-02\",7700],[\"2018-03\",11350],[\"2018-04\",10900],[\"2018-05\",11500],[\"2018-06\",10800],[\"2018-07\",13200],[\"2018-08\",14200],[\"2018-09\",15900],[\"2018-10\",19700],[\"2018-11\",21800],[\"2018-12\",18300],[\"2019-01\",18550],[\"2019-02\",18150],[\"2019-03\",18050],[\"2019-04\",18850],[\"2019-05\",71000],[\"2019-06\",83200],[\"2019-07\",72650],[\"2019-08\",80400],[\"2019-09\",100600],[\"2019-10\",114000],[\"2019-11\",110250],[\"2019-12\",107100],[\"2020-01\",116050],[\"2020-02\",117950],[\"2020-03\",145350]];\n

I want to extract the numeric values such as "2850", "3450", etc. Could you show me how to write a regular expression for this? Thanks.

Here is the code I used to scrape the text in R:

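library(rvest)  # provides read_html(), html_nodes(), html_text() and the %>% pipe

webpage <- read_html("https://imgflip.com/meme/Drake-Hotline-Bling")
text <- html_nodes(webpage, xpath = '//script') %>% html_text()
text <- text[8]  # keep the 8th <script> element, which contains the text shown above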

Upvotes: 0

Views: 56

Answers (1)

morgan121

Reputation: 2253

You can use str_match_all from the stringr package. It captures everything between two delimiters you choose (here the opening delimiter is \", and the closing delimiter is ]).

The generic form of this function is: stringr::str_match_all(vector, "delimiter(.*?)delimiter") (note: use stringr::str_match to match only the first instance).
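
To see why the result gets indexed with [[1]][, 2], it can help to look at what the two functions return. This is a small sketch on a made-up string (x and its angle-bracket delimiters are illustrative, not part of the scraped data): both functions return a character matrix whose first column is the full match and whose second column is the capture group, and str_match_all wraps one such matrix per input string in a list.

```r
library(stringr)

# Hypothetical toy input, not the scraped text
x <- "a<1>b<22>c<333>"

# str_match(): first match only, one row per input string.
# Column 1 is the full match "<1>", column 2 is the capture group "1".
first <- str_match(x, "<(.*?)>")

# str_match_all(): a list with one matrix per input string,
# one row per match. [[1]] picks the matrix for the first string,
# [, 2] picks the capture-group column.
all_matches <- str_match_all(x, "<(.*?)>")
captured <- all_matches[[1]][, 2]   # "1" "22" "333"
```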

vec <- '\n\tsfw=1;\n\tcaptions=[[\"2017-05\",2850],[\"2017-06\",3450],[\"2017-07\",3350],[\"2017-08\",3650],[\"2017-09\",4250],[\"2017-10\",4600],[\"2017-11\",4750],[\"2017-12\",5500],[\"2018-01\",7300],[\"2018-02\",7700],[\"2018-03\",11350],[\"2018-04\",10900],[\"2018-05\",11500],[\"2018-06\",10800],[\"2018-07\",13200],[\"2018-08\",14200],[\"2018-09\",15900],[\"2018-10\",19700],[\"2018-11\",21800],[\"2018-12\",18300],[\"2019-01\",18550],[\"2019-02\",18150],[\"2019-03\",18050],[\"2019-04\",18850],[\"2019-05\",71000],[\"2019-06\",83200],[\"2019-07\",72650],[\"2019-08\",80400],[\"2019-09\",100600],[\"2019-10\",114000],[\"2019-11\",110250],[\"2019-12\",107100],[\"2020-01\",116050],[\"2020-02\",117950],[\"2020-03\",145350]];\n'

stringr::str_match_all(vec, "\",(.*?)]")[[1]][, 2]
[1]  "2850"   "3450"   "3350"   "3650"   "4250"   "4600"   "4750"   "5500"   "7300"   "7700"   "11350"  "10900"  "11500"  "10800"  "13200"  "14200"  "15900" 
[18] "19700"  "21800"  "18300"  "18550"  "18150"  "18050"  "18850"  "71000"  "83200"  "72650"  "80400"  "100600" "114000" "110250" "107100" "116050" "117950"
[35] "145350"
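
Note that the values come back as character strings. If you also need the months, or want the counts as numbers, a pattern with two capture groups may be more convenient. This is only a sketch, assuming every entry has the form ["YYYY-MM",digits]; the input is shortened to the first three entries, and the names m and captions are just illustrative.

```r
library(stringr)

# Shortened sample of the scraped string (first three entries only)
vec <- '\n\tsfw=1;\n\tcaptions=[["2017-05",2850],["2017-06",3450],["2017-07",3350]];\n'

# Two capture groups: the quoted month, then the digits after the comma
m <- str_match_all(vec, '\\["(\\d{4}-\\d{2})",(\\d+)\\]')[[1]]

# Column 1 is the full match, columns 2 and 3 are the capture groups
captions <- data.frame(
  month = m[, 2],
  count = as.numeric(m[, 3]),   # convert from character to numeric
  stringsAsFactors = FALSE
)
```

Matching digits explicitly with \\d+ instead of (.*?) makes the pattern a little stricter, so unrelated text between a quote-comma and a bracket would not be picked up.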

Upvotes: 1
