Reputation: 9
I scraped the following text from a website:
\n\tsfw=1;\n\tcaptions=[[\"2017-05\",2850],[\"2017-06\",3450],[\"2017-07\",3350],[\"2017-08\",3650],[\"2017-09\",4250],[\"2017-10\",4600],[\"2017-11\",4750],[\"2017-12\",5500],[\"2018-01\",7300],[\"2018-02\",7700],[\"2018-03\",11350],[\"2018-04\",10900],[\"2018-05\",11500],[\"2018-06\",10800],[\"2018-07\",13200],[\"2018-08\",14200],[\"2018-09\",15900],[\"2018-10\",19700],[\"2018-11\",21800],[\"2018-12\",18300],[\"2019-01\",18550],[\"2019-02\",18150],[\"2019-03\",18050],[\"2019-04\",18850],[\"2019-05\",71000],[\"2019-06\",83200],[\"2019-07\",72650],[\"2019-08\",80400],[\"2019-09\",100600],[\"2019-10\",114000],[\"2019-11\",110250],[\"2019-12\",107100],[\"2020-01\",116050],[\"2020-02\",117950],[\"2020-03\",145350]];\n
I want to extract the numeric values such as "2850", "3450", etc. Could you show me how to write a regular expression for this? Thanks.
Here is the code I used to scrape the text in R:
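library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%
webpage <- read_html("https://imgflip.com/meme/Drake-Hotline-Bling")
# Collect the text of every <script> node on the page
text <- html_nodes(webpage, xpath = '//script') %>% html_text()
# Keep the eighth script, which contains the captions data shown above
text <- text[8]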
Upvotes: 0
Views: 56
Reputation: 2253
You can use str_match_all from the stringr package. This captures everything between the two delimiters you choose (in this case I chose \", and ]).
The generic form of this function is: stringr::str_match_all(vector, "delimiter(.*?)delimiter") (note: use stringr::str_match to match only the first instance).
vec <- '\n\tsfw=1;\n\tcaptions=[[\"2017-05\",2850],[\"2017-06\",3450],[\"2017-07\",3350],[\"2017-08\",3650],[\"2017-09\",4250],[\"2017-10\",4600],[\"2017-11\",4750],[\"2017-12\",5500],[\"2018-01\",7300],[\"2018-02\",7700],[\"2018-03\",11350],[\"2018-04\",10900],[\"2018-05\",11500],[\"2018-06\",10800],[\"2018-07\",13200],[\"2018-08\",14200],[\"2018-09\",15900],[\"2018-10\",19700],[\"2018-11\",21800],[\"2018-12\",18300],[\"2019-01\",18550],[\"2019-02\",18150],[\"2019-03\",18050],[\"2019-04\",18850],[\"2019-05\",71000],[\"2019-06\",83200],[\"2019-07\",72650],[\"2019-08\",80400],[\"2019-09\",100600],[\"2019-10\",114000],[\"2019-11\",110250],[\"2019-12\",107100],[\"2020-01\",116050],[\"2020-02\",117950],[\"2020-03\",145350]];\n'
stringr::str_match_all(vec, "\",(.*?)]")[[1]][, 2]
[1] "2850" "3450" "3350" "3650" "4250" "4600" "4750" "5500" "7300" "7700" "11350" "10900" "11500" "10800" "13200" "14200" "15900"
[18] "19700" "21800" "18300" "18550" "18150" "18050" "18850" "71000" "83200" "72650" "80400" "100600" "114000" "110250" "107100" "116050" "117950"
[35] "145350"
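If you also need the numbers as numerics paired with their months, one possible extension is to use two capture groups instead of one. This is only a sketch: it runs on a shortened sample containing the first three entries of the scraped text, and the names m and captions are made up for illustration.

```r
library(stringr)

# Shortened sample of the scraped text (first three entries only)
vec <- '\n\tsfw=1;\n\tcaptions=[["2017-05",2850],["2017-06",3450],["2017-07",3350]];\n'

# Two capture groups: the quoted month and the count that follows it.
# str_match_all() returns a list with one matrix per input string;
# column 1 is the full match, columns 2 and 3 are the capture groups.
m <- str_match_all(vec, '"(\\d{4}-\\d{2})",(\\d+)\\]')[[1]]

# Hypothetical result layout: one row per month, counts converted to numeric
captions <- data.frame(
  month = m[, 2],
  count = as.numeric(m[, 3]),
  stringsAsFactors = FALSE
)
print(captions)
```

Matching on \\d+ rather than .*? restricts the capture to digits, so the pattern should be less likely to pick up unrelated text if the script contains other quoted values followed by a comma.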
Upvotes: 1