Reputation: 11
I have this project where I am trying to extract numbers from a web page. Below is an example of the text I am trying to parse.
"\naAreas = [\n[107447478,2490,1925,559,1016,962,0,18,0,1,110,],\n[107447440,2366,1800,565,1033,811,1,46,0,0,23,],\n[107447521,2933,2396,543,921,1566,0,11,0,0,115,]\n];\naRoutes = [\n];\n$(function() {\n $(\".typeTip\").attr(\"title\", \"T=Trad, S=Sport, TR=Toprope\");\n showTips();\n});\n"
I am looking for the first number in each row (shown in bold in the original post), that is, the digits between "\n[" and the first ",". I am trying to do this with the stringr package in R, but I'm not very familiar with regular expressions and I'm striking out.
Upvotes: 0
Views: 56
Reputation: 78842
stringr builds on stringi. This is a different approach using stringi and V8, since you've got JavaScript there:
library(V8)
library(stringi)
js <- "\naAreas = [\n[107447478,2490,1925,559,1016,962,0,18,0,1,110,],\n[107447440,2366,1800,565,1033,811,1,46,0,0,23,],\n[107447521,2933,2396,543,921,1566,0,11,0,0,115,]\n];\naRoutes = [\n];\n$(function() {\n $(\".typeTip\").attr(\"title\", \"T=Trad, S=Sport, TR=Toprope\");\n showTips();\n});\n"
ctx <- v8()
We have to remove the jQuery bits since V8 can't deal with those, but once we do we can evaluate it as javascript:
ctx$eval(sprintf("var %s ", paste0(stri_split_lines(js)[[1]][2:6], collapse="\n")))
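To see what that call hands to V8, it can help to look at the split lines on their own. This is only a sketch, assuming the same js string as above; the names lines and snippet are introduced here purely for illustration:

```r
library(stringi)

js <- "\naAreas = [\n[107447478,2490,1925,559,1016,962,0,18,0,1,110,],\n[107447440,2366,1800,565,1033,811,1,46,0,0,23,],\n[107447521,2933,2396,543,921,1566,0,11,0,0,115,]\n];\naRoutes = [\n];\n$(function() {\n $(\".typeTip\").attr(\"title\", \"T=Trad, S=Sport, TR=Toprope\");\n showTips();\n});\n"

# Split the script text into individual lines. The text starts with a
# newline, so the first element is an empty string.
lines <- stri_split_lines(js)[[1]]

# Lines 2 to 6 hold the aAreas assignment, from "aAreas = [" to "];".
snippet <- paste0(lines[2:6], collapse = "\n")

# This is the string that gets evaluated once "var " is prepended.
cat(sprintf("var %s ", snippet))
```

The range 2:6 is specific to this snippet; a page with a different number of rows would need a different range.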
Then get the data:
ctx$get("aAreas")
##           [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## [1,] 107447478 2490 1925  559 1016  962    0   18    0     1   110
## [2,] 107447440 2366 1800  565 1033  811    1   46    0     0    23
## [3,] 107447521 2933 2396  543  921 1566    0   11    0     0   115
Or, just the bits we want:
ctx$get("aAreas")[,1]
## [1] 107447478 107447440 107447521
Upvotes: 1
Reputation: 306
If you want to capture only the numbers, you can try this pattern (the digits end up in the first capture group):
(?:\\n\[)(\d+)
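A sketch of how a pattern like that might be used from R with stringr. It assumes the text contains real newline characters (as the R string in the question does) rather than a literal backslash followed by n. Inside an R string every regex backslash has to be doubled, and the non-capturing group is dropped here because it does not change what is matched:

```r
library(stringr)

x <- "\naAreas = [\n[107447478,2490,1925,559,1016,962,0,18,0,1,110,],\n[107447440,2366,1800,565,1033,811,1,46,0,0,23,],\n[107447521,2933,2396,543,921,1566,0,11,0,0,115,]\n];\naRoutes = [\n];\n$(function() {\n $(\".typeTip\").attr(\"title\", \"T=Trad, S=Sport, TR=Toprope\");\n showTips();\n});\n"

# str_match_all() returns one matrix per input string: column 1 is the
# full match, column 2 is the first capture group (the digits).
m <- str_match_all(x, "\\n\\[(\\d+)")[[1]]
ids <- m[, 2]
ids
## [1] "107447478" "107447440" "107447521"
```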
Upvotes: 0
Reputation: 110062
This works:
x <- "\naAreas = [\n[107447478,2490,1925,559,1016,962,0,18,0,1,110,],\n[107447440,2366,1800,565,1033,811,1,46,0,0,23,],\n[107447521,2933,2396,543,921,1566,0,11,0,0,115,]\n];\naRoutes = [\n];\n$(function() {\n $(\".typeTip\").attr(\"title\", \"T=Trad, S=Sport, TR=Toprope\");\n showTips();\n});\n"
stringr::str_extract_all(x, '(?<=\n\\[)\\d+')
## [[1]]
## [1] "107447478" "107447440" "107447521"
The (?<=\n\\[) is a lookbehind: it requires that a newline and an opening square bracket precede the match, without including them in the result. The \\d+ then matches as many consecutive digits as it can, stopping at the first non-digit.
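The same lookbehind can also be tried without stringr. A minimal sketch in base R, assuming the same x as above; perl = TRUE is set because the default regex engine in base R does not support lookbehind, and the conversion to numeric is an optional extra step not in the answer above:

```r
x <- "\naAreas = [\n[107447478,2490,1925,559,1016,962,0,18,0,1,110,],\n[107447440,2366,1800,565,1033,811,1,46,0,0,23,],\n[107447521,2933,2396,543,921,1566,0,11,0,0,115,]\n];\naRoutes = [\n];\n$(function() {\n $(\".typeTip\").attr(\"title\", \"T=Trad, S=Sport, TR=Toprope\");\n showTips();\n});\n"

# gregexpr() finds every match position; regmatches() pulls out the text.
m <- regmatches(x, gregexpr("(?<=\n\\[)\\d+", x, perl = TRUE))[[1]]

# The matches come back as character strings; convert if numbers are needed.
ids <- as.numeric(m)
ids
## [1] 107447478 107447440 107447521
```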
Upvotes: 2