Reputation: 11

Parsing a web page with stringr

I have this project where I am trying to extract numbers from a web page. Below is an example of the text I am trying to parse.

"\naAreas = [\n[107447478,2490,1925,559,1016,962,0,18,0,1,110,],\n[107447440,2366,1800,565,1033,811,1,46,0,0,23,],\n[107447521,2933,2396,543,921,1566,0,11,0,0,115,]\n];\naRoutes = [\n];\n$(function() {\n $(\".typeTip\").attr(\"title\", \"T=Trad, S=Sport, TR=Toprope\");\n showTips();\n});\n"

I am looking for the first number in each row (107447478, 107447440 and 107447521 here), so anything between "\n[" and the next ",". I am trying to do this with the stringr package in R, but I'm not very familiar with regular expressions and I'm striking out.

Upvotes: 0

Answers (3)

hrbrmstr

Reputation: 78842

stringr builds on stringi. This is a different approach using stringi and V8 since you've got javascript there:

library(V8)
library(stringi)

js <- "\naAreas = [\n[107447478,2490,1925,559,1016,962,0,18,0,1,110,],\n[107447440,2366,1800,565,1033,811,1,46,0,0,23,],\n[107447521,2933,2396,543,921,1566,0,11,0,0,115,]\n];\naRoutes = [\n];\n$(function() {\n $(\".typeTip\").attr(\"title\", \"T=Trad, S=Sport, TR=Toprope\");\n showTips();\n});\n"

ctx <- v8()

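We have to remove the jQuery bits since V8 can't deal with those (the engine has no DOM and no jQuery loaded). Lines 2 to 6 of the split string hold just the aAreas assignment, so once we keep only those we can evaluate them as JavaScript: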

ctx$eval(sprintf("var %s ", paste0(stri_split_lines(js)[[1]][2:6], collapse="\n")))

Then get the data:

ctx$get("aAreas")
##           [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## [1,] 107447478 2490 1925  559 1016  962    0   18    0     1   110
## [2,] 107447440 2366 1800  565 1033  811    1   46    0     0    23
## [3,] 107447521 2933 2396  543  921 1566    0   11    0     0   115

Or, just the bits we want:

ctx$get("aAreas")[,1]
## [1] 107447478 107447440 107447521
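
If V8 is not available, roughly the same matrix can be rebuilt with base R alone. This is only a sketch: it assumes every data row starts on its own line as "[n,n,...,]", as in the sample, and the names rows and areas are made up for illustration.

```r
js <- "\naAreas = [\n[107447478,2490,1925,559,1016,962,0,18,0,1,110,],\n[107447440,2366,1800,565,1033,811,1,46,0,0,23,],\n[107447521,2933,2396,543,921,1566,0,11,0,0,115,]\n];\naRoutes = [\n];\n$(function() {\n $(\".typeTip\").attr(\"title\", \"T=Trad, S=Sport, TR=Toprope\");\n showTips();\n});\n"

# Grab the contents of each row: the digits and commas between "\n[" and "]"
rows <- regmatches(js, gregexpr("(?<=\\n\\[)[0-9,]+(?=\\])", js, perl = TRUE))[[1]]

# Split on commas (strsplit drops the empty field left by the trailing comma)
# and stack the rows into a numeric matrix
areas <- do.call(rbind, lapply(strsplit(rows, ","), as.numeric))

# First column holds the ids asked for in the question
areas[, 1]
```

Unlike evaluating the JavaScript, this depends on the exact layout of the array in the page source, so it is best treated as a fallback.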

Upvotes: 1

Anthony

Reputation: 306

If you want to capture the numbers only you can try this:

(?:\\n\[)(\d+)
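
In R the backslashes have to be doubled inside the string literal, and the function used has to return capture groups. A minimal sketch with stringr::str_match_all, run on a shortened copy of the sample text (x and ids are illustrative names, not part of the answer):

```r
library(stringr)

# Shortened copy of the sample: same layout, fewer columns per row
x <- "\naAreas = [\n[107447478,2490,],\n[107447440,2366,],\n[107447521,2933,]\n];\n"

# str_match_all returns one matrix per input string:
# column 1 is the full match, column 2 is the first capture group (\d+)
ids <- str_match_all(x, "(?:\\n\\[)(\\d+)")[[1]][, 2]
ids
```

The non-capturing group (?:...) does not get a column of its own, which is why the digits end up in column 2.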

Upvotes: 0

Tyler Rinker

Reputation: 110062

This works:

x <- "\naAreas = [\n[107447478,2490,1925,559,1016,962,0,18,0,1,110,],\n[107447440,2366,1800,565,1033,811,1,46,0,0,23,],\n[107447521,2933,2396,543,921,1566,0,11,0,0,115,]\n];\naRoutes = [\n];\n$(function() {\n $(\".typeTip\").attr(\"title\", \"T=Trad, S=Sport, TR=Toprope\");\n showTips();\n});\n"

stringr::str_extract_all(x, '(?<=\n\\[)\\d+')
## [[1]]
## [1] "107447478" "107447440" "107447521"

The (?<=\n\\[) is a lookbehind: it requires that a newline and an opening square bracket precede the match, but does not include them in the result. The \\d+ then grabs as many consecutive digits as it can, stopping at the first non-digit.
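
To see what the lookbehind changes, here is a small comparison on a shortened copy of the sample. It is a sketch that uses base R's regmatches and gregexpr with perl = TRUE instead of stringr, only so that it runs without extra packages; with_prefix and digits_only are illustrative names.

```r
# Shortened copy of the sample: same layout, fewer columns per row
x <- "\naAreas = [\n[107447478,2490,],\n[107447440,2366,],\n[107447521,2933,]\n];\n"

# Without the lookbehind, the newline and bracket are part of each match
with_prefix <- regmatches(x, gregexpr("\\n\\[\\d+", x, perl = TRUE))[[1]]

# With the lookbehind, they must be present but are not returned
digits_only <- regmatches(x, gregexpr("(?<=\\n\\[)\\d+", x, perl = TRUE))[[1]]

# The matches are character strings; convert them if numbers are needed
as.numeric(digits_only)
```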

Upvotes: 2
