Reputation: 111
I have a background in data and have only just started scraping, so forgive me if my knowledge of web standards and languages is not up to scratch.
I am trying to scrape some data from a JavaScript component of a website I use. Viewing the page source, I can see that the data I need is already there, in JSON format, inside JavaScript function calls. For example, it looks a little like this:
<script type="text/javascript">
$(document).ready(function () {
gameState = 4;
atView.init("/Data/FieldView/20152220150142207",{"a":[{"co":true,"col":"Red"}],"b":false,...)
meLine.init([{"c":100,"b":true,...)
</script>
Now, I only need the JSON data in meLine.init. If I manually copy and paste just that JSON into a file, I can convert it with jsonlite in R and have exactly what I need.
However, I don't want to copy and paste from multiple pages, so I need a way of extracting only this data and leaving everything else behind. I originally thought I would save the HTML source in R, convert it to text, and regex match "meLine.init(", but I'm not really getting anywhere with that. Could anyone offer some help?
Upvotes: 1
Views: 2030
Reputation: 6191
Normally I'd use XML and XPath to parse an HTML page, but in this case (since you know the exact structure you're looking for) you might be able to do it directly with a regular expression (this is generally not a good idea, as emphasized here). I'm not sure this gets you exactly to your goal, but
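# Keep only the line(s) containing the call, then strip everything except its argument.
lines <- grep("meLine.init", readLines("file://test.html"), value = TRUE, fixed = TRUE)
sub("^.*meLine\\.init\\((.+)\\).*$", "\\1", lines, perl = TRUE)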
will return the line you're looking for, and then you can work your magic with jsonlite. The idea is to read the page line by line, grep the (hopefully) single line that contains the string meLine.init, and then extract the JSON string from that. Replace file://test.html with the URL you want to use.
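For what it's worth, here is a rough end-to-end sketch of that idea. The page content is a made-up stand-in for the output of readLines (the real JSON is elided in the question, so the payload here is invented), and extract_init_json is just a hypothetical helper name. It assumes the whole meLine.init(...) call sits on one line and that nothing containing a ")" follows it on that line.

```r
# Made-up stand-in for readLines(<your URL>); the JSON payload is invented for illustration.
page <- c(
  '<script type="text/javascript">',
  '$(document).ready(function () {',
  '    gameState = 4;',
  '    meLine.init([{"c":100,"b":true},{"c":250,"b":false}]);',
  '});',
  '</script>'
)

# Hypothetical helper: return the argument text of the first meLine.init(...) call.
extract_init_json <- function(lines) {
  # Literal (non-regex) search for the line holding the call
  hit <- grep("meLine.init(", lines, value = TRUE, fixed = TRUE)
  if (length(hit) == 0) stop("no line contains meLine.init(")
  # Greedy capture: everything between "meLine.init(" and the last ")" on the line
  sub("^.*meLine\\.init\\((.*)\\).*$", "\\1", hit[1], perl = TRUE)
}

json_txt <- extract_init_json(page)

# Parse with jsonlite if it is installed; an array of objects becomes a data frame.
if (requireNamespace("jsonlite", quietly = TRUE)) {
  me_line <- jsonlite::fromJSON(json_txt)
  print(me_line)
}
```

To run this over several pages, you could wrap the readLines call and the helper in lapply over a vector of URLs. If the call ever spans more than one line, the line-by-line grep will miss it and you would need to collapse the page into a single string first.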
Upvotes: 2