Reputation: 39913
Let me preface this by saying I don't care what language this solution gets written in, as long as it runs on Windows.
My problem is this: there is a site with frequently updated data that I would like to grab at regular intervals for later reporting. The site requires JavaScript to work properly, so just using wget doesn't work. What is a good way to either embed a browser in a program or use a stand-alone browser to routinely scrape the screen for this data?
Ideally, I'd like to grab certain tables on the page but can resort to regular expressions if necessary.
Upvotes: 1
Views: 4725
Reputation: 4789
I recently did some research on this topic. The best resource I found is this Wikipedia article, which gives links to many screen scraping engines.
I needed something I could use as a server and run in batch, and from my initial investigation I think Web Harvest is quite good as an open-source solution. I have also been impressed by Screen Scraper, which seems very feature-rich and can be used from different languages.
There is also a new project called Scrapy; I haven't checked it out yet, but it's a Python framework.
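For a sense of what a Scrapy spider looks like, here is a minimal sketch (the spider name, URL, and table id are placeholders, and note that Scrapy by itself does not execute JavaScript):

import scrapy

class DataSpider(scrapy.Spider):
    name = "data"
    start_urls = ["http://example.com/data"]

    def parse(self, response):
        # Yield the cell text of every row in the target table
        for row in response.xpath("//table[@id='data']//tr"):
            yield {"cells": row.xpath(".//td/text()").getall()}

Running it with scrapy runspider spider.py -o data.csv gives you CSV output for later reporting.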
Upvotes: 0
Reputation: 34130
You could use the Perl module LWP together with the JavaScript module. While this may not be the quickest to set up, it should work reliably. I definitely wouldn't make this your first foray into Perl, though.
Upvotes: 0
Reputation: 5211
If you are familiar with Java (or perhaps another language that runs on a JVM, such as JRuby or Jython), you can use HtmlUnit. HtmlUnit simulates a complete browser: making HTTP requests, creating a DOM for each page, and running JavaScript (using Mozilla's Rhino).
Additionally, you can run XPath queries on documents loaded in the simulated browser, simulate events, etc.
http://htmlunit.sourceforge.net
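Since Jython runs on the JVM too, a rough sketch of driving HtmlUnit from Jython might look like this (assuming the HtmlUnit jars are on the classpath; the URL and XPath are placeholders):

from com.gargoylesoftware.htmlunit import WebClient

client = WebClient()
page = client.getPage("http://example.com/data")
# The simulated browser has already run the page's JavaScript
for cell in page.getByXPath("//table[@id='data']//td"):
    print cell.asText()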
Upvotes: 1
Reputation: 31
To complement Whaledawg's suggestion, I was going to suggest using an RSS scraper application (do a Google search), which gives you nice raw XML to consume programmatically instead of a response stream. There may even be a few open-source implementations that would give you more of an idea if you wanted to implement it yourself.
Upvotes: 0
If you have Excel then you should be able to import the data from the webpage into Excel.
From the Data menu select Import External Data and then New Web Query.
Once the data is in Excel then you can either manipulate it within Excel or output it in a format (e.g. CSV) you can use elsewhere.
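If the import needs to happen on a schedule rather than by hand, the same web query can be driven through COM. A rough Python sketch, assuming pywin32 and Excel are installed (the URL and output path are placeholders):

import win32com.client

excel = win32com.client.Dispatch("Excel.Application")
excel.DisplayAlerts = False
wb = excel.Workbooks.Add()
ws = wb.Worksheets(1)
# Same mechanism as Data > Import External Data > New Web Query
qt = ws.QueryTables.Add(Connection="URL;http://example.com/data",
                        Destination=ws.Range("A1"))
qt.Refresh(BackgroundQuery=False)
wb.SaveAs(r"C:\scrape\data.csv", FileFormat=6)  # 6 = xlCSV
excel.Quit()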
Upvotes: 0
Reputation: 4286
I would recommend Yahoo Pipes; that's exactly what it was built to do. You can then get the pipe's output as an RSS feed and do what you want with it.
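On the consuming side, Python's feedparser module makes the RSS output easy to work with; a minimal sketch (the pipe URL is a placeholder for your pipe's RSS output):

import feedparser

feed = feedparser.parse("http://pipes.yahoo.com/pipes/pipe.run?_id=YOUR_PIPE_ID&_render=rss")
for entry in feed.entries:
    print(entry.title, entry.link)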
Upvotes: 1
Reputation: 86492
You can look at Beautiful Soup. Being open-source Python, it is easily programmable. Quoting the site:
Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping.
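A minimal sketch of grabbing the tables mentioned in the question, using the current bs4 package (the URL and table id are placeholders; keep in mind that Beautiful Soup parses HTML but does not run JavaScript):

import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen("http://example.com/data").read()
soup = BeautifulSoup(html, "html.parser")
# Walk the rows of the table you care about
for row in soup.find("table", id="data").find_all("tr"):
    print([cell.get_text(strip=True) for cell in row.find_all("td")])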
Upvotes: 2
Reputation: 7912
Give Badboy a try. It's meant to automate system testing of your websites, but you may find its regular-expression rules handy enough to do what you want.
Upvotes: 0
Reputation: 338406
If JavaScript is a must, you can try instantiating Internet Explorer via ActiveX (CreateObject("InternetExplorer.Application")) and using its Navigate2() method to open your web page.
Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = True
ie.Navigate2 "http://stackoverflow.com"
Do While ie.Busy Or ie.ReadyState <> 4
    WScript.Sleep 100
Loop
After the page has finished loading (the loop above polls ie.ReadyState; you can also check document.ReadyState), you have full access to the DOM and can use whatever methods you like to extract any content you want.
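If you'd rather not write VBScript, the same approach works from Python via pywin32; a sketch, assuming pywin32 is installed:

import time
import win32com.client

ie = win32com.client.Dispatch("InternetExplorer.Application")
ie.Visible = True
ie.Navigate2("http://stackoverflow.com")
while ie.Busy or ie.ReadyState != 4:  # 4 = READYSTATE_COMPLETE
    time.sleep(0.1)
# Full DOM access, with the page's JavaScript already executed
for table in ie.Document.getElementsByTagName("table"):
    print(table.innerText)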
Upvotes: 3
Reputation: 28663
You could probably use web app testing tools like Watir, WatiN, or Selenium to automate the browser and pull the values from the page. I've done this for scraping data before, and it works quite well.
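For example, with Selenium's Python bindings it comes down to a few lines; a sketch, assuming selenium is installed and a matching browser driver is on your PATH (the URL and selector are placeholders):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://example.com/data")
# The real browser has run the page's JavaScript, so the rendered DOM is available
for row in driver.find_elements(By.CSS_SELECTOR, "table#data tr"):
    print(row.text)
driver.quit()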
Upvotes: 9