i love stackoverflow

Reputation: 1685

What general approach can I take to parse the contents of a website?

Say someone else has a website generated by JavaScript, so I can't go look at the source and read what should be on the screen. How can I grab the text on the screen so I can feed it into another program? Also, how can I write a program that automatically clicks on radio buttons, links, etc. that satisfy certain criteria?

Upvotes: 1

Views: 169

Answers (3)

torrential coding

Reputation: 1765

You can write a web scraping tool in Perl or Python. Or, you can use existing tools and frameworks to achieve that.

Check out Scrapy, an open-source tool written in Python.

Take a look at Selenium too.
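As a rough sketch of what the Selenium route might look like in Python: the helper names (`click_matching`, `visible_text`, `scrape`) and the "value is yes" criterion are made up for illustration, and the code assumes Selenium 4 with Firefox available. The helpers take the driver as a parameter, so the clicking logic can be tried out separately from launching a browser.

```python
def click_matching(driver, css_selector, predicate):
    """Click every element matching css_selector for which predicate(element) is true.

    `driver` is assumed to be a Selenium 4 WebDriver. "css selector" is the
    string value behind selenium's By.CSS_SELECTOR.
    """
    clicked = 0
    for element in driver.find_elements("css selector", css_selector):
        if predicate(element):
            element.click()
            clicked += 1
    return clicked


def visible_text(driver):
    """Return the text the browser currently renders inside <body>."""
    return driver.find_element("tag name", "body").text


def scrape(url):
    # Imported here so the helpers above do not require Selenium at import time.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumption: Firefox and its driver are installed
    try:
        driver.get(url)
        # Hypothetical criterion: click radio buttons whose value is "yes".
        click_matching(
            driver,
            "input[type='radio']",
            lambda el: el.get_attribute("value") == "yes",
        )
        return visible_text(driver)
    finally:
        driver.quit()
```

Calling `scrape("http://example.com/")` would open a browser, click the matching radio buttons, and return the on-screen text, which you can then feed to another program. Pages that load content late may also need an explicit wait before reading the text.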

Upvotes: 1

John Saunders

Reputation: 161773

If you need to handle content generated by script, then your first problem is to cause the script to execute. Further, the script will want to generate the content into a DOM. That means you need to have a DOM, and a script engine, and probably HTTP access to the Internet, and XML handling, etc.

If that sounds a lot like a web browser, then you're listening.

What you basically need is a web browser that you can control from a program. You'll need to be able to tell it to browse to a page, click buttons and links, etc., then you'll need to read back the resulting DOM.

Only then will you need to parse the page.
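For that last parsing step, here is a minimal sketch in Python, assuming you have already read the rendered DOM back from the browser as an HTML string. It uses only the standard library's `html.parser`; the names `TextExtractor` and `extract_text` are illustrative. It simply collects text nodes and skips script and style contents, so it only approximates what is on screen (it knows nothing about CSS-hidden elements, for example).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the text content of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

For anything beyond plain text extraction (selecting specific elements, following structure), a dedicated HTML parsing library is likely a better fit than this hand-rolled class.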

If you're in the Microsoft world, then you can use the WebBrowser control. There are several forms of this, and they all amount to the same thing: you can have Internet Explorer run inside of your program, and your program can control it.

I understand there are other browsers that can be controlled from a program, but since I don't know their details, I'll wait for someone else to tell us both.

Upvotes: 1

Nathan

Reputation: 4067

To parse dynamic content, you can read the page's JavaScript source and fetch the same content the way the page itself does (i.e., by replicating its AJAX calls).

If you want to submit data as if elements had been clicked, edited, or selected (without actually clicking them), you can send a request containing the data the server expects, using an HTTP library such as cURL. See an example here.
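To illustrate the idea in Python rather than cURL, here is a sketch using the standard library's `urllib`. The function names, the URL, and the field names are placeholders: you would find the real endpoint and parameters by watching the requests the page makes in your browser's network tools. It also assumes the server accepts an ordinary URL-encoded form POST. Many sites expect JSON, cookies, or tokens instead.

```python
import urllib.parse
import urllib.request


def build_form_post(url, fields):
    """Build a POST request carrying `fields` as URL-encoded form data."""
    data = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def submit(request, timeout=10):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Usage would be along the lines of `submit(build_form_post("http://example.com/vote", {"choice": "yes"}))`, where both the URL and the `choice` field are hypothetical.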

Upvotes: 1
