André

Reputation: 25554

Webscraping Techniques using PHP or Python

I need to scrape about 100 websites that are very similar in the content that they provide.

My first doubt: is it possible to write one generic script to scrape all 100 websites, or do scraping techniques only allow scripts tailored to particular websites? (Dumb question.) I suppose I should ask which approach is easier, because writing 100 different scripts, one per website, would be hard.

Second question. My primary language is PHP, but after searching here on Stack Overflow I found that one of the most advanced scrapers is "Beautiful Soup" in Python. Is it possible to call "Beautiful Soup" in Python from PHP? Or would it be better to write the whole script in Python?

Give me some clues on how I should proceed.

Sorry for my weak English.

Best Regards,

Upvotes: 3

Views: 4302

Answers (4)

Matt Billenstein

Reputation: 678

We do something sort of like this with RSS feeds using Python -- we use ElementTree since RSS is usually guaranteed to be well-formed. Beautiful Soup is probably better suited for parsing HTML.
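
For the RSS case, here is a minimal sketch using Python's built-in xml.etree.ElementTree; the feed URL is a placeholder, not a real endpoint:

```python
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical feed URL -- substitute a real RSS feed
xml_data = urllib.request.urlopen("https://example.com/feed.xml").read()

root = ET.fromstring(xml_data)
# RSS items live under channel/item; iter() finds them at any depth
for item in root.iter("item"):
    # findtext returns the element's text, or None if it is missing
    print(item.findtext("title"), "->", item.findtext("link"))
```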

Insofar as dealing with 100 different sites, try to write an abstraction that works on most of them and transforms the page into a common data-structure you can work with. Then override parts of the abstraction to handle individual sites which differ from the norm.
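
A rough sketch of that pattern, assuming each site yields records with a title and a price (all selectors and class names here are hypothetical):

```python
from bs4 import BeautifulSoup

class BaseScraper:
    """Default extraction logic shared by most of the 100 sites."""

    title_selector = "h1.title"
    price_selector = "span.price"

    def parse(self, html):
        soup = BeautifulSoup(html, "html.parser")
        return {
            "title": self.extract(soup, self.title_selector),
            "price": self.extract(soup, self.price_selector),
        }

    def extract(self, soup, selector):
        node = soup.select_one(selector)
        return node.get_text(strip=True) if node else None

class OddSiteScraper(BaseScraper):
    """Override only what differs from the norm."""
    title_selector = "div#headline"
```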

Scrapers are usually I/O bound -- look into coroutine libraries like eventlet or gevent to exploit some I/O parallelism and speed up the whole process.
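
A minimal gevent sketch of the I/O-parallel fetch (the URLs are placeholders); eventlet offers a very similar spawn/join API:

```python
from gevent import monkey
monkey.patch_all()  # make stdlib sockets cooperative; do this before other imports

import gevent
import urllib.request

def fetch(url):
    # Each fetch yields to the others while waiting on the network
    return url, urllib.request.urlopen(url).read()

urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
jobs = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(jobs, timeout=30)

for job in jobs:
    if job.successful():
        url, body = job.value
        print(url, len(body))
```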

Upvotes: 0

miku

Reputation: 188014

1.) One scraper for 100 sites? It depends on your requirements. If you need specific information, you'll have to account for 100 different websites and their layouts. Some generic functionality could be shared, though.

2.) BeautifulSoup is an HTML/XML parser, not a screen scraper per se. It would be a top choice for the task if the scraper were written in Python. Calling Python from PHP can be done, but it is certainly not as clean as a single-language solution, which is why I'd suggest you look into Python and BeautifulSoup, at least for a prototype.
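
To make the interop idea concrete, here is a small BeautifulSoup sketch (the bs4 package) that takes a URL and prints JSON; a PHP script could shell out to it and json_decode the output. The extracted fields are just placeholders:

```python
import json
import sys
import urllib.request

from bs4 import BeautifulSoup

def scrape(url):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    # Placeholder extraction: page title and all top-level headings
    return {
        "title": soup.title.get_text(strip=True) if soup.title else None,
        "headings": [h.get_text(strip=True) for h in soup.find_all("h1")],
    }

if __name__ == "__main__":
    print(json.dumps(scrape(sys.argv[1])))
```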

Sidenote: http://scrapy.org/ is another Python library, designed especially to crawl websites and extract structured data from their pages.
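
For a sense of what that looks like, a bare-bones Scrapy spider (the start URL and CSS selectors are hypothetical), runnable with `scrapy runspider spider.py -o items.json`:

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    # Placeholder start page -- replace with one of the 100 sites
    start_urls = ["https://example.com/listing"]

    def parse(self, response):
        # Selectors here are hypothetical; adjust per site
        for row in response.css("div.item"):
            yield {
                "title": row.css("h2::text").get(),
                "link": row.css("a::attr(href)").get(),
            }
```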

Upvotes: 2

JAL

Reputation: 21563

I've done this a few ways.

1: with grep, sed, and awk. This is about the same as 2: regex. These methods are very direct, but they fail whenever the HTML structure of the site changes (a sketch of the regex approach follows this list).

3: PHP's XML/HTML parser DOMDocument. This is far more reliable than regex, but I found it annoying to work with (I hate the mixture of PHP arrays and objects). If you want to use PHP, phpQuery is probably a good solution, as Thai suggested.

4: Python and BeautifulSoup. I can't say enough good things about BeautifulSoup, and this is the method I recommend. I found my code feels cleaner in Python, and BeautifulSoup was very easy and efficient to work with. Good documentation, too.
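
As a quick illustration of why approaches 1 and 2 above are brittle, a minimal regex sketch in Python (the URL is a placeholder):

```python
import re
import urllib.request

html = urllib.request.urlopen("https://example.com").read().decode("utf-8", "replace")

# Works only while the markup matches this exact shape -- any structural change breaks it
match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
print(match.group(1).strip() if match else "no title found")
```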

You will have to specialize your script for each site. It depends on what sort of information you wish to extract. If it were something standard like the page title, of course you wouldn't have to change anything, but the info you want is likely more specific than that.

Upvotes: 0

Thai

Reputation: 11354

Because I prefer PHP to Python, I once used phpQuery to scrape data from websites. It works pretty well, and I came up with a scraper pretty quickly, using CSS selectors (with the help of SelectorGadget) to select elements and get their ->text().

But I found it to be a bit slow (since I had to scrape thousands of pages), so in the end I changed it to use regex to scrape data instead. D:

Upvotes: 2
