Reputation: 111
I am working on a project that involves scraping text from thousands of websites for small organizations. I am new to R and had no web scraping experience before this project. Here is my code with an example website:
library(rvest)

# Read the page and extract the text from every <div>
soya <- read_html("http://www.soyaquaalliance.com/")
all_text <- soya %>%
  html_nodes("div") %>%
  html_text()

# Remove carriage returns, newlines, and tabs (assigning the result back so it isn't discarded)
all_text <- gsub('[\r\n\t]', '', all_text)

# Divert console output to a text file
sink(file = "C:soya.txt")
cat(all_text)
sink(NULL)
My goal is to scrape everything within a domain and export it to an individual txt file for each site. I have tried lapply, but it seems to require knowing the format of each website.
Is there a general function that will scrape all the text from the pages within each site?
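For illustration, a minimal sketch of the kind of lapply loop I mean (the sites vector here is only a placeholder; the real list has thousands of URLs):

library(rvest)

# Placeholder list of sites; in practice this would come from a file
sites <- c("http://www.soyaquaalliance.com/")

lapply(sites, function(url) {
  # Same steps as above, applied to one site's landing page
  page_text <- read_html(url) %>%
    html_nodes("div") %>%
    html_text()
  page_text <- gsub('[\r\n\t]', '', page_text)

  # One txt file per site, named after the URL
  out_file <- paste0(gsub('[^[:alnum:]]+', '_', url), ".txt")
  writeLines(page_text, out_file)
})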
Upvotes: 0
Views: 334
Reputation: 78792
I'm not writing half a dozen comments. This is not really an answer-answer so pedants can tick this down if they feel so moved.
"Everything in a domain"…
You mean crawl an entire web site tree from the starting /
?
If so, R is not the right tool for this. Neither is Python. Python has some frameworks that could work and R has this but you're going to be doing quite a bit of programming to take care of edge cases, exceptions, parsing problems, encoding issues, etc. It makes zero sense to reinvent the wheel.
There are at-scale but very easy-to-run technologies like Heritrix which are purpose-built for this. You'd have to spend time coding up an R solution for scraping anyway, so take that time to read up on Heritrix instead and use one of the handy docker containers for it to get it going, so you don't have to become an expert on its dependencies. Once you run the container it's just editing a config file and a couple of clicks in the web interface.
It generates WARC files and I've got a few packages that read that format (just poke around my GH – I'm hrbrmstr there as well).
You also just can't "scrape what you want" b/c "darnit you and some super self-important org or two" want to. There are rules. Heritrix will follow robots.txt restrictions (and you should not override those, as your need isn't greater than what a site wants folks to do). Those rules restrict paths and tell you how fast you can crawl. Just b/c you want to do something super fast doesn't give you the right to. You're no more important than the folks paying to run the sites.
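(Aside, and not part of the Heritrix workflow above: if some of this does end up in R, the robotstxt package is one way to check those rules before fetching anything. A minimal sketch, using the example domain from the question:)

library(robotstxt)

# Is a generic crawler allowed to fetch the site root?
paths_allowed(paths = "/", domain = "www.soyaquaalliance.com", bot = "*")

# Look at the parsed robots.txt, including any Crawl-delay directives
rt <- robotstxt(domain = "www.soyaquaalliance.com")
rt$crawl_delay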
There are also Terms & Conditions/Terms of Service that just got more legal beef, at least in the U.S. LinkedIn and others have successfully sued scrapers for gobs of cash. If I had T&Cs set up to restrict scraping and you did it to me, I'd definitely sue you and encourage others to as well, especially since I do monitor for that type of access. It's unethical and increasingly illegal to violate an agreement a content provider has established. Again, just b/c you want to do something and can technically do something does not give you the right to do something.
I go into all that b/c it sounds like you're super-new to scraping and are going to make assumptions that can potentially get you into legal trouble, if not get your IP address banned in dozens of network segments on the internet.
I also go into it to seriously encourage you to use a real scraping platform for this and then do data processing in R after the fact. I do this for a living and would never consider using R or Python for a task such as you are suggesting.
There are alternatives to Heritrix.
You're likely going to get Python pedants commenting to this that Scrapy is "totally a good solution, dude". Listen to them at your peril.
I suspect I will not have convinced you with this but hopefully it will dissuade others from taking the hard path to their ultimate goal.
Upvotes: 1
Reputation: 447
To get all the text into a character vector, use readLines.
soya <- readLines("http://www.soyaquaalliance.com/")
soya[1:10]
[1] "<!DOCTYPE html> "
[2] "<html lang=\"en-US\">"
[3] "<head>"
[4] "\t<meta charset=\"UTF-8\">"
[5] "\t<title>Soy Aquaculture Alliance | Building partnerships for abundant, healthy, homegrown seafood.</title>"
[6] "\t<link rel=\"pingback\" href=\"http://www.soyaquaalliance.com/xmlrpc.php\">"
[7] "\t<meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0, maximum-scale=1.0\">"
[8] "\t\t"
[9] "\t<link rel='dns-prefetch' href='//maps.googleapis.com' />"
[10] "<link rel='dns-prefetch' href='//fonts.googleapis.com' />"
This gives you the raw HTML source used to build the site, which you'll then have to parse for the different nodes using regex.
For example, to find the lines where the tweets occur:
soya[grep('.*class="widget widget_twitter_widget">.*', soya) + 2] %>% trimws
[1] "The perfect salad for spring [some URL]"
Upvotes: 0