Reputation: 111
I am working on a project that involves scraping text from thousands of websites for small organizations. I am new to R and had no web scraping experience before this project. Here is my code with an example website:
library(rvest)

# Read the page and extract the text from every <div>
soya <- read_html("http://www.soyaquaalliance.com/")
all_text <- soya %>%
  html_nodes("div") %>%
  html_text()

# Remove carriage returns, newlines, and tabs (assigning the result back so it isn't discarded)
all_text <- gsub('[\r\n\t]', '', all_text)

# Divert console output to a text file
sink(file = "C:soya.txt")
cat(all_text)
sink(NULL)
My goal is to scrape everything within a domain and export it to an individual txt file for each site. I have tried lapply, but it seems to require knowing the format of each website.
Is there a general function that will scrape all the text from the pages within each site?
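For illustration, a minimal sketch of the kind of lapply loop I mean (the sites vector here is only a placeholder; the real list has thousands of URLs):

library(rvest)

# Placeholder list of sites; in practice this would come from a file
sites <- c("http://www.soyaquaalliance.com/")

lapply(sites, function(url) {
  # Same steps as above, applied to one site's landing page
  page_text <- read_html(url) %>%
    html_nodes("div") %>%
    html_text()
  page_text <- gsub('[\r\n\t]', '', page_text)

  # One txt file per site, named after the URL
  out_file <- paste0(gsub('[^[:alnum:]]+', '_', url), ".txt")
  writeLines(page_text, out_file)
})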
Upvotes: 0
Views: 334
Reputation: 78792
I'm not writing half a dozen comments. This is not really an answer-answer so pedants can tick this down if they feel so moved.
"Everything in a domain"…
You mean crawl an entire web site tree from the starting /
?
If so, R is not the right tool for this. Neither is Python. Python has some frameworks that could work and R has this but you're going to be doing quite a bit of programming to take care of edge cases, exceptions, parsing problems, encoding issues, etc. It makes zero sense to reinvent the wheel.
There are at-scale but very easy-to-run technologies like Heritrix which are purpose-built for this. You'd have to spend time coding up an R solution for scraping anyway, so take that time to read up on Heritrix instead and use one of the handy docker containers for it to get it going, so you don't have to become an expert on its dependencies. Once you run the container it's just editing a config file and a couple of clicks in the web interface.
It generates WARC files and I've got a few packages that read that format (just poke around my GH – I'm hrbrmstr there as well).
You also just can't "scrape what you want" b/c "darnit you and some super self-important org or two" want to. There are rules. Heritrix will follow robots.txt restrictions (and you should not override those, as your need isn't greater than what a site wants folks to do). Those rules restrict paths and tell you how fast you can crawl. Just b/c you want to do something super fast doesn't give you the right to. You're no more important than the folks paying to run the sites.
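(Aside, and not part of the Heritrix workflow above: if some of this does end up in R, the robotstxt package is one way to check those rules before fetching anything. A minimal sketch, using the example domain from the question:)

library(robotstxt)

# Is a generic crawler allowed to fetch the site root?
paths_allowed(paths = "/", domain = "www.soyaquaalliance.com", bot = "*")

# Look at the parsed robots.txt, including any Crawl-delay directives
rt <- robotstxt(domain = "www.soyaquaalliance.com")
rt$crawl_delay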
There are also Terms & Conditions/Terms of Service that just got more legal beef, at least in the U.S. LinkedIn and others have successfully sued scrapers for gobs of cash. If I had T&Cs set up to restrict scraping and you did it to me, I'd definitely sue you and encourage others to as well, especially since I do monitor for that type of access. It's unethical and increasingly illegal to violate an agreement a content provider has established. Again, just b/c you want to do something and can technically do something does not give you the right to do something.
I go into all that b/c it sounds like you're super-new to scraping and are going to make assumptions that can potentially get you into legal trouble, if not get your IP address banned in dozens of network segments on the internet.
I also go into it to seriously encourage you to use a real scraping platform for this and then do data processing in R after the fact. I do this for a living and would never consider using R or Python for a task such as you are suggesting.
There are alternatives to Heritrix.
You're likely going to get Python pedants commenting to this that Scrapy is "totally a good solution, dude". Listen to them at your peril.
I suspect I will not have convinced you with this but hopefully it will dissuade others from taking the hard path to their ultimate goal.
Upvotes: 1
Reputation: 447
To get all the text into a character vector, use readLines.
soya <- readLines("http://www.soyaquaalliance.com/")
soya[1:10]
[1] "<!DOCTYPE html> "
[2] "<html lang=\"en-US\">"
[3] "<head>"
[4] "\t<meta charset=\"UTF-8\">"
[5] "\t<title>Soy Aquaculture Alliance | Building partnerships for abundant, healthy, homegrown seafood.</title>"
[6] "\t<link rel=\"pingback\" href=\"http://www.soyaquaalliance.com/xmlrpc.php\">"
[7] "\t<meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0, maximum-scale=1.0\">"
[8] "\t\t"
[9] "\t<link rel='dns-prefetch' href='//maps.googleapis.com' />"
[10] "<link rel='dns-prefetch' href='//fonts.googleapis.com' />"
This gives you the raw HTML source used to build the site, which you'll then have to parse for the different nodes using regex.
For example, to find the lines where the tweets occur:
soya[grep('.*class="widget widget_twitter_widget">.*', soya) + 2] %>% trimws
[1] "The perfect salad for spring [some URL]"
Upvotes: 0