Reputation: 57
I'm currently writing a scraper with Scrapy. I want to crawl all the text that is shown on the website, not just a single page but all the subpages as well. I'm using CrawlSpider because I think it's made for scraping the other pages as well. Here is the code I wrote so far:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.exporters import XmlItemExporter


class MySpider(CrawlSpider):
    name = 'eship2'
    allowed_domains = ['tlk-energy.com']
    start_urls = ['http://www.tlk-energy.com']

    # Follow any link Scrapy finds (that is allowed).
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        item = dict()
        item['url'] = response.url
        item['title'] = response.meta['link_text']
        item['body'] = '\n'.join(response.xpath('//text()').extract())
        return item
I get an output that suits my wishes very well, but it still contains a lot of tabs and blank lines, like this one:
> Wärmepumpen- Klimakreislauf E-Fahrzeug
>
>
>
>
>
>
>
>
>
>
>
>
>
> Projektbeschreibung
>
> Nulla at nulla justo, eget luctus tortor. Nulla facilisi. Duis aliquet
> egestas purus in blandit. Curabitur vulputate, ligula lacinia
> scelerisque tempor, lacus lacus ornare ante, ac egestas est urna sit
> amet arcu.
and also some text like this one:
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-77796939-1', 'auto');
ga('send', 'pageview');
I just want a file, for example XML, where the text of the website is shown, and maybe the URL where the text was found.
Upvotes: 1
Views: 177
Reputation: 1981
You need to add some post-processing to clean your results.
To remove JavaScript and CSS text from your results, use this:
results = response.xpath(
    '//*[not(self::script or self::style)]/text()'
).extract()
Then apply strip() and an if condition to remove empty lines:
text = " ".join([x.strip() for x in results if x.strip()])
Upvotes: 3