Reputation: 11
I have been trying to write a program (code below) that crawls every page it can find on a domain and then scrapes all of the text contained on the site.
The spider does seem to pull the text from each page, but the useful content is buried among whitespace and markup noise, and the output looks like this:
n\t\t\t\t', '\n\t\t\t\t', '\n\t\t\t\t', '\n\t\t\t', 'Dry lining is a system for cladding the internal faces of buildin gs, such as walls and ceilings when with plasterboard when "wet" plaster is not required.', '\t\t', '\n\t\t\t\t', '\n\t\t\t\t\t\t', '\n\t\t\t', '\n\t\t', '\n\t\t\t\t\t\t', '\n\t\t\t', '\n\ ', '\n\t\t\t\t\t\t', '\n\t\t\t', '\n\t\t', '\n\t\t\t\t', '\n\t\t\t', '\n\t\t\t\t\t', '\n\t\t\t\t', '\n\t\t\t\t', '\n\t\t\t', 'Coving', '\t\t', '\n\t\t\t\t', '\n\ t\t\t\t', '\n\t\t\t\t', '\n\t\t\
Can anyone help me clean up the text so that I am left with just the relevant information, please?
Here is the code:
from scrapy.spiders import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'c'
    allowed_domains = ['billsplastering.co.uk']
    start_urls = ['https://www.billsplastering.co.uk/']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        print(response.css("::text").extract())
Upvotes: 0
Views: 109
Reputation: 75
Try this. It strips leading and trailing whitespace from each item in the list you posted above your code, then filters out the empty strings.
list_body = [
'n\t\t\t\t', '\n\t\t\t\t', '\n\t\t\t\t', '\n\t\t\t',
'Dry lining is a system for cladding the internal faces of buildin gs, such as walls and ceilings when with plasterboard when "wet" plaster is not required.',
'\t\t', '\n\t\t\t\t', '\n\t\t\t\t\t\t', '\n\t\t\t', '\n\t\t',
'\n\t\t\t\t\t\t', '\n\t\t\t', '\n\ ', '\n\t\t\t\t\t\t', '\n\t\t\t', '\n\t\t',
'\n\t\t\t\t', '\n\t\t\t', '\n\t\t\t\t\t', '\n\t\t\t\t', '\n\t\t\t\t',
'\n\t\t\t', 'Coving', '\t\t', '\n\t\t\t\t', '\n\ t\t\t\t', '\n\t\t\t\t', '\n\t\t\t']
# Strip blanks from items in list
list_no_blanks = [text.strip() for text in list_body]
# Filter out empty strings
list_filter = list(filter(lambda x: x != "", list_no_blanks))
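If it helps, here is a minimal sketch of how the same strip-and-filter step could be folded into your spider's parse_item callback. The choice to write the cleaned text to a .txt file named after the last URL segment is just illustrative, not part of Scrapy itself:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'c'
    allowed_domains = ['billsplastering.co.uk']
    start_urls = ['https://www.billsplastering.co.uk/']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Extract every text node, strip surrounding whitespace,
        # and keep only the non-empty strings.
        texts = response.css("::text").extract()
        cleaned = [t.strip() for t in texts if t.strip()]

        # Illustrative output: one cleaned line per text node,
        # in a .txt file named after the URL segment.
        filename = response.url.split("/")[-2] + '.txt'
        with open(filename, 'w', encoding='utf-8') as f:
            f.write('\n'.join(cleaned))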
Upvotes: 1