Reputation: 374
I'm trying to write a small script that will extract Steam game tags and store them in a CSV file. The issue I'm currently having is that I do not know how to remove the HTML tags from my output. My code is below:
from __future__ import absolute_import
import scrapy
from Example.items import SteamItem
from scrapy.selector import HtmlXPathSelector

class SteamSpider(scrapy.Spider):
    name = 'steamspider'
    allowed_domains = ['https://store.steampowered.com/app']
    start_urls = ["https://store.steampowered.com/app/578080/PLAYERUNKNOWNS_BATTLEGROUNDS/",]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        tags = hxs.xpath('//*[@id="game_highlights"]/div[1]/div/div[4]/div/div[2]')
        for sel in tags:
            item = SteamItem()
            item['gametags'] = sel.xpath('.//a/text()').extract()
            item['gametitle'] = sel.xpath('//html/body/div[1]/div[7]/div[3]/div[1]/div[2]/div[2]/div[2]/div/div[3]/text()').extract()
            yield item
My Item class:
class SteamItem(scrapy.Item):
    # defining item fields
    url = scrapy.Field()
    gametitle = scrapy.Field()
    gametags = scrapy.Field()
My output then looks like this:
[u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tSurvival\t\t\t\t\t\t\t\t\t\t\t\t',
u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tShooter\t\t\t\t\t\t\t\t\t\t\t\t',
u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tMultiplayer\t\t\t\t\t\t\t\t\t\t\t\t',
u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tPvP\t\t\t\t\t\t\t\t\t\t\t\t',
u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tThird-Person Shooter\t\t\t\t\t\t\t\t\t\t\t\t',
u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tFPS\t\t\t\t\t\t\t\t\t\t\t\t',
u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tAction\t\t\t\t\t\t\t\t\t\t\t\t',
u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tBattle Royale\t\t\t\t\t\t\t\t\t\t\t\t',
u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tOnline Co-Op\t\t\t\t\t\t\t\t\t\t\t\t',
u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tTactical\t\t\t\t\t\t\t\t\t\t\t\t',
u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tCo-op\t\t\t\t\t\t\t\t\t\t\t\t',
u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tEarly Access\t\t\t\t\t\t\t\t\t\t\t\t',
u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tFirst-Person\t\t\t\t\t\t\t\t\t\t\t\t',
u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tViolent\t\t\t\t\t\t\t\t\t\t\t\t',
u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tStrategy\t\t\t\t\t\t\t\t\t\t\t\t',
u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tThird Person\t\t\t\t\t\t\t\t\t\t\t\t',
u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tCompetitive\t\t\t\t\t\t\t\t\t\t\t\t',
u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tTeam-Based\t\t\t\t\t\t\t\t\t\t\t\t',
u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tDifficult\t\t\t\t\t\t\t\t\t\t\t\t',
u'\r\n\t\t\t\t\t\t\t\t\t\t\t\tSimulation\t\t\t\t\t\t\t\t\t\t\t\t'],
My objective is to remove all the "\r\n\t.....\t" sequences from each value.
Any ideas?
Thanks!
Upvotes: 11
Views: 9770
Reputation: 401
Since you are using the Scrapy framework, you can use w3lib, a library that is installed alongside Scrapy:
import w3lib.html
output = w3lib.html.remove_tags(input)
print(output)
scrapy.utils.markup was deprecated in 2019, so please use w3lib instead.
You can refer to https://w3lib.readthedocs.io/en/latest/index.html for more info.
Upvotes: 27
Reputation: 59
Simply use remove_tags:
from scrapy.utils.markup import remove_tags
ToRemove = remove_tags(YourOutPut)
print(ToRemove)
This will solve your problem
Upvotes: 3
Reputation: 10220
Using strip() is one way to do it. However, if you would like to achieve this entirely using XPath, take a look at the normalize-space function. In your case, just change the extraction of the values to:
item['gametags'] = [a.xpath('normalize-space(.)').extract_first() for a in sel.xpath('.//a')]
item['gametitle'] = sel.xpath('normalize-space(//html/body/div[1]/div[7]/div[3]/div[1]/div[2]/div[2]/div[2]/div/div[3])').extract_first()
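For readers who want to see what normalize-space does to a value, here is a rough pure-Python approximation (not the XPath implementation itself). The helper name normalize_space and the sample string are made up for illustration. One difference to keep in mind: Python's split() treats more characters as whitespace than XPath does, so this is only a sketch.

```python
def normalize_space(text):
    """Approximate XPath normalize-space() for a Python string (illustrative helper)."""
    # str.split() with no arguments splits on runs of whitespace and drops
    # leading/trailing whitespace; joining with a single space collapses
    # every internal run to one space.
    return " ".join(text.split())


# Hypothetical raw value, shaped like the output shown in the question.
raw = "\r\n\t\t\t\tThird-Person   Shooter\t\t\t\t"
print(normalize_space(raw))
```

Unlike strip(), this also collapses whitespace inside the value, which may or may not be what you want.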
Upvotes: 0
Reputation: 1198
You can use the strip method. Since you are using extract(), which returns a list, you can try this:
item['gametags'] = list(map(str.strip, sel.xpath('.//a/text()').extract()))
item['gametitle'] = list(map(str.strip, sel.xpath('//html/body/div[1]/div[7]/div[3]/div[1]/div[2]/div[2]/div[2]/div/div[3]/text()').extract()))
You can also follow this blog article for steam scraping
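The map-based approach can be tried outside of Scrapy on a plain list. In this sketch, raw_tags is a hypothetical stand-in for what extract() returns. It assumes Python 3, where the extracted values are str objects; on Python 2 they are unicode objects, which str.strip will not accept, so a list comprehension calling .strip() on each value is likely the safer spelling there.

```python
# Hypothetical sample, shaped like the list that extract() returns.
raw_tags = [
    "\r\n\t\t\t\tSurvival\t\t\t\t",
    "\r\n\t\t\t\tShooter\t\t\t\t",
    "\r\n\t\t\t\tThird-Person Shooter\t\t\t\t",
]

# On Python 3, map() returns a lazy iterator, so list() materializes it.
clean_tags = list(map(str.strip, raw_tags))
print(clean_tags)
```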
Upvotes: 0
Reputation: 799
item['gametags'] = sel.xpath('.//a/text()').extract()
item['gametitle'] = sel.xpath('//html/body/div[1]/div[7]/div[3]/div[1]/div[2]/div[2]/div[2]/div/div[3]/text()').extract()
Strip your values while extracting, like this:
item['gametags'] = [val.strip() for val in sel.xpath('.//a/text()').extract()]
The same applies to your second extractor :)
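If you would rather not repeat the comprehension for every field, one option is a small helper that accepts either a single string or a list of strings. This is only a sketch: clean_field is a made-up name, and the item dict below is sample data, not real scraped output.

```python
def clean_field(value):
    """Strip whitespace from a string, or from each string in a list (illustrative helper)."""
    if value is None:
        # Nothing was extracted; pass the absence through unchanged.
        return None
    if isinstance(value, (list, tuple)):
        return [v.strip() for v in value]
    return value.strip()


# Hypothetical item with the same field names as in the question.
item = {
    "gametags": ["\r\n\t\tSurvival\t\t", "\r\n\t\tShooter\t\t"],
    "gametitle": ["\r\n\tExample Game\t"],
}
cleaned = {key: clean_field(value) for key, value in item.items()}
print(cleaned)
```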
Upvotes: 0
Reputation: 149
To get the title and tags accordingly, you can try the following script. To get rid of tabs and other whitespace, call .strip() on the result of .extract_first().
import scrapy

class SteamSpider(scrapy.Spider):
    name = 'steamspider'
    start_urls = ["https://store.steampowered.com/app/578080/PLAYERUNKNOWNS_BATTLEGROUNDS/",]

    def parse(self, response):
        title = response.xpath("//*[@class='apphub_AppName']/text()").extract_first().strip()
        tag_name = [item.strip() for item in response.xpath('//*[contains(@class,"popular_tags")]/*[@class="app_tag"]/text()').extract()]
        yield {"title":title,"tagname":tag_name}
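Since the goal in the question is a CSV file, here is a minimal sketch of how a record shaped like the dict yielded above could be written with the standard csv module. The record contents and the ";" separator are assumptions for illustration, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Hypothetical scraped record, shaped like the dict the spider yields.
record = {"title": "Example Game", "tagname": ["Survival", "Shooter", "Multiplayer"]}

buffer = io.StringIO()  # stands in for a real file opened with newline=""
writer = csv.DictWriter(buffer, fieldnames=["title", "tagname"])
writer.writeheader()
# Join the tag list into a single cell; ";" is an arbitrary separator choice.
writer.writerow({"title": record["title"], "tagname": ";".join(record["tagname"])})

csv_text = buffer.getvalue()
print(csv_text)
```

Scrapy can also export items itself (for example with the -o option of scrapy crawl), so writing the file by hand is mainly useful when you want control over how the tag list is flattened.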
Upvotes: 1
Reputation: 1161
The first thing to understand is that what you're trying to remove is not "HTML tags", but simply whitespace, most of which in your case is tab characters, with a few newlines thrown in. You might want to re-title your question to better express this.
As for stripping the whitespace, the HTML library you're using might provide a function for this.
If it doesn't, or in the more general case of this problem, Python strings have a strip method (and some relatives, lstrip and rstrip) that returns the string with all leading and trailing whitespace removed. Note that extract() returns a list, so strip has to be applied to each element. Thus, you could do something like:
item['field'] = [value.strip() for value in sel.xpath('...').extract()]
More info available in the Python manual: https://docs.python.org/2/library/string.html#string.strip
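To make the difference between strip and its relatives concrete, here is a short sketch on a made-up value shaped like the output in the question. repr() is used so the remaining whitespace stays visible when printed.

```python
# Hypothetical raw value with leading and trailing whitespace.
value = "\r\n\t\t\tBattle Royale\t\t\t"

print(repr(value.strip()))   # removes whitespace from both ends
print(repr(value.lstrip()))  # removes leading whitespace only
print(repr(value.rstrip()))  # removes trailing whitespace only
```

None of the three touches whitespace inside the value, so the space in the middle of the tag name is preserved.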
Upvotes: 0