Reputation: 950
I am building a crawler using Scrapy. I need to get the font-family assigned to a particular HTML element.
Let's say there is a css file, styles.css, which contains the following:
p {
    font-family: "Times New Roman", Georgia, Serif;
}
And in the HTML page there is text as follows:
<p>Hello how are you?</p>
It's easy for me to extract the text using Scrapy; however, I would also like to know the font-family applied to "Hello how are you?".
I am hoping it is simply a case of something like the (imaginary) XPath /p[font-family].
Do you know how I can do this?
Thanks for your thoughts.
Upvotes: 1
Views: 2569
Reputation: 21406
You need to download and parse the CSS separately. For CSS parsing you can use tinycss or even a regex:
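import tinycss
from scrapy import Request, Spider


class MySpider(Spider):
    name = 'myspider'
    start_urls = [
        'http://some.url.com'
    ]
    # maps each CSS selector to the list of its declaration values
    css_rules = {}

    def parse(self, response):
        # find css url and parse it (the XPath is left blank here; fill in
        # one that selects your stylesheet link)
        css_url = response.xpath("").extract_first()
        # urljoin resolves a relative stylesheet href against the page URL
        yield Request(response.urljoin(css_url), self.parse_css)

    def parse_css(self, response):
        parser = tinycss.make_parser()
        # parse_stylesheet expects text, so pass response.text, not response.body
        stylesheet = parser.parse_stylesheet(response.text)
        for rule in stylesheet.rules:
            # at-rules such as @import have no selector; skip them
            if not getattr(rule, 'selector', None):
                continue
            path = rule.selector.as_css()
            css = [d.value.as_css() for d in rule.declarations]
            self.css_rules[path] = css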
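If you would rather not depend on tinycss, the regex route mentioned above could look roughly like this. It is only a sketch: `parse_css_with_regex` is a name made up for this example, it keeps property names alongside values (unlike the spider code above), and it only copes with flat rules, so @media blocks, braces or semicolons inside strings, and similar cases are not handled.

```python
import re

# Matches "selector { declarations }" for flat rules only.
RULE_RE = re.compile(r'([^{}]+)\{([^{}]*)\}')
# Matches /* ... */ comments, including multi-line ones.
COMMENT_RE = re.compile(r'/\*.*?\*/', re.DOTALL)


def parse_css_with_regex(css_text):
    """Return {selector: {property: value}} for a simple stylesheet."""
    rules = {}
    css_text = COMMENT_RE.sub('', css_text)
    for selector, body in RULE_RE.findall(css_text):
        declarations = {}
        for declaration in body.split(';'):
            # Split "name: value" on the first colon; skip empty fragments.
            name, sep, value = declaration.partition(':')
            if sep:
                declarations[name.strip().lower()] = value.strip()
        # Merge declarations if the same selector appears more than once.
        rules.setdefault(selector.strip(), {}).update(declarations)
    return rules


css = 'p {\n    font-family: "Times New Roman", Georgia, Serif;\n}'
rules = parse_css_with_regex(css)
print(rules['p']['font-family'])
```

Either way, the result is a dictionary keyed by selector.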
Now you have a dictionary of CSS selectors and their declaration values that you can use later in your spider's request chain to assign some values:
    def parse_item(self, response):
        item = {}
        item['name'] = response.css('div.name').extract_first()
        # collect the declarations of every rule whose selector mentions div and .name
        name_css = []
        for k, v in self.css_rules.items():
            if 'div' in k and '.name' in k:
                name_css.append(v)
        item['name_css'] = name_css
        yield item
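For the font-family case from the question, the lookup could be sketched like this. It assumes the stylesheet has been parsed into a mapping of selector to {property: value}, which is a different shape from the `css_rules` lists above. `font_family_for_tag` and the sample rules are invented for illustration. It only matches bare tag selectors and ignores specificity, inheritance, classes and inline styles, so treat it as a rough approximation of what a browser would compute.

```python
def font_family_for_tag(rules, tag):
    """Return the last font-family declared for a bare tag selector, or None."""
    font_family = None
    for selector, declarations in rules.items():
        # A rule such as "h1, p" applies to each comma-separated selector.
        targets = [part.strip().lower() for part in selector.split(',')]
        if tag.lower() in targets and 'font-family' in declarations:
            # Later rules overwrite earlier ones (dicts keep insertion order).
            font_family = declarations['font-family']
    return font_family


# Hypothetical parsed stylesheet, used only for this example.
rules = {
    'body': {'font-family': 'Arial, sans-serif'},
    'h1, p': {'font-family': '"Times New Roman", Georgia, Serif'},
    'p.note': {'color': 'gray'},
}
print(font_family_for_tag(rules, 'p'))
```

If you need the value the browser actually ends up using, you would have to render the page with something like a headless browser, since static parsing cannot resolve the full cascade.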
Upvotes: 1