Reputation: 125
Ok, so I have gotten really close and need help crossing the finish line. I have two pieces of text that I want to grab using Scrapy. Here is the format:
<html>
<div id="product-description">
<p>
Blah blah blah text text text reads: 874620. more text text.
<br>
<br>
<b>Brand:</b>
" Nintendo"
<br>
<b>Condition:</b>
" Good"
</p>
</div>
</html>
So far I am only able to grab the bolded headings (Brand:, Condition:), not the text I actually want (Nintendo, Good). Similarly with regex, I am only grabbing "reads:" instead of the string immediately after which is what I want (874620). Here is where I am:
response.xpath('//div[@id="product-description"]//p/b').extract_first()
response.xpath('//div[@id="product-description"]//p').re(r'reads.*')
Upvotes: 2
Views: 893
Reputation: 3744
You can extract the whole text of the <p> tag and then run a regex over it to pull out the relevant information.
Example code:
import re
from scrapy.selector import Selector
html = '''<html>
<div id="product-description">
<p>
Blah blah blah text text text reads: 874620. more text text.
<br>
<br>
<b>Brand:</b>
" Nintendo"
<br>
<b>Condition:</b>
" Good"
</p>
</div>
</html>'''
extracted_text = Selector(text=html).xpath('//div[@id="product-description"]//p//text()').extract()
text = u''.join(extracted_text)
regex = r'reads:\s*(?P<reads>\d+).*Brand:\s*" (?P<brand>\w+)".*Condition:\s*" (?P<condition>\w+)"'
results = re.search(regex, text, flags=re.DOTALL).groupdict()
results['reads'] = int(results['reads'])
print(results)
This code outputs:
{'reads': 874620, 'brand': u'Nintendo', 'condition': u'Good'}
Let's see what this code does.
xpath
First, extracted_text takes all the text inside the <p> tag using the XPath //div[@id="product-description"]//p//text()
This XPath means: find the <div> whose id is product-description, then any <p> beneath it, then every text node beneath that <p>.
Note: // instead of / means the search includes not only direct children but also their children, and so on down the tree.
Running this XPath returns a list of strings, one for each text node inside the <p> tag we found.
After the XPath, we concatenate this list into one large string with u''.join(extracted_text).
With the full text in hand, we can run a regex to extract the relevant data from it.
regex
Let's break down the regex and see what it means:
reads:\s*(?P<reads>\d+).*Brand:\s*" (?P<brand>\w+)".*Condition:\s*" (?P<condition>\w+)"
reads:\s*(?P<reads>\d+) - find the string reads: followed by zero or more whitespace characters (\s*), and create a match group called reads containing \d+, meaning one or more digits.
.*Brand:\s*" (?P<brand>\w+)" - the above is followed by zero or more characters of any kind, then the string Brand:, then again \s* (zero or more whitespace characters), then a double quote and a single space. After this comes another group called brand, which contains \w+, meaning one or more word characters (letters, digits, or underscore).
.*Condition:\s*" (?P<condition>\w+)" - the same as the second part above, but creating a match group for condition.
The regex is executed with the DOTALL flag, which makes the . character match every character including newlines; this is needed because our match spans multiple lines.
After running the regex we extract the three match groups and convert the reads match from a string to an int.
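As a hedged variation on the same regex, the sketch below uses only the standard re module and returns None instead of raising when the text does not have the expected layout. The helper name parse_description and the constant PATTERN are hypothetical, and the sample text is only shaped like the joined text nodes from the example.

```python
import re

# Text shaped like the joined text nodes from the example above.
text = '''
Blah blah blah text text text reads: 874620. more text text.

Brand:
" Nintendo"

Condition:
" Good"
'''

# Same pattern as in the answer, split across lines for readability.
PATTERN = re.compile(
    r'reads:\s*(?P<reads>\d+)'
    r'.*Brand:\s*" (?P<brand>\w+)"'
    r'.*Condition:\s*" (?P<condition>\w+)"',
    flags=re.DOTALL,  # let "." match newlines, since the fields span lines
)


def parse_description(text):
    """Return the three fields as a dict, or None if the layout differs."""
    match = PATTERN.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields['reads'] = int(fields['reads'])  # digits captured as a string
    return fields


results = parse_description(text)
no_match = parse_description('no fields here')
print(results)
```

Checking for None matters when scraping many pages, because re.search returns None on a page that lacks one of the fields and calling .groupdict() on it would raise an AttributeError.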
I uploaded this example here; it contains detailed info and is interactive, so you can try it yourself.
Upvotes: 1
Reputation: 21436
For the Nintendo and Good values you can use the following-sibling axis:
In [1]: sel.xpath('//div[@id="product-description"]//b/following-sibling::text()[1]').extract()
Out[1]: [u'\n " Nintendo"\n ', u'\n " Good"\n ']
You can add a regex to drop the surrounding newlines and quotes (a leading space remains):
In [2]: sel.xpath('//div[@id="product-description"]//b/following-sibling::text()[1]').re('"(.+)"')
Out[2]: [u' Nintendo', u' Good']
Regarding your second issue with regex, try this one:
In [3]: sel.xpath('//div[@id="product-description"]//p').re(r'reads: (\d+)')
Out[3]: [u'874620']
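If you prefer to clean the values in plain Python rather than in the regex, a minimal sketch could look like the following. The strings are copied from Out[1] above, and the helper name clean is hypothetical.

```python
# Raw strings copied from Out[1] above.
raw_values = ['\n " Nintendo"\n ', '\n " Good"\n ']


def clean(value):
    """Strip outer whitespace, then the quotes, then the inner space."""
    return value.strip().strip('"').strip()


brand, condition = [clean(v) for v in raw_values]
print(brand, condition)
```

This also removes the leading space that the regex version leaves in place.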
Upvotes: 1