thaneofcawdor

Reputation: 125

Grabbing text inside a <p> tag in between <b> tags plus a regex issue

Ok, so I have gotten really close and need help crossing the finish line. I have two pieces of text that I want to grab using Scrapy. Here is the format:

<html>
  <div id="product-description">
    <p>
      Blah blah blah text text text reads: 874620. more text text.
      <br>
      <br>
      <b>Brand:</b>
      " Nintendo"
      <br>
      <b>Condition:</b>
      " Good"
    </p>
  </div>
</html>

So far I am only able to grab the bolded headings (Brand:, Condition:), not the text I actually want (Nintendo, Good). Similarly with regex, I am only grabbing "reads:" instead of the string immediately after which is what I want (874620). Here is where I am:

response.xpath('//div[@id="product-description"]//p/b').extract_first()

response.xpath('//div[@id="product-description"]//p').re(r'reads.*')

Upvotes: 2

Views: 893

Answers (2)

ShmulikA

Reputation: 3744

You can extract the whole text of the <p> tag and then run a regex over it to pull out the relevant information.

Example code:

import re
from scrapy.selector import Selector

html = '''<html>
  <div id="product-description">
    <p>
      Blah blah blah text text text reads: 874620. more text text.
      <br>
      <br>
      <b>Brand:</b>
      " Nintendo"
      <br>
      <b>Condition:</b>
      " Good"
    </p>
  </div>
</html>'''

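# collect every text node under the <p> tag, then join them into one string
extracted_text = Selector(text=html).xpath('//div[@id="product-description"]//p//text()').extract()
text = u''.join(extracted_text)

# named groups capture the three values; DOTALL lets "." cross line breaks
regex = r'reads:\s*(?P<reads>\d+).*Brand:\s*" (?P<brand>\w+)".*Condition:\s*" (?P<condition>\w+)"'
results = re.search(regex, text, flags=re.DOTALL).groupdict()

# the reads value is captured as a string, so convert it to an int
results['reads'] = int(results['reads'])
print(results)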

This code outputs:

{'reads': 874620, 'brand': u'Nintendo', 'condition': u'Good'}

Update:

Let's see what this code does:

xpath

First, extracted_text takes all the text inside the <p> tag using the XPath //div[@id="product-description"]//p//text(). This XPath means:

  • give me all the divs that have an id attribute equal to "product-description"
  • give me all the p tags inside those divs
  • get the text out of these p tags

Note: using // instead of / means the search also includes the children of children, their children, and so on (all descendants, not only direct children).

Running this XPath returns a list of strings, one for each text node inside the <p> tag we found.

After the XPath, we concatenate this list into one large string using u''.join(extracted_text).

With the full text in hand, we can run a regex to extract the relevant data from it.
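To see the "list of text nodes, then join" step in isolation, here is a minimal sketch that uses only the standard library. It is a stand-in, not Scrapy: xml.etree.ElementTree parses XML, so the snippet below is an assumed well-formed variant of the question's markup (<br/> instead of <br>), and itertext() plays roughly the role that //p//text() plays in the XPath.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the question's snippet (<br/> instead of <br>),
# because ElementTree parses XML, not HTML. This is an assumption made
# for the sake of a dependency-free illustration.
doc = """<html>
  <div id="product-description">
    <p>
      Blah blah blah text text text reads: 874620. more text text.
      <br/>
      <br/>
      <b>Brand:</b>
      " Nintendo"
      <br/>
      <b>Condition:</b>
      " Good"
    </p>
  </div>
</html>"""

root = ET.fromstring(doc)
div = root.find(".//div[@id='product-description']")
p = div.find(".//p")

# itertext() yields every text node under <p> in document order,
# including the text inside the nested <b> tags.
pieces = list(p.itertext())
text = "".join(pieces)

# p.text is only the text before the first child tag, so the
# Brand and Condition parts are missing from it.
first_only = p.text

print(pieces)
print(repr(first_only))
```

The joined string contains both the headings and the values, which is why a single regex can later be run across all of it.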

regex

Let's break down the regex and see what it means:

reads:\s*(?P<reads>\d+).*Brand:\s*" (?P<brand>\w+)".*Condition:\s*" (?P<condition>\w+)"

reads:\s*(?P<reads>\d+) - find a string starting with reads:, followed by zero or more whitespace characters (\s*), and create a match group called reads containing \d+, meaning one or more digits.

.*Brand:\s*" (?P<brand>\w+)" - the above is followed by zero or more characters of any kind, then the string Brand:, then \s* again (zero or more whitespace characters), followed by a double quote and a single space. After this, create another group called brand containing \w+, meaning one or more word characters (letters, digits, or underscore).

.*Condition:\s*" (?P<condition>\w+)" - this works the same way as the second part above and creates a matching group for condition.

The regex is executed with the DOTALL flag, which makes the . character match every character, including newlines. This is needed because our match spans multiple lines.

After running the regex, we extract the three matching groups and convert the reads match from a string to an int.
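The effect of the DOTALL flag is easy to check on its own. The sketch below runs the same pattern with and without the flag; the text variable is an assumed sample of what the joined string might look like, not the exact output of the XPath step.

```python
import re

# Assumed sample of the joined text, spread over several lines.
text = '''
      Blah blah blah text text text reads: 874620. more text text.

      Brand:
      " Nintendo"

      Condition:
      " Good"
'''

# Same pattern as in the answer, split over three lines for readability.
pattern = (
    r'reads:\s*(?P<reads>\d+)'
    r'.*Brand:\s*" (?P<brand>\w+)"'
    r'.*Condition:\s*" (?P<condition>\w+)"'
)

# Without DOTALL, "." cannot cross a newline, so ".*" never reaches "Brand:".
without_dotall = re.search(pattern, text)

# With DOTALL, ".*" can span the line breaks between the three parts.
with_dotall = re.search(pattern, text, flags=re.DOTALL)

print(without_dotall)

fields = None
if with_dotall is not None:
    fields = with_dotall.groupdict()
    fields['reads'] = int(fields['reads'])  # captured as str, convert to int
print(fields)
```

Checking for None before calling groupdict() is worth keeping in a real spider, since re.search returns None whenever a page does not follow the expected layout.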

I uploaded this example here; it includes more detail and is interactive, so you can try it yourself.

Upvotes: 1

Granitosaurus

Reputation: 21436

For the Nintendo and Good values, you can use the following-sibling axis:

In [1]: sel.xpath('//div[@id="product-description"]//b/following-sibling::text()[1]').extract()
Out[1]: [u'\n      " Nintendo"\n      ', u'\n      " Good"\n    ']

You can add a regex to strip the surrounding newlines and quotes:

In [2]: sel.xpath('//div[@id="product-description"]//b/following-sibling::text()[1]').re('"(.+)"')
Out[2]: [u' Nintendo', u' Good']
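The values in Out[2] still carry a leading space. One way to finish the cleanup in plain Python, sketched here with the strings from Out[1] hard-coded as sample input, is to strip whitespace and quotes and pair each value with its heading. The labels list and the clean helper are hypothetical names introduced for illustration; labels assumes the <b> text has been extracted separately.

```python
# Assumed result of selecting the text of the <b> tags.
labels = ['Brand:', 'Condition:']

# Sample values copied from the Out[1] listing.
raw_values = ['\n      " Nintendo"\n      ', '\n      " Good"\n    ']


def clean(value):
    """Strip outer whitespace, then the quotes, then the inner padding."""
    return value.strip().strip('"').strip()


# Turn "Brand:" into the key "brand" and attach the cleaned value.
record = {
    label.rstrip(':').lower(): clean(value)
    for label, value in zip(labels, raw_values)
}

print(record)
```

zip stops at the shorter list, so a heading without a following text node would be dropped silently; a length check before pairing may be worth adding.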

Regarding your second issue, with the regex, try this one:

In [3]: sel.xpath('//div[@id="product-description"]//p').re(r'reads: (\d+)')
Out[3]: [u'874620']
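The difference comes from the capture group. As far as I know, Scrapy's .re() returns the captured groups when the pattern has any and the whole match otherwise; the standard library's re.findall follows the same convention, so it is used below as a stand-in on an assumed one-line sample of the paragraph text.

```python
import re

# Assumed sample: the first line of the paragraph from the question.
text = 'Blah blah blah text text text reads: 874620. more text text.'

# No capture group: the whole match is returned, starting at "reads".
whole_match = re.findall(r'reads.*', text)

# One capture group: only the part inside the parentheses is returned.
only_digits = re.findall(r'reads: (\d+)', text)

print(whole_match)
print(only_digits)
```

Note the raw string prefix (r'...'), which keeps \d from being treated as a string escape.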

Upvotes: 1
