Reputation: 91
I'm trying to scrape information from a list of different webpages. I was able to scrape the list from the site, and I can iterate over it just fine. Where I'm running into trouble is extracting some text that may or may not be present on each page. Originally I used an XPath, and that worked at first, but then the XPath changed. I thought I had fixed that, but then I found yet another XPath for the same information. Now I don't think XPath will work the way I'm trying to use it. Below are three examples that all look similar but have three different XPaths.
<h4 style="margin-bottom: 5px;">What our panel thought</h4>
<p>
"Light, well-rounded peach, kiwi, barnyard Brett, overly ripe pineapple. Clean lactic acidity, balanced, with restrained funk; lemony and floral; medium body allows acidity to cut through and finish medium-dry. Herbal flavor through the finish, notes of white wine."
</p>
Xpath:
//*[@id="article-body"]/div[3]/p[2]/text()
Selenium:
driver.find_element_by_xpath('//*[@id="article-body"]/div[3]/p[2]').text
<h4 style="margin-bottom: 5px;">What our panel thought</h4>
<p>
"The appearance of this beer begs the name to have the word ‘cloud’ in it. Deep golden haze with a billowy head. Wonderful nose with a blend of citrus and tropical fruits. Compelling flavor profile filled with a blend of orange, peach, pineapple, and guava. Soft pillowy body with a more assertive finish that brings some bitterness to the table to scrub the palate for another sip. Slight hops burn. Pretty awesome beer for which we would gladly regularly reserve a spot in our fridge."
</p>
Xpath:
//*[@id="article-body"]/div[2]/p[2]/text()
Selenium:
driver.find_element_by_xpath('//*[@id="article-body"]/div[2]/p[2]').text
<h4 style="margin-bottom: 5px;">What our panel thought</h4>
<p>
<strong>Aroma:</strong>
“Pumpkin notes and a touch of caramel malt with some clove, cinnamon, and nutmeg. This smells like a pumpkin-beer pie: crust, spice, warm, and some malt to make you think beer.”
</p>
<p>
<strong>Flavor:</strong>
“Where the nose was fairly mild, the flavor is much more interesting—a rich malt sweetness up front buffers the clove, nutmeg, and cinnamon. Notes of caramel and toffee with a bit of brown sugar, pumpkin, ginger, and vanilla. Hops bitterness balances nicely. More drinkable than one might expect—it’s not a big and heavy fall seasonal. Toasty crust lingers, reminds of pie. Finishes a bit sweet but nice for the style.”
</p>
<p>
<strong>Overall:</strong>
“Well-crafted pumpkin beer with a nice malt base and a compelling blend of spices. The spicing is bold but balanced. The spices and malt complexity are a delight. Everything works together to make a classic pumpkin beer.”
</p>
Xpaths:
//*[@id="article-body"]/div[3]/p[2]/text()
//*[@id="article-body"]/div[3]/p[3]/text()
//*[@id="article-body"]/div[3]/p[4]/text()
The first two instances were easy to get around using try/except. The last one is what really gave me trouble, because the text is broken up into 3 different <p> tags.
What I want is all the text after <h4 style="margin-bottom: 5px;">What our panel thought</h4>. I also want to be able to put all the text together into a list like this:
['Light, well-rounded peach, kiwi, barnyard Brett, overly ripe pineapple. Clean lactic acidity, balanced, with restrained funk; lemony and floral; medium body allows acidity to cut through and finish medium-dry. Herbal flavor through the finish, notes of white wine.',
'The appearance of this beer begs the name to have the word ‘cloud’ in it. Deep golden haze with a billowy head. Wonderful nose with a blend of citrus and tropical fruits. Compelling flavor profile filled with a blend of orange, peach, pineapple, and guava. Soft pillowy body with a more assertive finish that brings some bitterness to the table to scrub the palate for another sip. Slight hops burn. Pretty awesome beer for which we would gladly regularly reserve a spot in our fridge.',
'Pumpkin notes and a touch of caramel malt with some clove, cinnamon, and nutmeg. This smells like a pumpkin-beer pie: crust, spice, warm, and some malt to make you think beer. Where the nose was fairly mild, the flavor is much more interesting—a rich malt sweetness up front buffers the clove, nutmeg, and cinnamon. Notes of caramel and toffee with a bit of brown sugar, pumpkin, ginger, and vanilla. Hops bitterness balances nicely. More drinkable than one might expect—it’s not a big and heavy fall seasonal. Toasty crust lingers, reminds of pie. Finishes a bit sweet but nice for the style. Well-crafted pumpkin beer with a nice malt base and a compelling blend of spices. The spicing is bold but balanced. The spices and malt complexity are a delight. Everything works together to make a classic pumpkin beer.']
I'm guessing I can't use a positional XPath for this, but I'm pretty new to web scraping with Selenium, so I'm not sure what the best course of action is. Any suggestions would be appreciated.
EDIT: I should add that there are multiple <h4> tags with <p> tags under //*[@id="article-body"]. I'm looking to get the specific paragraphs following <h4 style="margin-bottom: 5px;">What our panel thought</h4>.
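For illustration only, one way to avoid positional XPaths altogether is to parse the page source and keep the <p> text that sits between the panel heading and the next <h4>. The sketch below uses Python's standard-library html.parser. The names PanelTextParser and panel_text_from_html are hypothetical, and it assumes the heading is closed with a proper </h4> and that the next <h4> starts a different section. Text inside <strong> (the "Aroma:" style labels) is skipped so the result matches the list above.

```python
from html.parser import HTMLParser


class PanelTextParser(HTMLParser):
    """Collect <p> text between the panel <h4> and the next <h4> (hypothetical helper)."""

    def __init__(self, heading="What our panel thought"):
        super().__init__()
        self.heading = heading
        self.in_h4 = False       # currently inside an <h4>
        self.h4_text = ""        # text of the <h4> being read
        self.in_section = False  # True after the panel heading, until the next <h4>
        self.in_p = False
        self.in_strong = False   # labels such as "Aroma:" are skipped
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "h4":
            # Any new heading ends the previous section.
            self.in_h4, self.h4_text, self.in_section = True, "", False
        elif tag == "p":
            self.in_p = True
        elif tag == "strong":
            self.in_strong = True

    def handle_endtag(self, tag):
        if tag == "h4":
            self.in_h4 = False
            self.in_section = self.h4_text.strip() == self.heading
        elif tag == "p":
            self.in_p = False
        elif tag == "strong":
            self.in_strong = False

    def handle_data(self, data):
        if self.in_h4:
            self.h4_text += data
        elif self.in_section and self.in_p and not self.in_strong:
            # Drop surrounding whitespace and straight or curly double quotes.
            text = data.strip().strip('"“”').strip()
            if text:
                self.chunks.append(text)


def panel_text_from_html(html):
    """Return the panel paragraphs of one page joined into a single string."""
    parser = PanelTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

With Selenium this could be called as panel_text_from_html(driver.page_source) inside the loop over pages, appending each result to the list. A page without the heading yields an empty string.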
Upvotes: 0
Views: 818
Reputation: 91
I was able to get it to work using:
driver.find_element_by_xpath('//h4[contains(text(),"What our panel thought")]//following-sibling::p').text
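Note that find_element returns only the first matching <p>, so for the page with three paragraphs the line above would give just the first one. A minimal sketch of a variant that collects every paragraph under the heading is below. It assumes the <p> tags are siblings of the <h4> (as the XPath above implies), and the names PANEL_XPATH and panel_text are made up for illustration. It uses find_elements(by, value), since recent Selenium 4 releases no longer provide the find_element_by_* helpers. The string "xpath" is the value of By.XPATH.

```python
PANEL_HEADING = "What our panel thought"

# Select only the <p> siblings whose nearest preceding <h4> is the panel heading,
# so paragraphs that belong to later <h4> sections are left out.
PANEL_XPATH = (
    f'//h4[normalize-space()="{PANEL_HEADING}"]'
    f'/following-sibling::p[preceding-sibling::h4[1][normalize-space()="{PANEL_HEADING}"]]'
)


def panel_text(driver):
    """Return the panel paragraphs of the current page joined into one string."""
    # find_elements returns an empty list when nothing matches, so no try/except is needed.
    paragraphs = driver.find_elements("xpath", PANEL_XPATH)
    # Strip surrounding whitespace and straight or curly double quotes from each paragraph.
    cleaned = [p.text.strip().strip('"“”').strip() for p in paragraphs]
    return " ".join(text for text in cleaned if text)
```

Calling panel_text(driver) after each driver.get(url) and appending the result gives one string per page. Because .text includes the visible text of child elements, labels such as "Aroma:" stay in the result and would need to be removed separately.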
Upvotes: 0