need_halp

Reputation: 115

Web scraping in Python: copying a specific part of the HTML for each webpage

I am working on a web scraper using requests-html and Beautiful Soup (I am new to this). For one webpage (https://www.lookfantastic.com/illamasqua-artistry-palette-experimental/11723920.html) I am trying to scrape one part of the page, and I will then repeat this for other products. The HTML looks like:

<span class="js-enhanced-ecommerce-data hidden" data-product-title="Illamasqua Expressionist Artistry Palette" data-product-id="12024086" data-product-category="" data-product-is-master-product-id="false" data-product-master-product-id="12024086" data-product-brand="Illamasqua" data-product-price="£39.00" data-product-position="1">
</span>

I want to select data-product-brand="Illamasqua", specifically the value Illamasqua. I am not sure how to grab this using requests-html or Beautiful Soup. I tried:

r.html.find("span.data-product-brand", first=True)

But this was unsuccessful. Any help would be appreciated.

Upvotes: 0

Views: 507

Answers (2)

sinisake

Reputation: 11328

You can select elements that have a given data attribute directly, using a CSS attribute selector:

from requests_html import HTMLSession
session = HTMLSession()

r = session.get('https://www.lookfantastic.com/illamasqua-artistry-palette-experimental/11723920.html')
span = r.html.find('[data-product-brand]', first=True)
# attrs is a dict of the element's HTML attributes
print(span.attrs['data-product-brand'])

This selector matches three elements on the page; I assume you only need the first.
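To show why the selector in the question found nothing, here is a small offline sketch. It uses Beautiful Soup's select_one rather than requests-html so that it runs without a network call, and it assumes the span looks like the one quoted in the question (trimmed to a few attributes). A dot in a CSS selector means "class", while square brackets mean "has this attribute".

```python
from bs4 import BeautifulSoup

# The span from the question, trimmed to the relevant attributes.
html = (
    '<span class="js-enhanced-ecommerce-data hidden" '
    'data-product-title="Illamasqua Expressionist Artistry Palette" '
    'data-product-brand="Illamasqua" data-product-price="£39.00"></span>'
)
soup = BeautifulSoup(html, "html.parser")

# "span.data-product-brand" is a class selector: it looks for
# class="data-product-brand", which this span does not have.
by_class = soup.select_one("span.data-product-brand")

# "span[data-product-brand]" is an attribute selector and matches the span.
by_attr = soup.select_one("span[data-product-brand]")
brand = by_attr["data-product-brand"]

print(by_class)
print(brand)
```

The first lookup comes back empty and the second yields the attribute value, which is the same difference as between the selector in the question and the one in this answer.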

Upvotes: 1

Jonathan Leon

Reputation: 5648

Because you tagged beautifulsoup, here's a solution using that package:

from bs4 import BeautifulSoup
import requests

page = requests.get('https://www.lookfantastic.com/illamasqua-artistry-palette-experimental/11723920.html')
soup = BeautifulSoup(page.content, "html.parser")

# Several elements with this class carry the brand 'Illamasqua', which I think is what you want in the end.
# You can loop through them and get the brand like this; on this page there are three.
for tag in soup.find_all(class_="js-enhanced-ecommerce-data hidden"):
    print(tag.get('data-product-brand'))

# If it's always going to be the first match, you can just do this.
print(soup.find(class_="js-enhanced-ecommerce-data hidden").get('data-product-brand'))
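Since the question mentions repeating this for other products, the lookup could be wrapped in a function. This is only a sketch: extract_brand and brand_for_url are hypothetical helper names, the 10-second timeout is an arbitrary choice, and it assumes every product page exposes a data-product-brand attribute the same way this one does.

```python
from typing import Optional

import requests
from bs4 import BeautifulSoup


def extract_brand(html: str) -> Optional[str]:
    """Return the first data-product-brand value in the HTML, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # attrs={"data-product-brand": True} matches any tag that has the attribute.
    tag = soup.find(attrs={"data-product-brand": True})
    if tag is None:
        return None
    return tag["data-product-brand"]


def brand_for_url(url: str) -> Optional[str]:
    """Fetch one product page and extract its brand (hypothetical helper)."""
    page = requests.get(url, timeout=10)
    page.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return extract_brand(page.text)
```

You would then call brand_for_url(url) for each URL in your own list of product pages. Matching on the attribute instead of the class string means a page still works if the class list on the span differs, and a page without the attribute gives None rather than raising an AttributeError.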

Upvotes: 2
