Ghofran Challouf

Reputation: 47

I need to extract the ID from the class text

I have the HTML code below:

<div class="col-12 col-sm-6 col-md-6 col-xl-4 product type-product post-31121 status-publish first instock product_cat-tyres has-post-thumbnail purchasable product-type-simple">
<div class="col-12 col-sm-6 col-md-6 col-xl-4 product type-product post-31301 status-publish instock product_cat-tyres has-post-thumbnail purchasable product-type-simple">
<div class="col-12 col-sm-6 col-md-6 col-xl-4 product type-product post-28416 status-publish last instock product_cat-tyres has-post-thumbnail purchasable product-type-simple">

I need to extract the ID of each product from its class attribute using Beautiful Soup (the IDs here are 31121, 31301 and 28416). How can I do that?

Upvotes: 0

Views: 45

Answers (2)

HedgeHog

Reputation: 25196

Iterate over your selection, read each element's class attribute, then iterate over its classes and pick the one that starts with post-:

[c.split('-')[-1] for e in soup.select('div.type-product') for c in e['class'] if c.startswith('post-')]

or

[c.split('-')[-1] for e in soup.select('div[class*="post-"]') for c in e['class'] if c.startswith('post-')]

Example

html = '''
<div class="col-12 col-sm-6 col-md-6 col-xl-4 product type-product post-31121 status-publish first instock product_cat-tyres has-post-thumbnail purchasable product-type-simple">
<div class="col-12 col-sm-6 col-md-6 col-xl-4 product type-product post-31301 status-publish instock product_cat-tyres has-post-thumbnail purchasable product-type-simple">
<div class="col-12 col-sm-6 col-md-6 col-xl-4 product type-product post-28416 status-publish last instock product_cat-tyres has-post-thumbnail purchasable product-type-simple">
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

[c.split('-')[-1] for e in soup.select('div.type-product') for c in e['class'] if c.startswith('post-')]

output

['31121', '31301', '28416']
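
A slightly stricter variant of the class filtering step, as a minimal sketch: it assumes the ID is always a run of digits after post-, so a class such as post-type-simple would be ignored. The helper name extract_post_id and the sample class lists are hypothetical, and it works on a plain list of class names, which is what e['class'] returns.

```python
import re

# Assumption: the product ID is the digits-only suffix of a "post-<digits>" class.
POST_CLASS = re.compile(r"post-(\d+)")

def extract_post_id(classes):
    """Return the ID from the first class like 'post-31121' as an int, or None."""
    for name in classes:
        match = POST_CLASS.fullmatch(name)
        if match:
            return int(match.group(1))
    return None

# Hypothetical class lists, shaped like the value of e['class'].
sample = ["product", "type-product", "post-31121", "status-publish", "has-post-thumbnail"]
print(extract_post_id(sample))                           # 31121
print(extract_post_id(["product", "post-type-simple"]))  # None
```

With the soup from the example, [extract_post_id(e['class']) for e in soup.select('div.type-product')] should give the same IDs as integers instead of strings.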

Upvotes: 0

stackoverflow1187

Reputation: 11

  • Select all the divs that have a class name starting with post-.
  • Iterate over the class names of each div and keep the one that starts with post-.
  • Add the post ID to the list.

import re

html_attr='''
<div class="col-12 col-sm-6 col-md-6 col-xl-4 product type-product post-31121 status-publish first instock product_cat-tyres has-post-thumbnail purchasable product-type-simple">
<div class="col-12 col-sm-6 col-md-6 col-xl-4 product type-product post-31301 status-publish instock product_cat-tyres has-post-thumbnail purchasable product-type-simple">
<div class="col-12 col-sm-6 col-md-6 col-xl-4 product type-product post-28416 status-publish last instock product_cat-tyres has-post-thumbnail purchasable product-type-simple">'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_attr, 'html.parser')
# class is multi-valued, so the pattern is tested against each class name
div_list = soup.find_all('div', {"class": re.compile("^post-")})
id_list = []
for div in div_list:
    # keep the part after 'post-' from the first matching class name
    post_id = [name.split('-')[1] for name in div['class'] if name.startswith('post-')][0]
    id_list.append(post_id)

print(id_list)

Output

['31121', '31301', '28416']
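
It may help to see why the ^post- pattern matches even though every class attribute here begins with col-12. As far as I understand, Beautiful Soup treats class as a multi-valued attribute and tests the pattern against each class name separately. The sketch below imitates that with plain re only; it does not call Beautiful Soup, and the shortened class_attr string is an assumption made for brevity.

```python
import re

pattern = re.compile(r"^post-")

# Shortened, hypothetical class attribute value.
class_attr = "col-12 product type-product post-31121 status-publish"

# Against the whole string the ^ anchor fails, because the string starts with "col-12".
matches_whole = bool(pattern.search(class_attr))

# Against each class name on its own, exactly one name matches.
matching_names = [name for name in class_attr.split() if pattern.search(name)]

print(matches_whole)   # False
print(matching_names)  # ['post-31121']
```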

Upvotes: 1
