user683742

Reputation: 80

How to parse HTML using BeautifulSoup/Python?

How do I parse the start date and end date values from the following HTML using BeautifulSoup?

<h2 name="PRM-013113-21017-0FSNS" class="pointer">
    <a name="PRM-013113-21017-0FSNS">Chinese New Year Sale<br>
       <span>February 8, 2013 - February 10, 2013</span>
    </a>
</h2>

Upvotes: -1

Views: 234

Answers (1)

Amyth

Reputation: 32949

Something like this should work:

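import re
# BeautifulSoup 3 import; with BeautifulSoup 4, use `from bs4 import BeautifulSoup`
from BeautifulSoup import BeautifulSoup

html = '<h2 name="PRM-013113-21017-0FSNS" class="pointer"><a name="PRM-013113-21017-0FSNS">Chinese New Year Sale<br><span>February 8, 2013 - February 10, 2013</span></a></h2>'
# Grab the first <span> inside the first <h2 class="pointer">
date_span = BeautifulSoup(html).findAll('h2', {'class' : 'pointer'})[0].findAll('span')[0]
# Pull the text out of the serialized <span> tag
date = re.findall(r'<span>(.+?)</span>', str(date_span))[0]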

(PS: You can also pass the text=True argument to findAll to get the text directly instead of using a regex, as follows.)

from BeautifulSoup import BeautifulSoup

html = '<h2 name="PRM-013113-21017-0FSNS" class="pointer"><a name="PRM-013113-21017-0FSNS">Chinese New Year Sale<br><span>February 8, 2013 - February 10, 2013</span></a></h2>'
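# Parse the `html` string defined above and grab the first <span> in the <h2>
date = BeautifulSoup(html).findAll('h2', {'class' : 'pointer'})[0].findAll('span')[0]
# text=True returns the text nodes inside the <span>; take the first one
date = date.findAll(text=True)[0]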

Update:

To get the start and end dates as separate variables, you can simply split the date variable as follows:

from BeautifulSoup import BeautifulSoup

html = '<h2 name="PRM-013113-21017-0FSNS" class="pointer"><a name="PRM-013113-21017-0FSNS">Chinese New Year Sale<br><span>February 8, 2013 - February 10, 2013</span></a></h2>'
date = BeautifulSoup(html).findAll('h2', {'class' : 'pointer'})[0].findAll('span')[0]
date = date.findAll(text=True)[0]
# Get start and end date separately
date_start, date_end = date.split(' - ')

Now the date_start variable contains the start date and the date_end variable contains the end date.
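If you need real date objects rather than strings, the two halves can probably be parsed with the standard library. This is only a sketch: the DATE_FORMAT name and the "%B %d, %Y" pattern are my assumptions based on the sample span text, and %B expects English month names (the default C/English locale).

```python
from datetime import datetime

# Text of the <span>, as extracted from the sample HTML in the question
date = "February 8, 2013 - February 10, 2013"

# Assumed format for strings like "February 8, 2013" (hypothetical name)
DATE_FORMAT = "%B %d, %Y"

# Split once on the separator and trim any stray whitespace
date_start, date_end = [part.strip() for part in date.split(" - ", 1)]

# Convert each half into a datetime.date object
start = datetime.strptime(date_start, DATE_FORMAT).date()
end = datetime.strptime(date_end, DATE_FORMAT).date()

print("%s to %s" % (start.isoformat(), end.isoformat()))
```

With the sample text this prints 2013-02-08 to 2013-02-10. If the site ever changes the date format, strptime raises a ValueError, so you may want to catch that.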

Upvotes: 1
