Reputation: 105
I need to extract the date from each HTML file.
I tried find_siblings('p'), but it returns None.
The date is inside the tags below (usually in the third p tag), but sometimes it is in the first p tag inside id="a-body":
<div class="sa-art article-width" id="a-body" itemprop="articleBody">
<p class="p p1">text1</p>
<p class="p p1">text2</p>
<p class="p p1">
January 6, 2009 8:00 am ET
</p>
..
..
..
</div>
Or it is inside the first p tag, together with other information:
<div class="sa-art article-width" id="a-body" itemprop="articleBody">
<p class="p p1">
participant text1 text2 text3 January 8, 2009 5:00 PM ET
</p>
<p class="p p1">text</p>
<p class="p p1">text</p>
..
..
</div>
My code simply finds the third p, but if the date is inside the first p along with other content, I don't know how to extract it:
import bs4

with open('C:/Users/output1/4069369.html', "r") as fo:
    soup = bs4.BeautifulSoup(fo, "lxml")
d_date = soup.find_all('p')[2]  # third <p> in the document
print(d_date.get_text(strip=True))
Upvotes: 0
Views: 648
Reputation: 4975
It is better to identify a unique common pattern to rely on. If you cannot rely on the tag's attributes, why not use the string? Each date ends with ET, so use that information like this:
# strip() first: the text inside each <p> is surrounded by newlines
tag_dates = soup.find_all(lambda t: str(t.string).strip().endswith('ET'), string=True)
dates = [str(t.string).strip() for t in tag_dates] # list of all dates
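Since both layouts end with a time-zone suffix, a regular expression can pull out only the date portion even when the paragraph starts with other text. This is a minimal sketch under that assumption: DATE_RE and extract_date are hypothetical names (not part of BeautifulSoup), and the pattern only covers the formats shown in this thread.

```python
import re

# Assumed format: "<Month> <day>, <year> <h>:<mm> <am/pm> <ET|EST|EDT>"
DATE_RE = re.compile(
    r"(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)"
    r"\s+\d{1,2},\s+\d{4}\s+\d{1,2}:\d{2}\s*"
    r"(?:[ap]\.?m\.?)\s+E[SD]?T\b",
    re.IGNORECASE,
)

def extract_date(text):
    """Return the first date-like substring in text, or None."""
    match = DATE_RE.search(text)
    return match.group(0) if match else None

print(extract_date("participant text1 text2 text3 January 8, 2009 5:00 PM ET"))
print(extract_date("\nJanuary 6, 2009 8:00 am ET\n"))
```

It could be applied to the get_text() of each p inside id="a-body", keeping the first result that is not None.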
Upvotes: 0
Reputation: 6556
The point is that you have to find the p element that contains the date; then you can work with a list of month names, like this:
from bs4 import BeautifulSoup
div_test='<div class="sa-art article-width" id="a-body" itemprop="articleBody">\
<p class="p p1">text1</p>\
<p class="p p1">\
participant text1 text2 text3 January 8, 2009 5:00 a.m. EST\
</p>\
<p class="p p1">text2</p>\
<p class="p p1">\
January 6, 2009 8:00 pm ET\
</p></div>'
soup = BeautifulSoup(div_test, "lxml")
month_list = ['January','February','March','April','May','June','July','August','September','October','November','December']
def first_date_p():
    # return the text of the first matching <p>, from the month name onward
    for p in soup.find_all('p',{"class":"p p1"}):
        for month in month_list:
            if month in p.get_text():
                text = p.get_text()
                date_start = text.index(month)
                date_text = text[date_start:]
                return date_text
first_date_p()
It will output the date from the first p element that contains a month name, no matter where that element is positioned:
u'January 8, 2009 5:00 a.m. EST'
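If the extracted string is needed as an actual date value, it can be parsed with the standard library. This is a sketch assuming an English locale and the formats shown above; parse_date is a hypothetical helper, and it drops the time-zone abbreviation rather than interpreting it.

```python
import re
from datetime import datetime

def parse_date(date_text):
    """Parse e.g. 'January 8, 2009 5:00 a.m. EST' into a naive datetime."""
    # Drop the trailing zone abbreviation (ET, EST, EDT); it is not interpreted.
    cleaned = re.sub(r"\s+E[SD]?T\s*$", "", date_text.strip())
    # Normalize 'a.m.' / 'p.m.' to 'am' / 'pm' so %p can match.
    cleaned = re.sub(r"([ap])\.m\.", r"\1m", cleaned, flags=re.IGNORECASE)
    return datetime.strptime(cleaned, "%B %d, %Y %I:%M %p")

print(parse_date("January 8, 2009 5:00 a.m. EST"))
```

The result is a naive datetime, so attaching a real time zone would be a separate step.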
Upvotes: 1
Reputation:
From the code provided it's not really clear what is happening, but I guess you are searching from the root of the page. Try whether it works like this:
# search only inside the div with id="a-body"
d_date = soup.find_all('div', { "id" : "a-body" })[0].find_all("p")[0]
print(d_date.get_text(strip=True))
Update:
for page in pages:
    soup = BeautifulSoup(page,'html.parser')
    if soup.find_all("p")[2].get_text():
        d_date = soup.find_all("p")[2]
        print(d_date.get_text(strip=True))
    else:
        d_date = soup.find_all("p")[0]
        print(d_date.get_text(strip=True))
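Note that find_all("p")[2] raises IndexError when a page has fewer than three paragraphs, and a non-empty third paragraph is not necessarily the date. Below is a sketch of a more defensive fallback that works on plain paragraph strings; pick_date_paragraph and the four-digit-year check are assumptions made for illustration, not part of the code above.

```python
import re

def pick_date_paragraph(paragraphs, preferred=(2, 0)):
    """Return the first preferred paragraph that contains a 4-digit year."""
    for index in preferred:
        # Skip positions the page does not have instead of raising IndexError.
        if index < len(paragraphs) and re.search(r"\b\d{4}\b", paragraphs[index]):
            return paragraphs[index].strip()
    return None

print(pick_date_paragraph(["text1", "text2", "January 6, 2009 8:00 am ET"]))
print(pick_date_paragraph(["participant text1 January 8, 2009 5:00 PM ET", "text"]))
```

The list could be built with [p.get_text() for p in soup.find_all("p")] for each page.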
Upvotes: 0