Reputation: 1036
I am trying to use BeautifulSoup to extract text from a particular kind of web page. My code so far is below:
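import re
from urllib.request import urlopen  # Python 3; on Python 2 use urllib.urlopen instead
from bs4 import BeautifulSoup
# Download the page and parse it so that `soup` is defined
html = urlopen('http://english.hani.co.kr/arti/english_edition/e_national/714507.html').read()
soup = BeautifulSoup(html, "html.parser")
content = str(soup.find("div", class_="article-contents"))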
My goal is to extract at least the first sentence, or the first few sentences, of the first paragraph.
Because the paragraphs are not wrapped in <p> tags, my best strategy so far is to find, within content, the text that sits between </h4> and <p> (which happens to be the first paragraph).
Here is what the target markup looks like:
<div class="article-contents">
<div class="article-alignC">
<table class="photo-view-area">
<tr>
<td>
<img alt="" border="0" src="http://img.hani.co.kr/imgdb/resize/2015/1024/00542577201_20151024.JPG" style="width:590px;"/>
</td>
</tr>
</table>
</div>
<h4></h4>
(This is where the content I want to parse is, between </h4> and <p>)
<p align="justify"></p>
I have tried doing this directly with BeautifulSoup and with a regular expression, but neither has worked so far.
Upvotes: 3
Views: 267
Reputation: 473873
Locate the h4 element and find the first text sibling that follows it using find_next_sibling():
h4 = soup.select_one("div.article-contents > h4")
print(h4.find_next_sibling(text=True))
Prints:
US scholar argues that any government attempt to impose single view of history is misguided On Oct. 19, the Hankyoreh’s Washington correspondent conducted on interview with phone and email with William North, chair of the history department at Carleton University in Minnesota. The main topic of the discussion was the efforts of the administration of South Korean President Park Geun-hye to take over the production of history textbooks.
Well, actually, just using .next_sibling is good enough here:
print(h4.next_sibling)
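If you want to try this without fetching the live page, here is a minimal, self-contained sketch. The SAMPLE_HTML string is a made-up, trimmed-down stand-in for the article markup in the question, and first_text_after_h4 is a hypothetical helper name, not part of BeautifulSoup. It assumes the text you want is a bare text node between the h4 and the next p, and it skips whitespace-only nodes, which .next_sibling alone would return if the markup had a blank line right after the h4.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical stand-in for the article markup shown in the question.
SAMPLE_HTML = """
<div class="article-contents">
<div class="article-alignC"><table class="photo-view-area"><tr><td>
<img alt="" src="photo.jpg"/>
</td></tr></table></div>
<h4></h4>
First sentence of the article. Second sentence of the article.
<p align="justify"></p>
</div>
"""


def first_text_after_h4(html):
    """Return the first non-blank text node between the h4 and the next p, or None."""
    soup = BeautifulSoup(html, "html.parser")
    h4 = soup.select_one("div.article-contents > h4")
    if h4 is None:
        return None
    for sibling in h4.next_siblings:
        # Stop once the first <p> is reached: the paragraph text ends there.
        if isinstance(sibling, Tag) and sibling.name == "p":
            break
        # Return the first text node that is not just whitespace.
        if isinstance(sibling, NavigableString) and sibling.strip():
            return sibling.strip()
    return None


text = first_text_after_h4(SAMPLE_HTML)
print(text)
```

On the real page you would pass the downloaded html instead of SAMPLE_HTML. The None checks are there because the selector may not match if the site changes its layout.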
Upvotes: 3