Carl H

Reputation: 1036

Python: parsing texts between keywords

I am trying to use BeautifulSoup to parse text from a particular type of webpage. My code so far is below:

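import urllib
import re
from bs4 import BeautifulSoup

# Python 2 call; on Python 3 use urllib.request.urlopen instead
html = urllib.urlopen('http://english.hani.co.kr/arti/english_edition/e_national/714507.html').read()
# Build the soup object that the next line relies on
soup = BeautifulSoup(html, "html.parser")
content = str(soup.find("div", class_="article-contents"))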

My goal is to extract at least the first sentence, or the first few sentences, of the first paragraph.

Because the paragraphs are not wrapped in <p> tags, my best strategy so far is to find, within content, the text that sits between </h4> and the first <p> (which happens to be the first paragraph).

Here is what the target markup looks like:

<div class="article-contents">
<div class="article-alignC">
<table class="photo-view-area">
<tr>
<td>
<img alt="" border="0" src="http://img.hani.co.kr/imgdb/resize/2015/1024/00542577201_20151024.JPG" style="width:590px;"/>
</td>
</tr>
</table>
</div>
<h4></h4>

(This is where the content I want to parse sits, between </h4> and <p>) <p align="justify"></p>

I have tried to do this directly in BeautifulSoup and with regular expressions, but have been unsuccessful so far.

Upvotes: 3

Views: 267

Answers (1)

alecxe

Reputation: 473873

Locate the h4 element and get the first text node that follows it as a sibling, using find_next_sibling():

h4 = soup.select_one("div.article-contents > h4")
print(h4.find_next_sibling(text=True))

Prints:

US scholar argues that any government attempt to impose single view of history is misguided On Oct. 19, the Hankyoreh’s Washington correspondent conducted on interview with phone and email with William North, chair of the history department at Carleton University in Minnesota. The main topic of the discussion was the efforts of the administration of South Korean President Park Geun-hye to take over the production of history textbooks. 

Well, actually, just using .next_sibling is good enough here:

print(h4.next_sibling)
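
For anyone who wants to try this without fetching the page, here is a minimal self-contained sketch of the same idea. The SAMPLE_HTML string is a made-up stand-in for the real article markup, and the helper names first_text_after_h4 and first_sentence are hypothetical, not part of BeautifulSoup. The sentence split is a naive regex and will cut too early on abbreviations such as "Oct. 19", so treat it as a starting point only.

```python
import re

from bs4 import BeautifulSoup, NavigableString

# Made-up stand-in for the article markup (assumption: the real page has the same shape).
SAMPLE_HTML = """
<div class="article-contents">
<div class="article-alignC"><table class="photo-view-area"><tr><td><img alt="" src="photo.jpg"/></td></tr></table></div>
<h4></h4>

First sentence of the article. Second sentence follows here. <p align="justify"></p>
</div>
"""


def first_text_after_h4(html):
    """Return the first non-blank text node that follows the h4, or None."""
    soup = BeautifulSoup(html, "html.parser")
    h4 = soup.select_one("div.article-contents > h4")
    if h4 is None:
        return None
    # Walk the siblings after <h4> and skip tags and whitespace-only strings.
    for sibling in h4.next_siblings:
        if isinstance(sibling, NavigableString) and sibling.strip():
            return sibling.strip()
    return None


def first_sentence(text):
    """Naive split: cut at the first '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text, maxsplit=1)[0]


if __name__ == "__main__":
    paragraph = first_text_after_h4(SAMPLE_HTML)
    print(paragraph)
    print(first_sentence(paragraph))
```

Looping over next_siblings rather than reading .next_sibling once is a defensive choice: it still works if another tag or a blank string happens to sit directly after the h4.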

Upvotes: 3
