Reputation: 289
I have the piece of HTML below and need to extract only the text between
<p>Current</p> and <p>Archive</p>.
The HTML chunk looks like:
<p>Current</p>
<a href="some link to somewhere 1">File1</a>
<br>
<a href="some link to somewhere 2">File2</a>
<br>
<a href="some link to somewhere 3">File3</a>
<br>
<p>Archive</p>
<a href="Some another link to another file">Some another file</a>
So the desired output should look like File1, File2, File3.
This is what I've tried so far:
import re
m = re.compile('<p>Current</p>(.*?)<p>Archive</p>').search(text)
but it doesn't work as expected.
Is there a simple way to extract the text between specified HTML tags in Python?
Upvotes: 3
Views: 384
Reputation: 1563
from bs4 import BeautifulSoup as bs
html_text = """
<p>Current</p>
<a href="some link to somewhere 1">File1</a>
<br>
<a href="some link to somewhere 2">File2</a>
<br>
<a href="some link to somewhere 3">File3</a>
<br>
<p>Archive</p>
<a href="Some another link to another file">Some another file</a>"""
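# build the parse tree first; "html.parser" is Python's built-in parser
soup = bs(html_text, "html.parser")
a_tag = soup.find_all("a")
text = []
for i in a_tag:
    text.append(i.get_text())  # get_text() must be called on each tag
print(text)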
Output:
['File1', 'File2', 'File3', 'Some another file']
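The BeautifulSoup library is very useful for parsing HTML files and getting text from them. Note that find_all("a") collects every link in the document, including the one after <p>Archive</p>.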
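To keep only the links between the two paragraphs, one option is to start from the <p>Current</p> tag and walk its following siblings until <p>Archive</p> is reached. This is only a sketch: it assumes the markers and links are siblings at the same level, as in the sample chunk, and the names start and names are just illustrative.

```python
from bs4 import BeautifulSoup

html_text = """<p>Current</p>
<a href="some link to somewhere 1">File1</a>
<br>
<a href="some link to somewhere 2">File2</a>
<br>
<a href="some link to somewhere 3">File3</a>
<br>
<p>Archive</p>
<a href="Some another link to another file">Some another file</a>"""

soup = BeautifulSoup(html_text, "html.parser")

# locate the opening marker (assumes it exists exactly as <p>Current</p>)
start = soup.find("p", string="Current")

names = []
for sibling in start.find_next_siblings():
    # stop as soon as the closing marker is reached
    if sibling.name == "p" and sibling.get_text() == "Archive":
        break
    # keep only the link text, skipping <br> and anything else
    if sibling.name == "a":
        names.append(sibling.get_text())

print(names)
```

If the real page nests the links inside other elements, the sibling walk would need adjusting.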
Upvotes: 0
Reputation: 51683
If you insist on using regex, you can combine it with list slicing like so:
chunk="""<p>Current</p>
<a href="some link to somewhere 1">File1</a>
<br>
<a href="some link to somewhere 2">File2</a>
<br>
<a href="some link to somewhere 3">File3</a>
<br>
<p>Archive</p>
<a href="Some another link to another file">Some another file</a>"""
import re
# find all things between > and < the shorter the better
found = re.findall(r">(.+?)<",chunk)
# only use the stuff after "Current" before "Archive"
found[:] = found[ found.index("Current")+1:found.index("Archive")]
print(found) # python 3 syntax, remove () for python2.7
Output:
['File1', 'File2', 'File3']
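As for why the pattern in the question probably found nothing: by default "." does not match a newline, so (.*?) cannot span the lines between the two <p> tags. A minimal sketch of that attempt with re.DOTALL added, followed by a second pattern that pulls out the link text (the names section and names are illustrative, and this assumes markup as simple as the sample):

```python
import re

chunk = """<p>Current</p>
<a href="some link to somewhere 1">File1</a>
<br>
<a href="some link to somewhere 2">File2</a>
<br>
<a href="some link to somewhere 3">File3</a>
<br>
<p>Archive</p>
<a href="Some another link to another file">Some another file</a>"""

# re.DOTALL lets "." match newlines, so the lazy group can span several lines
section = re.search(r"<p>Current</p>(.*?)<p>Archive</p>", chunk, re.DOTALL)

# extract the text of each <a> element from the captured section only
names = re.findall(r"<a\b[^>]*>(.*?)</a>", section.group(1)) if section else []

print(names)
```

Regex is fragile on real-world HTML, so this is best kept for small, predictable snippets.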
Upvotes: 1