Reputation: 330
I want to extract all links http://example.com/1 and ignore all link after 2 <br><br>
tag with beautifulsoup.
<div class="compost">
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index18" class="select_index"></span>text 2</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index19" class="select_index"></span>text 3</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index20" class="select_index"></span>text 4</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index21" class="select_index"></span>text 5</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index22" class="select_index"></span>text 6</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index23" class="select_index"></span>text 7</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index24" class="select_index"></span>text 8</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index25" class="select_index"></span>text 9</a></b>
<br>
<br>
<b><a target="_blank" href="http://example.com/2"><span id="s_index18" class="select_index"></span>text 2</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index19" class="select_index"></span>text 3</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index20" class="select_index"></span>text 4</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index21" class="select_index"></span>text 5</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index22" class="select_index"></span>text 6</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index23" class="select_index"></span>text 7</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index24" class="select_index"></span>text 8</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index25" class="select_index"></span>text 9</a></b>
<br>
<br>
<b><a target="_blank" href="http://example.com/3"><span id="s_index18" class="select_index"></span>text 2</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index19" class="select_index"></span>text 3</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index20" class="select_index"></span>text 4</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index21" class="select_index"></span>text 5</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index22" class="select_index"></span>text 6</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index23" class="select_index"></span>text 7</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index24" class="select_index"></span>text 8</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index25" class="select_index"></span>text 9</a></b>
<br>
<br>
here is part i need to parse:
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index18" class="select_index"></span>text 2</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index19" class="select_index"></span>text 3</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index20" class="select_index"></span>text 4</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index21" class="select_index"></span>text 5</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index22" class="select_index"></span>text 6</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index23" class="select_index"></span>text 7</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index24" class="select_index"></span>text 8</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index25" class="select_index"></span>text 9</a></b>
here is part of my code
for links in obja.find_all("div", class_="compost"):
if links.has_attr('href'):
print links['href']
#
aa = links.findAll('a')[0]
print aa.attrs['href']
txt = []
for i in links.findAll('br'):
txt.append(i.text)
print i.nextSibling
if i.nextSibling.text != u'br':
txt.append(i.nextSibling.text)
''.join(txt)
my script extract all links and i don't know how i can extract just all http://example.com/1 and ignore all links after <br><br>
?
Upvotes: 0
Views: 58
Reputation: 8382
You could just find the first <br><br>
and search for hrefs in only that substring.
Like this:
from bs4 import BeautifulSoup
example = """
<div class="compost">
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index18" class="select_index"></span>text 2</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index19" class="select_index"></span>text 3</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index20" class="select_index"></span>text 4</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index21" class="select_index"></span>text 5</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index22" class="select_index"></span>text 6</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index23" class="select_index"></span>text 7</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index24" class="select_index"></span>text 8</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index25" class="select_index"></span>text 9</a></b>
<br>
<br>
<b><a target="_blank" href="http://example.com/2"><span id="s_index18" class="select_index"></span>text 2</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index19" class="select_index"></span>text 3</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index20" class="select_index"></span>text 4</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index21" class="select_index"></span>text 5</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index22" class="select_index"></span>text 6</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index23" class="select_index"></span>text 7</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index24" class="select_index"></span>text 8</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index25" class="select_index"></span>text 9</a></b>
<br>
<br>
...."""
br_split = example[0: example.index("<br>\n<br>")]
soup = BeautifulSoup(br_split, "html.parser")
print (soup.find_all("a"))
Upvotes: 1