Kate
Kate

Reputation: 330

parse before 2 tag beautifulsoup python

I want to extract all links http://example.com/1 and ignore all link after 2 <br><br> tag with beautifulsoup.

<div class="compost">
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index18" class="select_index"></span>text 2</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index19" class="select_index"></span>text 3</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index20" class="select_index"></span>text 4</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index21" class="select_index"></span>text 5</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index22" class="select_index"></span>text 6</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index23" class="select_index"></span>text 7</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index24" class="select_index"></span>text 8</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index25" class="select_index"></span>text 9</a></b>
<br>
<br>
<b><a target="_blank" href="http://example.com/2"><span id="s_index18" class="select_index"></span>text 2</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index19" class="select_index"></span>text 3</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index20" class="select_index"></span>text 4</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index21" class="select_index"></span>text 5</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index22" class="select_index"></span>text 6</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index23" class="select_index"></span>text 7</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index24" class="select_index"></span>text 8</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index25" class="select_index"></span>text 9</a></b>
<br>
<br>
<b><a target="_blank" href="http://example.com/3"><span id="s_index18" class="select_index"></span>text 2</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index19" class="select_index"></span>text 3</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index20" class="select_index"></span>text 4</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index21" class="select_index"></span>text 5</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index22" class="select_index"></span>text 6</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index23" class="select_index"></span>text 7</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index24" class="select_index"></span>text 8</a></b>
<br><b><a target="_blank" href="http://example.com/3"><span id="s_index25" class="select_index"></span>text 9</a></b>
<br>
<br>

here is part i need to parse:

<br><b><a target="_blank" href="http://example.com/1"><span id="s_index18" class="select_index"></span>text 2</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index19" class="select_index"></span>text 3</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index20" class="select_index"></span>text 4</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index21" class="select_index"></span>text 5</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index22" class="select_index"></span>text 6</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index23" class="select_index"></span>text 7</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index24" class="select_index"></span>text 8</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index25" class="select_index"></span>text 9</a></b>

here is part of my code

    for links in obja.find_all("div", class_="compost"):
        if links.has_attr('href'):
            print links['href']
        #
        aa = links.findAll('a')[0]
        print aa.attrs['href']
        txt = []
        for i in links.findAll('br'):
            txt.append(i.text)
            print i.nextSibling
            if i.nextSibling.text != u'br':
                txt.append(i.nextSibling.text)

        ''.join(txt)

my script extract all links and i don't know how i can extract just all http://example.com/1 and ignore all links after <br><br>?

Upvotes: 0

Views: 58

Answers (1)

Zroq
Zroq

Reputation: 8382

You could just find the first <br><br> and search for hrefs in only that substring.

Like this:

from bs4 import BeautifulSoup

example = """
<div class="compost">
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index18"      class="select_index"></span>text 2</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index19" class="select_index"></span>text 3</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index20" class="select_index"></span>text 4</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index21" class="select_index"></span>text 5</a></b>
 <br><b><a target="_blank" href="http://example.com/1"><span id="s_index22" class="select_index"></span>text 6</a></b>
 <br><b><a target="_blank" href="http://example.com/1"><span id="s_index23" class="select_index"></span>text 7</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index24" class="select_index"></span>text 8</a></b>
<br><b><a target="_blank" href="http://example.com/1"><span id="s_index25" class="select_index"></span>text 9</a></b>
<br>
<br>
<b><a target="_blank" href="http://example.com/2"><span id="s_index18" class="select_index"></span>text 2</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index19" class="select_index"></span>text 3</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index20" class="select_index"></span>text 4</a></b>
 <br><b><a target="_blank" href="http://example.com/2"><span id="s_index21" class="select_index"></span>text 5</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index22" class="select_index"></span>text 6</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index23" class="select_index"></span>text 7</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index24" class="select_index"></span>text 8</a></b>
<br><b><a target="_blank" href="http://example.com/2"><span id="s_index25" class="select_index"></span>text 9</a></b>
<br>
<br>
...."""

br_split = example[0: example.index("<br>\n<br>")]

soup = BeautifulSoup(br_split, "html.parser")

print (soup.find_all("a"))

Outputs: Output

Upvotes: 1

Related Questions