Mansoor Akram

Reputation: 2056

Unable to get only 1 occurrence of links

I have 3 unique links inside anchor tags in my HTML, and each link occurs twice. I am trying to fetch each of the 3 links only once using a Python regex match, but I have been unable to do so.

Here is my HTML:

<html>
    <body>
        <ul class="asidemenu_h1">
            <li class="top">
            <h3>Mobiles</h3>
            </li>
            <li>
                <a href="http://www.mega.pk/mobiles-apple/" title="Apple Mobiles Price">Apple</a>
            </li>
            <li>
                <a href="http://www.mega.pk/mobiles-asus/" title="Asus Mobiles Price">Asus</a>
            </li>
            <li>
                <a href="http://www.mega.pk/mobiles-black_berry/" title="Black Berry Mobiles Price">Black Berry</a>
            </li>
        </ul>

        <ul class="start2" id="start2ul63" style="visibility: hidden; opacity: 0;">
            <li>
            <h3>Mobiles</h3>
                <ul class="start3 bolder-star">
                    <li>
                        <a href="http://www.mega.pk/mobiles-apple/">Apple</a>
                    </li>
                    <li>
                        <a href="http://www.mega.pk/mobiles-asus/">Asus</a>
                    </li>
                    <li>
                        <a href="http://www.mega.pk/mobiles-black_berry/">Black Berry</a>
                    </li>
                </ul>
            </li>
        </ul>
    </body>
</html>

Here is my first approach, using a for loop with a regex match:

for link in soup.find_all("a", href=re.compile(r'(http:\/\/www\.mega\.pk\/mobiles-[A-z]+\/)(?=.*\1)', re.DOTALL)):
    link.get('href')

This returns nothing at all.

Here is my second approach, using a for loop with a regex match:

for link in soup.find_all("a", href=re.compile(r'(http:\/\/www\.mega\.pk\/mobiles-\w+\/)(?!.*\1)', re.UNICODE | re.DOTALL)):
    link.get('href')

This returns all of the links, including the repeated ones.

Upvotes: 1

Views: 58

Answers (1)

alecxe

Reputation: 473763

Get all links whose href contains "mobiles" with a CSS selector:

soup.select("ul.asidemenu_h1 a[href*=mobiles]")

Note that I'm restricting the search to links inside the ul with the asidemenu_h1 class. This alone is enough to avoid the duplicates, because the second copy of each link sits in a different ul. *= here means "contains".
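As a quick check of the selector, here is a minimal, self-contained sketch. It assumes a trimmed copy of the question's HTML (the `html` string below is shortened for illustration) and the built-in `html.parser` backend; the variable names are mine, not from the question.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's HTML: the same three links appear in two lists
html = """
<ul class="asidemenu_h1">
  <li><a href="http://www.mega.pk/mobiles-apple/" title="Apple Mobiles Price">Apple</a></li>
  <li><a href="http://www.mega.pk/mobiles-asus/" title="Asus Mobiles Price">Asus</a></li>
  <li><a href="http://www.mega.pk/mobiles-black_berry/" title="Black Berry Mobiles Price">Black Berry</a></li>
</ul>
<ul class="start2" id="start2ul63">
  <li>
    <ul class="start3 bolder-star">
      <li><a href="http://www.mega.pk/mobiles-apple/">Apple</a></li>
      <li><a href="http://www.mega.pk/mobiles-asus/">Asus</a></li>
      <li><a href="http://www.mega.pk/mobiles-black_berry/">Black Berry</a></li>
    </ul>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Only anchors inside ul.asidemenu_h1 whose href contains "mobiles"
hrefs = [a["href"] for a in soup.select('ul.asidemenu_h1 a[href*="mobiles"]')]
print(hrefs)
```

Without the `ul.asidemenu_h1` prefix, the same selector would pick up all six anchors, which is where the duplicates come from.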


If you insist on using regular expressions to check the href values:

import re

# Limit the search to the first menu, then keep hrefs ending in /mobiles-<name>/
menu = soup.find("ul", class_="asidemenu_h1")
links = menu.find_all("a", href=re.compile(r"mega\.pk/mobiles-[a-zA-Z0-9_-]+/$"))
for link in links:
    print(link.get_text())
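As for why the lookahead patterns in the question behave the way they do: as far as I know, BeautifulSoup tests an href pattern against each attribute value on its own, not against the whole document, so a backreference such as `(?=.*\1)` never gets to see the other copy of the link. The sketch below reproduces that with plain `re`. The `hrefs` list is copied from the question's HTML, and both patterns are simplified to use `\w+` (the first approach in the question used `[A-z]+`), so treat it as an illustration rather than an exact replay.

```python
import re

# href values in document order, copied from the question's HTML
hrefs = [
    "http://www.mega.pk/mobiles-apple/",
    "http://www.mega.pk/mobiles-asus/",
    "http://www.mega.pk/mobiles-black_berry/",
] * 2

base = r"(http://www\.mega\.pk/mobiles-\w+/)"
seen_again = re.compile(base + r"(?=.*\1)", re.DOTALL)      # like approach 1
not_seen_again = re.compile(base + r"(?!.*\1)", re.DOTALL)  # like approach 2

# Each href is searched in isolation, so the URL never "occurs again" after itself
approach1 = [h for h in hrefs if seen_again.search(h)]      # matches nothing
approach2 = [h for h in hrefs if not_seen_again.search(h)]  # matches every href

# If the hrefs have already been collected, an order-preserving
# de-duplication does not need a regex at all
unique = list(dict.fromkeys(hrefs))
print(len(approach1), len(approach2), unique)
```

If restricting the search to one ul is not an option, the `dict.fromkeys` step at the end is one way to drop repeats after collecting every matching href.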

Upvotes: 1
