Yazid Yaakub

Reputation: 57

How can regex help extract only the links that contain certain words?

I am trying to scrape a website and I want to extract only the links that match this pattern: /market_information/announcements/company_announcement/announcement_details?ann_id=

Is it possible to do this with regex? Below is my script, followed by its current output:

title = soup.find_all('tbody')
for i in title:
    for link in i.find_all('a'):
        print(link['href'])

/trade/trading_resources/listing_directory/company-profile?stock_code=7374
/market_information/announcements/company_announcement/announcement_details?ann_id=393738
/trade/trading_resources/listing_directory/company-profile?stock_code=1201
/market_information/announcements/company_announcement/announcement_details?ann_id=393742
/trade/trading_resources/listing_directory/company-profile?stock_code=6874
/market_information/announcements/company_announcement/announcement_details?ann_id=393583
/trade/trading_resources/listing_directory/company-profile?stock_code=4634
/market_information/announcements/company_announcement/announcement_details?ann_id=393572
/trade/trading_resources/listing_directory/company-profile?stock_code=8176
/market_information/announcements/company_announcement/announcement_details?ann_id=393745
/trade/trading_resources/listing_directory/company-profile?stock_code=9474
/market_information/announcements/company_announcement/announcement_details?ann_id=393579
/trade/trading_resources/listing_directory/company-profile?stock_code=4561
/market_information/announcements/company_announcement/announcement_details?ann_id=393743
/trade/trading_resources/listing_directory/company-profile?stock_code=2577
/market_information/announcements/company_announcement/announcement_details?ann_id=393576
/trade/trading_resources/listing_directory/company-profile?stock_code=2984
/market_information/announcements/company_announcement/announcement_details?ann_id=393575
/trade/trading_resources/listing_directory/company-profile?stock_code=2828
/market_information/announcements/company_announcement/announcement_details?ann_id=393739
/trade/trading_resources/listing_directory/company-profile?stock_code=6874
/market_information/announcements/company_announcement/announcement_details?ann_id=393737
/trade/trading_resources/listing_directory/company-profile?stock_code=6181
/market_information/announcements/company_announcement/announcement_details?ann_id=393748
/trade/trading_resources/listing_directory/company-profile?stock_code=2984
/market_information/announcements/company_announcement/announcement_details?ann_id=393582
/trade/trading_resources/listing_directory/company-profile?stock_code=0021
/market_information/announcements/company_announcement/announcement_details?ann_id=393578
/trade/trading_resources/listing_directory/company-profile?stock_code=5028
/market_information/announcements/company_announcement/announcement_details?ann_id=393740
/trade/trading_resources/listing_directory/company-profile?stock_code=6246
/market_information/announcements/company_announcement/announcement_details?ann_id=393573
/trade/trading_resources/listing_directory/company-profile?stock_code=1201
/market_information/announcements/company_announcement/announcement_details?ann_id=393571
/trade/trading_resources/listing_directory/company-profile?stock_code=7143
/market_information/announcements/company_announcement/announcement_details?ann_id=393577
/trade/trading_resources/listing_directory/company-profile?stock_code=0091
/market_information/announcements/company_announcement/announcement_details?ann_id=393747
/trade/trading_resources/listing_directory/company-profile?stock_code=7722
/market_information/announcements/company_announcement/announcement_details?ann_id=393581
/media-releases-rss.rss

Upvotes: 0

Views: 43

Answers (2)

DevGuyAhnaf

Reputation: 149

Instead of using regex for something like this, it would be simpler to use the in operator to check whether the link's href contains the substring.

You can do something like:

substring = "/market_information/announcements/company_announcement/announcement_details?ann_id="

title = soup.find_all('tbody')
for i in title:
    for link in i.find_all('a'):
        if substring in link.get('href', ''):  # test the href string, not the Tag object
            print(link['href'])  # This is the link that contained that substring
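
As a minimal, self-contained sketch of the same idea, the filter can be tried on plain strings without fetching the page. The hrefs list below is a hypothetical sample copied from the question's output; in the real script those values would come from each tag's href attribute.

```python
# Sample hrefs copied from the question's output (a hypothetical stand-in
# for the values read from each <a> tag's href attribute).
hrefs = [
    "/trade/trading_resources/listing_directory/company-profile?stock_code=7374",
    "/market_information/announcements/company_announcement/announcement_details?ann_id=393738",
    "/trade/trading_resources/listing_directory/company-profile?stock_code=1201",
    "/market_information/announcements/company_announcement/announcement_details?ann_id=393742",
    "/media-releases-rss.rss",
]

substring = "/market_information/announcements/company_announcement/announcement_details?ann_id="

# Plain substring test: the "?" is matched literally, so nothing needs escaping.
announcement_links = [href for href in hrefs if substring in href]

for href in announcement_links:
    print(href)
```

Because in does a literal text comparison, the ? in the pattern needs no special handling, which is the main reason to prefer it over regex here.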

Upvotes: 0

Hammurabi

Reputation: 1179

You can use regex; just escape the ? symbol as \? and use regex101.com to check your pattern.

import re

links = ['/trade/trading_resources/listing_directory/company-profile?stock_code=7374/market_information/announcements/company_announcement/announcement_details?ann_id=393738',
         'some_other_link']

for link in links:
    # Raw string so that \? reaches the regex engine as an escaped, literal "?"
    if re.search(r'/market_information/announcements/company_announcement/announcement_details\?ann_id=', link):
        use_this_link = True
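
If the announcement id itself is needed, one possible extension is to add a capture group to the same pattern. This is only a sketch: the (\d+) group and the hrefs sample list are assumptions added for illustration, not part of the original script.

```python
import re

# Same pattern as above, written as a raw string, with \? matching a literal "?".
# The trailing (\d+) group is an illustrative addition that captures the id digits.
pattern = re.compile(
    r"/market_information/announcements/company_announcement/announcement_details\?ann_id=(\d+)"
)

# Hypothetical sample of hrefs, copied from the question's output.
hrefs = [
    "/trade/trading_resources/listing_directory/company-profile?stock_code=7374",
    "/market_information/announcements/company_announcement/announcement_details?ann_id=393738",
    "/trade/trading_resources/listing_directory/company-profile?stock_code=1201",
    "/market_information/announcements/company_announcement/announcement_details?ann_id=393742",
    "/media-releases-rss.rss",
]

# Collect the captured id from every href that matches the pattern.
ann_ids = []
for href in hrefs:
    match = pattern.search(href)
    if match:
        ann_ids.append(match.group(1))

print(ann_ids)
```

BeautifulSoup's find_all also accepts a compiled pattern as an attribute filter, for example i.find_all('a', href=pattern), which would apply the match while searching instead of afterwards.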

Upvotes: 1
