kabp
kabp

Reputation: 73

Finding urls containing a specific string

I haven't used RegEx before, and everyone seems to agree that it's bad for webscraping and html in particular, but I'm not really sure how to solve my little challenge without.

I have a small Python scraper that opens 24 different webpages. In each webpage, there's links to other webpages. I want to make a simple solution that gets the links that I need and even though the webpages are somewhat similar, the links that I want are not.

The only common thing between the urls seems to be a specific string: 'uge' or 'Uge' (uge means week in Danish - and the week number changes every week, duh). It's not like the urls have a common ID or something like that I could use to target the correct ones each time.

I figure it would be possible using RegEx to go through the webpage and find all urls that has 'uge' or 'Uge' in them and then open them. But is there a way to do that using BS? And if I do it using RegEx, how would a possible solution look like?

For example, here are two of the urls I want to grab in different webpages:

http://www.domstol.dk/KobenhavnsByret/retslister/Pages/Uge45-Tvangsauktioner.aspx

http://www.domstol.dk/esbjerg/retslister/Pages/Straffesageruge32.aspx

Upvotes: 0

Views: 4018

Answers (3)

schesis
schesis

Reputation: 59198

Yes, you can do this with BeautifulSoup.

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_string)
# To find just 'Uge##' or 'uge##', as specified in the question:
urls = [el["href"] for el in soup.findAll("a", href=re.compile("[Uu]ge\d+"))]
# To find without regard to case at all:
urls = [el["href"] for el in soup.findAll("a", href=re.compile("(?i)uge\d+"))]

Upvotes: 1

CoffeeRain
CoffeeRain

Reputation: 4522

This should work... The RegEx uge\d\d? tells it to find "uge" followed by a digit, and possibly another one.

import re
for item in listofurls:
  l = re.findall("uge\d\d?", item, re.IGNORECASE):
  if l:
    print item #just do whatever you want to do when it finds it

Upvotes: 2

Willy
Willy

Reputation: 635

Or just use a simple for loop:

list_of_urls = ["""LIST GOES HERE"""]
for url in list_of_urls:
    if 'uge' in url.lower():
        # Code to execute

The regex expression would look something like: uge\d\d

Upvotes: 1

Related Questions