ParisNakitaKejser

Reputation: 14979

Regex problem: need to match part of a URL

I'm trying to search HTML pages in Python.

I need to find all links in a page that match a certain pattern, and then return the whole URL of each match.

My links look like this: href="http://example.com/page/subpage/unik-id-12345", and I have tried to write a small regex to extract one.

href\=\"(.*)\">

The problem is that it captures everything inside, and I can't work out how to match only on something inside the href attribute.

I hope this makes sense and that you can help me fix this issue.

What I want is to search for, e.g., example.com/page

Upvotes: 0

Views: 112

Answers (3)

user2201041

Are you aware of regex101.com? It's a great tool for tweaking your regexes.

If I understand your problem right, you're matching href="http://example.com/page/subpage/unik-id-12345">, and you want to just get http://example.com/page/subpage/unik-id-12345

One way would be to just grab http(s)://, followed by anything that's not a quotation mark: http(s?):\/\/[^"]*
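A minimal sketch of that idea, assuming a made-up HTML snippet (the second link is hypothetical). The group around the "s" is dropped here because re.findall returns only the group contents when the pattern contains a capture group:

```python
import re

# Hypothetical sample HTML; only the first URL comes from the question.
html = (
    '<a href="http://example.com/page/subpage/unik-id-12345">first</a> '
    '<a href="https://example.org/other">second</a>'
)

# "https?" matches http or https; [^"]* stops before the closing quote.
url_pattern = re.compile(r'https?://[^"]*')

urls = url_pattern.findall(html)
print(urls)
```

Note that this would also pick up URLs outside href attributes, for example in src attributes or in plain text.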

If you have multiple links and only want the ones inside an href attribute, you'd probably have to use your regex first and then apply further operations to extract just the URL (e.g. match.split("\"")[1]).
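Roughly, that two-step approach could look like this. It is only a sketch: the sample HTML and the variable names are assumptions, and the pattern is narrowed to [^"]* so each match ends at its own closing quote:

```python
import re

# Hypothetical sample HTML with one link and one image.
html = (
    '<a href="http://example.com/page/subpage/unik-id-12345">first</a> '
    '<img src="http://example.com/logo.png">'
)

# Step 1: match each whole href="..." attribute.
href_matches = re.findall(r'href="[^"]*"', html)

# Step 2: splitting on the quote gives ['href=', '<url>', ''],
# so index 1 is the URL itself.
href_urls = [match.split('"')[1] for match in href_matches]
print(href_urls)
```

The image URL is skipped because it sits in a src attribute, not an href.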

Or you could just use an HTML parser like BeautifulSoup.
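As an illustration of the parser route, here is a sketch that uses the standard library's html.parser in place of BeautifulSoup, so it runs without extra installs. The class name HrefCollector, the needle parameter and the sample HTML are all made up for this example:

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects href values of <a> tags that contain a given substring."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current start tag.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and self.needle in value:
                self.links.append(value)


# Hypothetical sample HTML.
html = (
    '<a href="http://example.com/page/subpage/unik-id-12345">wanted</a> '
    '<a href="http://example.com/about">skipped</a>'
)

collector = HrefCollector("example.com/page")
collector.feed(html)
print(collector.links)
```

A parser copes with attribute order, single quotes and extra whitespace, which a hand-written regex would have to handle case by case.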

Upvotes: 1

Digisec

Reputation: 710

import re

# Capture everything between the quotes of the href attribute.
regex = re.compile(r'<href="(.*)">')
url = '<href="https://stackoverflow.com/">'
m = regex.search(url)

Then you can read the groups (group 0 is the whole match, group 1 is the captured URL):

>>> m.group(0)
'<href="https://stackoverflow.com/">'
>>> m.group(1)
'https://stackoverflow.com/'
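One caveat, shown as a small sketch with a made-up line containing two links: (.*) is greedy, so with several links on one line it runs to the last "> it can find, which is likely the "takes everything" behaviour described in the question. A non-greedy (.*?) stops at the first one:

```python
import re

# Hypothetical line with two links.
line = (
    '<a href="http://example.com/page/a">A</a> '
    '<a href="http://example.com/page/b">B</a>'
)

# Greedy: one match that swallows everything up to the last ">.
greedy = re.findall(r'href="(.*)">', line)

# Non-greedy: one match per link, each ending at the nearest ">.
lazy = re.findall(r'href="(.*?)">', line)

print(len(greedy), greedy)
print(len(lazy), lazy)
```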

PS: If you are trying to do web scraping, it would be easier to use a library designed for that, such as BeautifulSoup. Tutorials on how to use it are easy to find on the web.

Upvotes: 1

drjackild

Reputation: 472

import re

s = 'href="http://example.com/page/subpage/unik-id-12345">'
# (.+?) is non-greedy, so the capture stops at the first ">.
res = re.search(r'href="(.+?)">', s).group(1)
print(res)
# Output: http://example.com/page/subpage/unik-id-12345
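To return only the links that contain example.com/page, as the question asks, the pattern could be narrowed along these lines. This is a sketch under assumptions: the sample HTML and the names wanted and page_links are invented, and it expects double-quoted href values:

```python
import re

# Hypothetical sample HTML; only the first link should be returned.
html = (
    '<a href="http://example.com/page/subpage/unik-id-12345">wanted</a> '
    '<a href="http://example.com/about">skipped</a>'
)

# re.escape keeps the dot in "example.com" literal instead of "any character".
wanted = re.escape("example.com/page")

# [^"]* on both sides keeps the match inside a single href value,
# and the group captures the whole URL.
pattern = re.compile(r'href="([^"]*' + wanted + r'[^"]*)"')

page_links = pattern.findall(html)
print(page_links)
```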

By the way, it is better to use a dedicated library, such as lxml, for HTML parsing.

Upvotes: 3
