user2002214

Reputation: 1

Python Regex Tokenize

I'm trying to figure out how to use regular expressions in Python to extract certain URLs from strings. For example, I might have 'blahblahblah (a href="example.com")'. In this case I want to extract all "example.com"-style links. How can I do that instead of just splitting the string?

Thanks!

Upvotes: 0

Views: 723

Answers (3)

kiriloff

Reputation: 26333

Do not use regex:

Regex should not be your first thought when dealing with HTML or XML (or URLs): these formats are nested and irregular, and a dedicated parser handles them more reliably.

If you wish to use regex anyway:

Several patterns will do the job, and there are several ways to fetch the strings you want.

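These patterns do the job on your example:

r'\(a href="(.*?)"\)'

r'\(a href="(.*)"\)'

r'\(a href="(.+?)"\)'

Note that the greedy form (.*) only works when the string contains a single link; with several links it captures everything between the first and the last one, so prefer the lazy forms.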

1. re.findall()

re.findall(pattern, string, flags=0) 

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.

import re
st = 'blahblahblah (a href="example.com") another bla <a href="polymer.edu">'
re.findall(r'\(a href="(.+?)"\)', st)
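The documentation's remark about groups is easier to see side by side. The following is a minimal sketch, not part of the original answer; the sample string and the two-group pattern are made up for illustration.

```python
import re

# Hypothetical sample text with two parenthesized links.
text = 'blah (a href="example.com") more (a href="polymer.edu")'

# One group in the pattern: findall returns a flat list of strings.
one_group = re.findall(r'\(a href="(.+?)"\)', text)

# Two groups in the pattern: findall returns a list of tuples,
# one tuple per match, one item per group.
two_groups = re.findall(r'\(a href="(.+?)\.(\w+)"\)', text)

print(one_group)   # ['example.com', 'polymer.edu']
print(two_groups)  # [('example', 'com'), ('polymer', 'edu')]
```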

2. re.search()

re.search(pattern, string, flags=0)

Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding MatchObject instance.

Then use the match object's group() method to step through the groups. For instance, the regex r'\(a href="(.+?(.).+?)"\)', which also works here, has nested groups: group 0 is the match for the whole pattern, group 1 is the match for the first parenthesized sub-pattern, (.+?(.).+?), and group 2 is the match for the inner (.).

Use search when you only need the first occurrence of the pattern. With your example this would be

>>> st = 'blahblahblah (a href="example.com") another bla (a href="polymer.edu")'
>>> m=re.search(r'\(a href="(.+?(.).+?)"\)', st)
>>> m.group(1)
'example.com'

Upvotes: 0

mac

Reputation: 43031

Regular expressions are very powerful tools, but they might not be the right tool in all circumstances (as others have suggested already). That said, here's a minimal example from the console that uses regex, as requested:

>>> import re
>>> s = 'blahblahblah (a href="example.com") another bla <a href="subdomain.example2.net">'
>>> re.findall(r'a href="(.*?)"', s)
['example.com', 'subdomain.example2.net']

Focus on r'a href="(.*?)"'. In English it translates to: find a string beginning with a href=" and then save as a result every character until you hit the next ". The syntax is:

  • the () means "save only stuff in here"
  • the . means "any character"
  • the * means "any number of times"
  • the ? means "non greedy" or in other terms: find the shortest string that satisfies the requirements (try without the question mark and you will see what happens).
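To see what the question mark changes, here is a short sketch that runs the same string through the lazy pattern and its greedy twin. The third pattern, using a negated character class, is an alternative added here for comparison and was not part of the original answer.

```python
import re

s = 'blahblahblah (a href="example.com") another bla <a href="subdomain.example2.net">'

# Lazy: stop at the first closing quote after each href.
lazy = re.findall(r'a href="(.*?)"', s)

# Greedy: run to the LAST closing quote in the string,
# swallowing everything between the two links.
greedy = re.findall(r'a href="(.*)"', s)

# Alternative (an assumption, not from the answer): match anything
# that is not a quote, which cannot overrun the closing quote.
negated = re.findall(r'a href="([^"]*)"', s)

print(lazy)
print(greedy)
print(negated)
```

The greedy version returns a single, overlong string that starts at example.com and ends at subdomain.example2.net, which is why the non-greedy form is the one to use here.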

HTH!

Upvotes: 0

TerryA

Reputation: 59974

There is a module called BeautifulSoup (link: http://www.crummy.com/software/BeautifulSoup/) which is great for parsing HTML. You should use it instead of regex to get info from HTML. Here's an example of BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> html = """<p> some <a href="http://link.com">HTML</a> and <a href="http://second.com">another link</a></p>"""
>>> soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
>>> mylist = soup.find_all('a')
>>> for link in mylist:
...    print(link['href'])
http://link.com
http://second.com

Here is a link to the documentation, which is really easy to follow: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
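If installing a third-party package is not an option, the same idea can be sketched with the standard library's html.parser module. This swaps in a different tool than BeautifulSoup and is only a rough equivalent; the class name HrefCollector is made up for this example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)


html = ('<p> some <a href="http://link.com">HTML</a> and '
        '<a href="http://second.com">another link</a></p>')

collector = HrefCollector()
collector.feed(html)
collector.close()
print(collector.hrefs)
```

Like BeautifulSoup, this only sees real HTML tags, so it will not pick up the parenthesized (a href="...") form from the question.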

Upvotes: 1
