Trent

Reputation: 1275

Python regular expressions matching within set

While testing on http://gskinner.com/RegExr/ (an online regex tester), the regex [jpg|bmp] returns results when either jpg or bmp exists. However, when I run this regex in Python, it only returns j or b. How do I make the regex match the whole word "jpg" or "bmp" inside the set? This may have been asked before, but I was not sure how to phrase the question to find the answer. Thanks!

Here is the whole regex if it helps

"http://www\S*(?i)\\.(jpg|bmp|png|gif|img|jng|jpeg|jpe|gif|giff)"

It's basically just to look for pictures in a URL.

Upvotes: 6

Views: 13342

Answers (3)

darnzen

Reputation: 81

If you are searching a list of URLs

urls = [ 'http://some.link.com/path/to/file.jpg',
         'http://some.link.com/path/to/another.png',
         'http://and.another.place.com/path/to/not-image.txt',
       ]

to find ones that match a given pattern you can use:

import re
for url in urls:
    # re.match needs the string to test as its second argument
    if re.match(r'http://.*(jpg|png|gif)$', url):
        print(url)

which will output

http://some.link.com/path/to/file.jpg
http://some.link.com/path/to/another.png

re.match() will test for a match at the start of the string and return a match object for the first two links, and None for the third.

If you want just the extension, you can use the following:

for url in urls:
    m = re.match(r'http://.*(jpg|png|gif)$', url)
    if m:  # m is None for URLs that do not match
        print(m.groups())

which will print

('jpg',)
('png',)

You will get just the extensions because that's what was defined as a group.
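As a small sketch of how the group numbering works on one of the URLs above: group(0) is always the whole match, group(1) is the first parenthesized group, and groups() returns all parenthesized groups as a tuple.

```python
import re

url = 'http://some.link.com/path/to/file.jpg'
m = re.match(r'http://.*(jpg|png|gif)$', url)

whole_match = m.group(0)   # the entire matched text, here the full URL
extension = m.group(1)     # only the first parenthesized group
all_groups = m.groups()    # tuple of every parenthesized group

print(whole_match)
print(extension)
print(all_groups)
```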

If you need to find the URL in a long string of text (such as the output returned from wget), you need to use re.search() and enclose the parts you are interested in in parentheses. For example,

response = """dlkjkd dkjfadlfjkd fkdfl kadfjlkadfald ljkdskdfkl adfdf
    kjakldjflkhttp://some.url.com/path/to/file.jpgkaksdj fkdjakjflakdjfad;kadj af
    kdlfjd dkkf aldfkaklfakldfkja df"""

reg = re.search(r'(http:.*/(.*\.(jpg|png|gif)))', response)

print(reg.groups())

will print

('http://some.url.com/path/to/file.jpg', 'file.jpg', 'jpg')

Alternatively, you can use re.findall or re.finditer in place of re.search to get all of the URLs in the long response. re.search will only return the first one.
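A rough sketch of the findall / finditer variant. The sample text and the pattern here are made up for illustration. It assumes \S*? rather than .* so that each match stops at the first image extension, and a non-capturing group (?:...) so that findall returns whole URLs rather than only the extension.

```python
import re

# Hypothetical sample text containing two image URLs.
text = "see http://a.example/x.jpg and http://b.example/y.png end"

# No capturing groups, so findall returns the full matched strings.
pattern = r'http://\S*?\.(?:jpg|png|gif)'

all_urls = re.findall(pattern, text)

# finditer yields match objects, useful when positions are needed too.
positions = [(m.start(), m.group(0)) for m in re.finditer(pattern, text)]

print(all_urls)
print(positions)
```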

Upvotes: 0

stema

Reputation: 93026

When you use [], you are creating a character class that contains all the characters between the brackets.

So you are not matching jpg or bmp; you are matching either a j or a p or a g or a | ...
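To see the difference in Python, here is a small sketch; the sample text is made up.

```python
import re

text = "x.jpg y.bmp"

# A character class matches ONE character out of the set j, p, g, |, b, m.
class_hits = re.findall(r"[jpg|bmp]", text)

# A group with alternation matches the whole word "jpg" or "bmp".
word_hits = re.findall(r"(jpg|bmp)", text)

print(class_hits)
print(word_hits)
```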

You should add an anchor for the end of the string to your regex, to ensure that it checks for the file extension at the very end of the string:

http://www\S*(?i)\\.(jpg|bmp|png|gif|img|jng|jpeg|jpe|gif|giff)$
          ^      ^^

The escaping is also inconsistent (see the markers above). If you need double escaping, then use it everywhere in your pattern:

http://www\\S*(?i)\\.(jpg|bmp|png|gif|img|jng|jpeg|jpe|gif|giff)$
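A minimal sketch of the anchored pattern in Python. It assumes a raw string (so no double escaping is needed) and passes re.IGNORECASE instead of the inline (?i), because Python 3.11 and later reject an inline global flag that is not at the start of the pattern. The name image_extension and the example.com URLs are made up for illustration.

```python
import re

# Raw string: backslashes reach the regex engine unchanged.
IMAGE_URL = re.compile(
    r"http://www\S*\.(jpg|bmp|png|gif|img|jng|jpeg|jpe|giff)$",
    re.IGNORECASE,
)

def image_extension(url):
    """Return the image extension of url, or None if it is not an image URL."""
    m = IMAGE_URL.match(url)
    return m.group(1) if m else None

print(image_extension("http://www.example.com/a/b.JPG"))
print(image_extension("http://www.example.com/pic.jpg.txt"))
```

Without the trailing $, the second URL would also count as an image, since .jpg appears in the middle of it.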

Upvotes: 3

MByD

Reputation: 137382

Use (jpg|bmp) instead of square brackets.

Square brackets mean "match one character from the set inside the brackets".

Edit: you might want something like this: [^ ].*?(jpg|bmp) or [^ ].*?\.(jpg|bmp)

Upvotes: 5
