juliomalegria

Reputation: 24921

Searching for image files with regular expressions

I have a text file that looks like this:

[22/Nov/2011 12:57:58] "GET /media/js/jquery-1.4.3.min.js HTTP/1.1" 304 0
[22/Nov/2011 12:57:58] "GET /media/js/fancybox/fancybox-x.png HTTP/1.1" 304 0
[22/Nov/2011 12:57:59] "GET /media/js/fancybox/fancybox-y.png HTTP/1.1" 304 0
[22/Nov/2011 12:57:59] "GET /media/js/fancybox/blank.gif HTTP/1.1" 304 0
[22/Nov/2011 12:57:59] "GET /ajax/pages/erlebnisse/ HTTP/1.1" 200 563
[22/Nov/2011 12:58:00] "GET /erlebnisse/alle-erlebnisse/ HTTP/1.1" 200 17114

I want to use regular expressions to get all the image files (.gif, .jpg, .png) that appear here. So the result from the text above should be:

['fancybox-x.png', 'fancybox-y.png', 'blank.gif']

What I did was:

re.findall(r'\w+\.(jpg|gif|png)', f.read())

So the pattern is:

1 or more word-characters (\w+) followed by a dot (\.) and then 'jpg', 'gif' or 'png' (jpg|gif|png).

This matches the right file names, but re.findall() treats the parentheses (which I'm using only for "grouping") as capturing group 1, so the result is:

['png', 'png', 'gif']

Which is right, but incomplete. In other words, how can I make re.findall() distinguish between parentheses used only for "grouping" and parentheses that define capturing groups?

Upvotes: 2

Views: 7190

Answers (3)

Andrew Walker

Reputation: 42490

You're looking for the non-capturing version of regular parentheses, (?:...). It is described in the re module docs.

s ='''[22/Nov/2011 12:57:58] "GET /media/js/jquery-1.4.3.min.js HTTP/1.1" 304 0
[22/Nov/2011 12:57:58] "GET /media/js/fancybox/fancybox-x.png HTTP/1.1" 304 0
[22/Nov/2011 12:57:59] "GET /media/js/fancybox/fancybox-y.png HTTP/1.1" 304 0
[22/Nov/2011 12:57:59] "GET /media/js/fancybox/blank.gif HTTP/1.1" 304 0
[22/Nov/2011 12:57:59] "GET /ajax/pages/erlebnisse/ HTTP/1.1" 200 563
[22/Nov/2011 12:58:00] "GET /erlebnisse/alle-erlebnisse/ HTTP/1.1" 200 17114'''

import re

# The outer parentheses capture the whole file name; (?:...) only groups the extensions.
for m in re.findall(r'([-\w]+\.(?:jpg|gif|png))', s):
    print(m)

Upvotes: 3

Chen Xing

Reputation: 1705

You can add another pair of parentheses around the whole file name and put ?: at the start of the inner pair so that it no longer captures:

re.findall(r'/([^/]+\.(?:jpg|gif|png))', f.read())

Note that \w won't match "-", so I would suggest [^/]+
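
To make the difference concrete, here is a minimal sketch comparing the two character classes on one of the log lines from the question. The variable names are only for illustration.

```python
import re

line = '"GET /media/js/fancybox/fancybox-x.png HTTP/1.1" 304 0'

# \w+ cannot cross the hyphen, so only the tail of the name is matched.
word_only = re.findall(r'\w+\.(?:jpg|gif|png)', line)

# [^/]+ takes everything after the last slash, hyphen included.
no_slash = re.findall(r'/([^/]+\.(?:jpg|gif|png))', line)

print(word_only)  # ['x.png']
print(no_slash)   # ['fancybox-x.png']
```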

Upvotes: 3

Godwin

Reputation: 9927

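If you're looking for the entire match, it is available as group 0 of a match object (for example, one produced by re.finditer()). Note that re.findall() returns only the groups once the pattern contains any. Otherwise, you can add extra parentheses around whichever other part of the string you want.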
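
As a rough sketch of the group 0 approach, assuming you switch from re.findall() to re.finditer() so that you get match objects back: the pattern below keeps the capturing parentheses from the question but uses [-\w]+ (borrowed from another answer, not from the question) so that hyphenated names are matched in full. The two-line log sample is abbreviated from the question.

```python
import re

log = (
    '[22/Nov/2011 12:57:58] "GET /media/js/fancybox/fancybox-x.png HTTP/1.1" 304 0\n'
    '[22/Nov/2011 12:57:59] "GET /media/js/fancybox/blank.gif HTTP/1.1" 304 0\n'
)

pattern = re.compile(r'[-\w]+\.(jpg|gif|png)')

# findall returns only group 1 because the pattern has one capturing group.
extensions = pattern.findall(log)

# finditer yields match objects; group(0) is the entire match.
names = [m.group(0) for m in pattern.finditer(log)]

print(extensions)  # ['png', 'gif']
print(names)       # ['fancybox-x.png', 'blank.gif']
```

This leaves the pattern's groups untouched, which may be handy if you also want the extension via m.group(1).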

Upvotes: 0
