BBedit

Reputation: 8067

How to combine multiple regular expressions into one line?

My script works fine doing this:

import re

images = re.findall(r"src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", doc)
videos = re.findall(r"\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)", doc)

However, I believe it is inefficient to search through the whole document twice.

Here's a sample document if it helps: http://pastebin.com/5kRZXjij

I would expect the following output from the above:

images = http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg
videos = http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl

Instead it would be better to do something like:

image_and_video_links = re.findall(" <match-image-links-or-video links> ", doc)

How can I combine the two re.findall lines into one?

I have tried using the | character, but I never match anything, so I must be misunderstanding how to use it.

Upvotes: 2

Views: 8609

Answers (2)

Paedolos

Reputation: 280

As mentioned in the comments, a pipe (|) should do the trick.

The regular expression

(src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg))|(\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*))

catches either of the two patterns.
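One thing to watch for with this pattern: it contains four capture groups, so re.findall returns a 4-tuple per match rather than a plain string. The sketch below shows one way to split the results back into images and videos. The doc string is a hypothetical stand-in built from the URLs in the question, not the real page source.

```python
import re

# Hypothetical sample text; the real document is the one linked in the question.
doc = (
    '<img src="http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg"> '
    '<source src="http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl">'
)

COMBINED = re.compile(
    r'(src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg))'
    r'|(\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*))'
)

# Each element is a tuple (group1, group2, group3, group4);
# groups from the alternative that did not match are empty strings.
matches = COMBINED.findall(doc)

images = [m[1] for m in matches if m[1]]  # group 2: the image URL
videos = [m[3] for m in matches if m[3]]  # group 4: the video URL
```

The outer groups 1 and 3 are only there to delimit the two alternatives, which is why only groups 2 and 4 are read.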

Demo on Regex Tester

Upvotes: 6

zx81

Reputation: 41848

If you really want efficiency...

For starters, I would cut out the leading \S*? in the second regex. It serves no purpose, because the engine already tries the pattern at every position in the text, and it only creates an opportunity for lots of backtracking.

src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)|(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)
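A minimal sketch of using this two-group pattern in a single pass, assuming you still want separate image and video lists. It uses re.finditer so that the group that did not take part in a match shows up as None. The doc string is a hypothetical sample built from the question's expected output.

```python
import re

# Hypothetical sample text; substitute the real page source.
doc = (
    '<img src="http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg"> '
    '<source src="http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl">'
)

PATTERN = re.compile(
    r'src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)'
    r'|(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)'
)

images, videos = [], []
for m in PATTERN.finditer(doc):
    if m.group(1) is not None:
        images.append(m.group(1))  # first alternative matched
    else:
        videos.append(m.group(2))  # second alternative matched
```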

Other ideas

You can get rid of the capture groups by using a small lookbehind in the first one, allowing you to get rid of all parentheses and directly matching what you want. Not faster, but tidier:

(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*
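With no capture groups left, re.findall returns the matched text itself, which gives the single flat list the question asked for. Python's re module only accepts fixed-width lookbehinds; this one qualifies because src.\" is always five characters. A small sketch, again on a hypothetical sample string:

```python
import re

# Hypothetical sample text; substitute the real page source.
doc = (
    '<img src="http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg"> '
    '<source src="http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl">'
)

PATTERN = re.compile(
    r'(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg'
    r'|http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*'
)

# No groups in the pattern, so findall yields whole matches as strings.
image_and_video_links = PATTERN.findall(doc)
```

The trade-off is that the list no longer tells you which alternative produced each link.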

Do you intend for the periods after src and media to mean "any character", or to mean "a literal period"? If the latter, escape them: \.
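To illustrate the difference, here is a small sketch comparing the unescaped and escaped forms on two made-up strings. Note that in the original pattern the period after src presumably stands in for the "=" of src="...", so that one would have to stay unescaped (or become a literal =); that reading is an assumption based on typical HTML.

```python
import re

loose = re.compile(r'media.tumblr')    # "." matches any single character
strict = re.compile(r'media\.tumblr')  # "\." matches only a literal period

samples = ('media.tumblr', 'mediaXtumblr')  # made-up test strings

loose_hits = [bool(loose.search(s)) for s in samples]
strict_hits = [bool(strict.search(s)) for s in samples]
```

Here loose_hits is [True, True] while strict_hits is [True, False].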

You can use the re.IGNORECASE option and drop the A-Z range from the character class:

(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-z0-9]*
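The flag matters here because the character class now lists only lowercase letters. A quick sketch of the same pattern with and without re.IGNORECASE, run on a hypothetical one-tag sample, shows the video link being cut short at the first uppercase letter when the flag is missing:

```python
import re

# Hypothetical sample text containing only the video link from the question.
doc = '<source src="http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl">'

pattern = (
    r'(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg'
    r'|http\S*?video_file\S*?tumblr_[a-z0-9]*'
)

case_sensitive = re.findall(pattern, doc)                    # stops before "CWSP..."
case_insensitive = re.findall(pattern, doc, re.IGNORECASE)   # matches the full link
```

Keep in mind that the flag applies to the whole pattern, so literals such as src and http will also match in upper case.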

Upvotes: 1
