Brian

Reputation: 14836

Python: speed up a regex for extracting substrings

I have the following text

text = "This is a string with C1234567 and CM123456, CM123, F1234567 and also M1234, M123456"

And I would like to extract this list of substrings

['C1234567', 'CM123456', 'F1234567']

This is what I came up with

import re
new_string = re.compile(r'\b(C[M0-9]\d{6}|[FM]\d{7})\b')
new_string.findall(text)

However, I was wondering if there's a way to do this faster since I'm interested in performing this operation tens of thousands of times.

I thought I could use ^ to match the beginning of string, but the regex I came up with

new_string = re.compile(r'\b(^C[M0-9]\d{6}|^[FM]\d{7})\b')

doesn't return anything anymore. I know this is a very basic question, but I'm not sure how to use ^ properly.

Upvotes: 1

Views: 490

Answers (1)

sniperd

Reputation: 5274

Good news and bad news. The bad news: your regex already looks pretty good, so it is going to be hard to improve. The good news: I have some ideas :) If you are looking for performance, I would try a little outside-the-box thinking. I do Extract, Transform, Load (ETL) work, a lot of it in Python.

  • You are already using re.compile, which is a big help.
  • The regex engine works left to right, so short-circuit where you can. That doesn't seem to apply here.
  • If you have a big chunk of data that you are going to loop over multiple times, clean it up front ONCE by removing anything you KNOW won't match. Think of an HTML page: if you only want things from the HEAD and need to run many regexes over that section, extract that section and process only it, not the whole page. Seems obvious, but it isn't always :)
  • Use some metrics; give cProfile a try. Maybe there is some logic around your regex calls that you can speed up. At the very least you can find your bottleneck, and maybe the regex isn't the problem at all.
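
To make the last point concrete, here is a minimal sketch of how you might measure things with timeit and cProfile. The pattern and sample text are copied from the question; the function names extract_codes and process_all, and the workload of 10,000 repeated copies of the sample text, are made up for illustration, so swap in your real loop and data.

```python
import cProfile
import io
import pstats
import re
import timeit

# Pattern and sample text copied from the question.
PATTERN = re.compile(r'\b(C[M0-9]\d{6}|[FM]\d{7})\b')
text = "This is a string with C1234567 and CM123456, CM123, F1234567 and also M1234, M123456"


def extract_codes(s):
    """Return every code in s that the question's pattern matches."""
    return PATTERN.findall(s)


def process_all(texts):
    """Hypothetical stand-in for the real loop over many inputs."""
    return [extract_codes(s) for s in texts]


# Hypothetical workload: the sample text repeated 10,000 times.
texts = [text] * 10_000

# 1) Time the regex call on its own.
seconds = timeit.timeit(lambda: extract_codes(text), number=10_000)
print(f"10,000 findall calls: {seconds:.4f} s")

# 2) Profile the whole loop to see where the time actually goes.
profiler = cProfile.Profile()
profiler.enable()
results = process_all(texts)
profiler.disable()

# Print the five most expensive entries by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

If findall accounts for only a small share of the cumulative time in the report, the regex is probably not your bottleneck and the surrounding code is the better place to look.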

Upvotes: 2
