significance
significance

Reputation: 5105

most efficient way to go about identifying sub-strings in a string in python?

i need to search a fairly lengthy string for CPV (common procurement vocab) codes.

at the moment i'm doing this with a simple for loop and str.find()

the problem is, if the CPV code has been listed in a slightly different format, this algorithm won't find it.

what's the most efficient way of searching for all the different iterations of the code within the string? Is it simply a case of reformatting each of the up to 10,000 CPV codes and using str.find() for each instance?

An example of different formatting could be as follows

30124120-1 
301241201 
30124120 - 1
30124120 1
30124120.1

etc.

Thanks :)

Upvotes: 3

Views: 238

Answers (3)

Hugh Bothwell
Hugh Bothwell

Reputation: 56694

cpv = re.compile(r'(\d{8})(?:[ -.\t/\\]*)(\d{1}\b)')

for m in re.finditer(cpv, ex):
    cpval,chk = m.groups()
    print("{0}-{1}".format(cpval,chk))

applied to your sample data returns

30124120-1
30124120-1
30124120-1
30124120-1
30124120-1

The regular expression can be read as

(\d{8})         # eight digits

(?:             # followed by a sequence which does not get returned
  [ -.\t/\\]*   #   consisting of 0 or more
)               #   spaces, hyphens, periods, tabs, forward- or backslashes

(\d{1}\b)       # followed by one digit, ending at a word boundary
                #   (ie whitespace or the end of the string)

Hope that helps!

Upvotes: 1

Fred Foo
Fred Foo

Reputation: 363807

Try a regular expression:

>>> cpv = re.compile(r'([0-9]+[-\. ]?[0-9])')
>>> print cpv.findall('foo 30124120-1 bar 21966823.1 baz')
['30124120-1', '21966823.1']

(Modify until it matches the CPVs in your data closely.)

Upvotes: 4

Rafe Kettler
Rafe Kettler

Reputation: 76965

Try using any of the functions in re (regular expressions for Python). See the docs for more info.

You can craft a regular expression to accept a number of different formats for these codes, and then use re.findall or something similar to extract the information. I'm not certain what a CPV is so I don't have a regular expression for it (though maybe you could see if Google has any?)

Upvotes: 1

Related Questions