Reputation: 5105
I need to search a fairly lengthy string for CPV (Common Procurement Vocabulary) codes.
At the moment I'm doing this with a simple for loop and str.find().
The problem is that if a CPV code has been listed in a slightly different format, this approach won't find it.
What's the most efficient way of searching for all the different variations of a code within the string? Is it simply a case of reformatting each of the up to 10,000 CPV codes and calling str.find() for each variant?
An example of the different formatting could be as follows:
30124120-1
301241201
30124120 - 1
30124120 1
30124120.1
etc.
Thanks :)
Upvotes: 3
Views: 238
Reputation: 56694
import re
ex = "30124120-1 301241201 30124120 - 1 30124120 1 30124120.1"
cpv = re.compile(r'(\d{8})(?:[-. \t/\\]*)(\d\b)')
for m in cpv.finditer(ex):
    cpval, chk = m.groups()
    print("{0}-{1}".format(cpval, chk))
applied to your sample data returns
30124120-1
30124120-1
30124120-1
30124120-1
30124120-1
The regular expression can be read as
(\d{8})          # eight digits
(?:              # followed by a sequence which does not get returned
  [-. \t/\\]*    # consisting of 0 or more hyphens, periods, spaces,
)                # tabs, forward- or backslashes
(\d\b)           # followed by one digit, ending at a word boundary
                 # (i.e. before a non-word character or the end of the string)
The hyphen is listed first inside the character class so that it is taken literally; placed between two other characters it would define a range instead.
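If the goal is to report only codes that appear in your list of up to 10,000 CPVs, one option is to normalise each match and test it against a set, rather than calling str.find() once per code and per format. The following is a minimal sketch under that assumption; find_known_cpvs, known_codes and the sample text are hypothetical names and data used purely for illustration.

```python
import re

# Eight digits, optional separators, one check digit (same idea as above).
CPV_PATTERN = re.compile(r'(\d{8})[-. \t/\\]*(\d)\b')


def find_known_cpvs(text, known_codes):
    """Return the known CPV codes found in text, in canonical 'DDDDDDDD-D' form."""
    found = set()
    for match in CPV_PATTERN.finditer(text):
        code = "{0}-{1}".format(*match.groups())
        if code in known_codes:  # set membership is O(1) on average
            found.add(code)
    return found


# Hypothetical data, for illustration only.
known_codes = {"30124120-1", "03111000-2"}
text = "Toner (30124120 - 1), seeds 03111000.2, unrelated 99999999-9"
print(sorted(find_known_cpvs(text, known_codes)))
```

This scans the string once, so the cost depends on the length of the text rather than on the number of codes multiplied by the number of formats.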
Hope that helps!
Upvotes: 1
Reputation: 363807
Try a regular expression:
>>> import re
>>> cpv = re.compile(r'([0-9]+[-\. ]?[0-9])')
>>> print(cpv.findall('foo 30124120-1 bar 21966823.1 baz'))
['30124120-1', '21966823.1']
(Modify until it matches the CPVs in your data closely.)
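As one possible refinement, here is a sketch that assumes a CPV code is always exactly eight digits plus a single check digit, separated by at most one hyphen or period with optional whitespace around it. That format is an assumption based on the samples in the question, so adjust it to your data.

```python
import re

# Assumed format: exactly eight digits, an optional hyphen or period with
# optional surrounding whitespace, then a single check digit.
cpv = re.compile(r'\b(\d{8})\s*[-.]?\s*(\d)\b')

samples = "30124120-1 / 301241201 / 30124120 - 1 / 30124120 1 / 30124120.1"

# Rebuild every match in one canonical form.
codes = ["-".join(m.groups()) for m in cpv.finditer(samples)]
print(codes)
```

The leading \b and the fixed {8} stop the pattern from matching arbitrary short numbers or the tail of a longer digit run. Note that \s also matches newlines, so a code split across two lines would still be picked up; replace it with a literal space if that is not wanted.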
Upvotes: 4
Reputation: 76965
Try using any of the functions in re (Python's regular expression module). See the docs for more info.
You can craft a regular expression that accepts a number of different formats for these codes, and then use re.findall or something similar to extract the information. I'm not certain what a CPV is, so I don't have a regular expression for it (though maybe you could see if Google has any?).
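For instance, re.sub can rewrite every variant into a single canonical form up front, after which a plain str.find() loop works unchanged. This is only a sketch: the pattern assumes eight digits, optional hyphens, periods, spaces or tabs, and one check digit, and canonicalise is a hypothetical helper name.

```python
import re

# Assumption: eight digits, optional separators, one check digit.
CPV_PATTERN = re.compile(r'\b(\d{8})[-. \t]*(\d)\b')


def canonicalise(text):
    """Rewrite every CPV-like token in text as 'DDDDDDDD-D'."""
    return CPV_PATTERN.sub(r'\1-\2', text)


text = "Lot 1: 30124120.1 and lot 2: 301241201"
clean = canonicalise(text)
print(clean)

# The original str.find() search now only needs the canonical spelling.
print(clean.find("30124120-1") != -1)
```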
Upvotes: 1