Reputation: 347
I've done a lot of searching and experimenting already with this one! I am sure it's probably very simple.
I use Python to scan the contents of PDFs to read the serial numbers of machines I buy in. All machines are wiped with specialized software, and the serial number used to be the final 'word' in the first line of the document, so it was easy to extract without a regex. Now the software that does the wiping sometimes puts the serial on the second line of the PDF. I guess I could use a bunch of chained 'if' clauses to get around the problem, but I would rather use regexes.
The machines I deal with are Dells, Macs and iPhones. Dells have 7 upper-case alphanumeric 'service tags'/serials. Macs and iPhones can be either 11 or 12 upper-case alphanumeric characters, depending on when they were manufactured.
This is what I have so far... I am a little concerned about 'false positives' creeping in. The verification of the serial number is based on length. The date, the time and the version of the wiping software also end up in the list of serial results.
import re

output = convert_pdf_to_txt(file_name)
# Get the serial: join the first two lines, run a regex over them, sort the
# matches by length and take the longest. Searching only the header limits
# the scope, so there are fewer spurious results.
lines = output.splitlines()
# Join with a space so the last word of line 1 and the first word of line 2 stay separate
docHeader = " ".join(lines[:2])
# regex to find words that contain at least one digit
serialRegex = r"(\w*\d[\w\d]+)"
serialResults = re.findall(serialRegex, docHeader)
# sorted() returns a new list, so the result has to be assigned
serialResults = sorted(serialResults, key=len)
serial = serialResults[-1]
The tests I have done so far have been fine, but it's only a matter of time before Dell come up with some laptop whose (alphanumeric) model number is 8 characters long and thus supplants the serial... just thinking out loud.
An example of how the top of the PDF document reads... the serial (obscured) is on the second line here.
My version of Python (2.7) doesn't seem to give back any results when the regex starts with ^ or ends with $.
In summary, how do I write a regex that returns results only when a string contains alphanumeric words of length 7, 11 or 12 uppercase chars?
Thanks. WL
Upvotes: 0
Views: 1147
Reputation: 637
As the official documentation shows, you can use {m,n} to specify the number of repetitions to match, and \b to match the beginning or end of a word:
re.findall(r'\b[A-Z0-9]{7,7}\b', docHeader)
11 and 12 repetitions can be done with the same idea.
re.findall(r'\b[A-Z0-9]{11,11}\b', docHeader)
re.findall(r'\b[A-Z0-9]{12,12}\b', docHeader)
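If it helps, the three patterns can probably be folded into one with an alternation. This is only a sketch: the header text below is made up (the real layout of the wipe report isn't shown in the question), and the names SERIAL_RE and find_serial_candidates are just for illustration.

```python
import re

# Hypothetical header text; the actual wipe report layout is an assumption.
doc_header = "Wipe Report v3.2.1 2015-06-01 Latitude E6440\nSerial: ABC1234"

# One pattern for all three lengths: a whole word of exactly 7,
# or of 11 to 12, upper-case letters and digits.
SERIAL_RE = re.compile(r'\b(?:[A-Z0-9]{7}|[A-Z0-9]{11,12})\b')

def find_serial_candidates(text):
    """Return every whole word in text that has a serial-like shape."""
    return SERIAL_RE.findall(text)

candidates = find_serial_candidates(doc_header)
```

Because of the \b on both sides, an 8-character model number or a lower-case word is skipped rather than partially matched. As for ^ and $: they anchor to the start and end of the whole string (unless re.MULTILINE is set), which is likely why an anchored pattern returned nothing on the joined header, whereas \b anchors to each word. A 7-character upper-case word that isn't a serial would still match, so checking the number of candidates before trusting the result seems prudent.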
Upvotes: 1