Reputation: 869
Trying to write a regex in Python that will take filenames that are incorrectly formatted, and fix them. This works on some of the strings (f0-f5) but not others:
import re, os, sys
f0 = '201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt'
f1 = '201308 - (12345) - ABC 2233L-007-course Name 1 - last, first.txt'
f2 = '@ @ @ @ @ @ 123 abc - a-1 - b-2.txt'
f3 = '201308-(82609) -MAC 2233-007-Methods of Calculus - Klingler, Lee.txt'
f4 = '201308 - (12345)-ABC 2233L-007-course Name-last, first.txt'
f5 = '201308-(12345)-ABC 2233L-007-course Name-last, first.txt'
term = "201308"
res1 = re.search(term + r'\s-\s\(\d{5}\)\s-\s\w{3}\s\d{4}\w?-\d{3}-[^\.]+\s-\s[^\.]+\.txt', f1)
r1 = re.sub(r"(?<=[0-9]) ?- ?(?=[^0-9a-zA-Z])", " - ", re.sub(r"(?<=[^0-9]) ?- ?(?=[0-9a-zA-Z])", " - ", f4))
r2 = re.sub(r"(?<=[0-9]) *- *(?=[^0-9a-zA-Z])", " - ", re.sub(r"(?<=[^0-9]) *- *(?=[0-9a-zA-Z])", " - ", f4)) # success!!
if r1: print(r1)
else: print("error on %s" % r1)
if r2: print(r2)
else: print("error on %s" % r2)
As you can see from f0-f5 (except f2, which is just an outlier), this is intended to take strings with improper whitespace around the hyphens (which divide up the elements here) and put them back together in the proper format, as attempted in r1 and r2. res1 is the regex that matches a properly formatted string.
Here's an example of the proper filename (improper names have varying whitespace):
201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt
For the same reason, re.split() cannot be used, because it will not normalize the whitespace around each hyphen to whatever is needed at that point in the string.
f3 can be fixed by r2, but f4 and f5 are really the kinds of errors that I need to correct. The whitespace before and after the hyphens (and only that) needs to match the proper format exactly.
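To make the target concrete, here is a minimal sketch that checks which names already match the proper format. It reuses the res1 pattern from above; the names PROPER and is_proper are only illustrative, and using fullmatch (Python 3.4+) instead of search is my own assumption about what "properly formatted" should mean.

```python
import re

TERM = "201308"

# Same pattern as res1 above, written as a raw string.
# PROPER and is_proper are illustrative names, not part of the original code.
PROPER = re.compile(
    re.escape(TERM)
    + r"\s-\s\(\d{5}\)\s-\s\w{3}\s\d{4}\w?-\d{3}-[^.]+\s-\s[^.]+\.txt"
)

def is_proper(name):
    """Return True when the whole filename matches the proper format."""
    return PROPER.fullmatch(name) is not None

names = [
    '201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt',
    '201308-(82609) -MAC 2233-007-Methods of Calculus - Klingler, Lee.txt',
    '201308 - (12345)-ABC 2233L-007-course Name-last, first.txt',
]
for n in names:
    print(is_proper(n), n)
```

Only the first of these three names passes; the other two are the kind that need fixing.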
EDIT:
Thank you to everyone who wrote in with a solution. You definitely taught me a lot today, and rest assured that you have single-handedly improved at least 1 person's ability to program. Icing on the cake is helping me with a problem that has frustrated me for a week now. Unfortunately, only 1 can be chosen accepted answer, that's the hard part.
It was a very close call, but I went with the regex answer only because time complexity may not be as big an issue here, with fewer than 500 files in total (divided into directories, so there is not much to slow each pass of the program, which is driven by user input anyhow). Also, I just learned so much from the regex; my head is spinning from how much info I'm getting here.
Upvotes: 1
Views: 378
Reputation: 71548
Well, I think you can take the regex you currently have for the valid file name and tweak it a little:
\s*-\s*(\(\d{5}\))\s*-\s*(\w{3}\s\d{4}\w?-\d{3}-(?:[^.\s]|\b\s\b)+)\s*-\s*([^.]+\.txt)$
And use a replace of:
- \1 - \2 - \3
(there's also a space before the first hyphen)
I added some * to the \s and used (?:[^.\s]|\b\s\b)+ (to allow for the spaces within the course name) instead of [^\.]+ (note that the period in [^\.] need not be escaped).
>>> import re
>>> f4 = '201308 - (12345)-ABC 2233L-007-course Name-last, first.txt'
>>> print(re.sub(r'\s*-\s*(\(\d{5}\))\s*-\s*(\w{3}\s\d{4}\w?-\d{3}-(?:[^.\s]|\b\s\b)+)\s*-\s*([^.]+\.txt)$', r' - \1 - \2 - \3', f4))
201308 - (12345) - ABC 2233L-007-course Name - last, first.txt
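If it helps, the same substitution can be wrapped in a small function and run over several names. This is just a sketch: FIX and fix_name are names I made up, and the pattern is the one given above. A name that does not match the pattern (such as f2) comes back unchanged, and I would expect the same for a course title that has a space next to punctuation, since \b\s\b only allows spaces between word characters.

```python
import re

# The pattern from this answer, compiled once. FIX and fix_name are
# illustrative names only.
FIX = re.compile(
    r'\s*-\s*(\(\d{5}\))\s*-\s*'
    r'(\w{3}\s\d{4}\w?-\d{3}-(?:[^.\s]|\b\s\b)+)'
    r'\s*-\s*([^.]+\.txt)$'
)

def fix_name(name):
    """Normalize whitespace around the three separating hyphens."""
    # re.sub leaves the string untouched when the pattern does not match.
    return FIX.sub(r' - \1 - \2 - \3', name)

f2 = '@ @ @ @ @ @ 123 abc - a-1 - b-2.txt'
f3 = '201308-(82609) -MAC 2233-007-Methods of Calculus - Klingler, Lee.txt'
f5 = '201308-(12345)-ABC 2233L-007-course Name-last, first.txt'
for f in [f2, f3, f5]:
    print(fix_name(f))
```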
Upvotes: 3
Reputation: 35891
For such a complicated task it would be simpler to write a simple parser. Even without writing a parser, something like this still seems easier to manage than regexps:
f0 = '201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt'
f1 = '201308 - (12345) - ABC 2233L-007-course Name 1 - last, first.txt'
f2 = '@ @ @ @ @ @ 123 abc - a-1 - b-2.txt'
f3 = '201308-(82609) -MAC 2233-007-Methods of Calculus - Klingler, Lee.txt'
f4 = '201308 - (12345)-ABC 2233L-007-course Name-last, first.txt'
f5 = '201308-(12345)-ABC 2233L-007-course Name-last, first.txt'
for f in [f0,f1,f2,f3,f4,f5]:
    parts = f.split('-')
    parts = [p.strip() for p in parts]
    for i in range(0, len(parts)):
        # Pad each part with spaces depending on its position.
        if i == 0 or i == 4:
            parts[i] = parts[i] + ' '
        elif i == 2:
            parts[i] = ' ' + parts[i]
        elif i != 3:
            parts[i] = ' ' + parts[i] + ' '
    # rstrip() drops the trailing space that the last part picks up.
    result = "-".join(parts).rstrip()
    print(result)
A few such ifs should work if the input data is mostly similar to what you've presented.
Result:
201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt
201308 - (12345) - ABC 2233L-007-course Name 1 - last, first.txt
@ @ @ @ @ @ 123 abc - a - 1-b-2.txt
201308 - (82609) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt
201308 - (12345) - ABC 2233L-007-course Name - last, first.txt
201308 - (12345) - ABC 2233L-007-course Name - last, first.txt
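A variation on the same idea, as a rough sketch: instead of padding by index, split off the first two fields from the left and the last field from the right, then rejoin. The function name normalize and the field names (term, crn, course, instructor) are my guesses from the sample data. It assumes the last field contains no hyphen and that the hyphens inside the course field carry no stray whitespace.

```python
def normalize(name):
    """Rebuild 'term - (crn) - course - instructor.txt' from a messy name."""
    # Hypothetical helper; the field names are guesses based on the examples.
    if name.count('-') < 3:
        return name  # too few separators to be one of these filenames
    # First two hyphens separate the term and the number in parentheses.
    term, crn, rest = (p.strip() for p in name.split('-', 2))
    # The last hyphen separates the course from the instructor;
    # hyphens inside the course field are left alone.
    course, instructor = (p.strip() for p in rest.rsplit('-', 1))
    return ' - '.join([term, crn, course, instructor])

f3 = '201308-(82609) -MAC 2233-007-Methods of Calculus - Klingler, Lee.txt'
f4 = '201308 - (12345)-ABC 2233L-007-course Name-last, first.txt'
for f in [f3, f4]:
    print(normalize(f))
```

A hyphenated last name would break the right-hand split, so this only fits data shaped like the examples.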
Upvotes: 2
Reputation: 36272
I would use findall() instead of a sub(), like:
re.findall(r'^(\d+)\s*-\s*(\(\d+\))\s*-\s*(.*?)\s*-\s*(\S+,.*)$', string)
A demonstration:
import re, os, sys
f0 = '201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt'
f1 = '201308 - (12345) - ABC 2233L-007-course Name 1 - last, first.txt'
f2 = '@ @ @ @ @ @ 123 abc - a-1 - b-2.txt'
f3 = '201308-(82609) -MAC 2233-007-Methods of Calculus - Klingler, Lee.txt'
f4 = '201308 - (12345)-ABC 2233L-007-course Name-last, first.txt'
f5 = '201308-(12345)-ABC 2233L-007-course Name-last, first.txt'
for f in [f0, f1, f3, f4, f5]:
    print(' - '.join(str(elem) for elem in re.findall(r'^(\d+)\s*-\s*(\(\d+\))\s*-\s*(.*?)\s*-\s*(\S+,.*)$', f)[0]))
It yields:
201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt
201308 - (12345) - ABC 2233L-007-course Name 1 - last, first.txt
201308 - (82609) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt
201308 - (12345) - ABC 2233L-007-course Name - last, first.txt
201308 - (12345) - ABC 2233L-007-course Name - last, first.txt
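Note that f2 is left out of the loop above: findall() returns an empty list for it, so indexing with [0] would raise an IndexError. One possible way around that, sketched with re.match and the same pattern (PARTS and rebuild are names I made up for illustration):

```python
import re

# Same pattern as above; PARTS and rebuild are illustrative names.
PARTS = re.compile(r'^(\d+)\s*-\s*(\(\d+\))\s*-\s*(.*?)\s*-\s*(\S+,.*)$')

def rebuild(name):
    """Return the normalized name, or None when the name does not match."""
    m = PARTS.match(name)
    if m is None:
        return None  # e.g. an outlier like f2
    return ' - '.join(m.groups())

f2 = '@ @ @ @ @ @ 123 abc - a-1 - b-2.txt'
f5 = '201308-(12345)-ABC 2233L-007-course Name-last, first.txt'
for f in [f2, f5]:
    print(rebuild(f))
```

Returning None lets the caller decide whether to skip the file or report it.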
Upvotes: 1
Reputation: 3913
Does something like this work out for you? I wasn't sure whether the hyphens are the only thing bothering you.
' '.join(filter(lambda x: x!='', f4.replace('-',' - ').split(' ')))
Example:
>>> f0 = '201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt'
>>> f1 = '201308 - (12345) - ABC 2233L-007-course Name 1 - last, first.txt'
>>> f2 = '@ @ @ @ @ @ 123 abc - a-1 - b-2.txt'
>>> f3 = '201308-(82609) -MAC 2233-007-Methods of Calculus - Klingler, Lee.txt'
>>> f4 = '201308 - (12345)-ABC 2233L-007-course Name-last, first.txt'
>>> f5 = '201308-(12345)-ABC 2233L-007-course Name-last, first.txt'
>>> for i in [f0,f1,f2,f3,f4,f5]:
...     print(' '.join(filter(lambda x: x!='', i.replace('-',' - ').split(' '))))
...
201308 - (82608) - MAC 2233 - 007 - Methods of Calculus - Klingler, Lee.txt
201308 - (12345) - ABC 2233L - 007 - course Name 1 - last, first.txt
@ @ @ @ @ @ 123 abc - a - 1 - b - 2.txt
201308 - (82609) - MAC 2233 - 007 - Methods of Calculus - Klingler, Lee.txt
201308 - (12345) - ABC 2233L - 007 - course Name - last, first.txt
201308 - (12345) - ABC 2233L - 007 - course Name - last, first.txt
Upvotes: 0