stackuser

Reputation: 869

Python - regex split file name by 1 or more hyphens

This reads in file names and uses a regex to check them against a proper format. The problem is that the hyphens may not always be there, so re.split() can produce an unpredictable result, which makes it difficult to "reconstruct" a properly formatted string afterwards, though I'm not ruling that method out. Another problem with split() is that any whitespace remains afterwards, which negates any benefit once the string is reconstructed. So I tried one regex with finditer() and another with findall(), but these still find only the first 6 digits.

Here's an example of the proper filename (improper names have varying whitespace):

201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt

Here's some of what I've been trying. I'll spare you the rest of the mess (there's a bigger program around it):

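import re  # f0 holds one file name from the surrounding program
#res = re.findall(r"[\.\-]",f0)
res = [str(m.group(0)) for m in re.finditer(r'[^\-]', f0)]
if res: print(res)
else: print("error on %s" % f0)  # report the file name; res is empty here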


s0 = "['@ @ @ @ @ @ @ @ 201308    ', '    (12345)', 'ABC 2233L', '007', 'course Name', 'last, first.txt']"
#s = f0.split('-'); s = s[0]; print "sssss  ",s#,type(s)

An example of a string with incorrect whitespace is:

201308-(82609)-MAC 2233-007-Methods of Calculus - Klingler, Lee.txt

The main goal is to take in the filenames (which could be totally wrong, with any number of symbols, letters, digits, or whitespace) and turn them into the proper format. Since I can't check for every possible error, I'm trying to at least fix the extra (or missing) whitespace using these methods.

Upvotes: 0

Views: 381

Answers (1)

jhermann

Reputation: 2101

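This works on a simple principle: normalize any hyphen whose neighbors are either a digit before and a non-alphanumeric character after, or a non-digit before and an alphanumeric character after. Hyphens between two digits, or between a digit and a letter (as in 2233-007-Methods), are left alone.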

>>> import re
>>> name = "201308-(82609)-MAC 2233-007-Methods of Calculus - Klingler, Lee.txt"
>>> re.sub(r"(?<=[0-9]) ?- ?(?=[^0-9a-zA-Z])", " - ", re.sub(r"(?<=[^0-9]) ?- ?(?=[0-9a-zA-Z])", " - ", name))
'201308 - (82609) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt'
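
The two substitutions can also be wrapped in a small helper. What follows is only a sketch: the function name normalize_separators and the two pattern names are made up for illustration, and it swaps the ` ?` in the patterns above for `\s*` (with whitespace excluded from the lookarounds), on the assumption that badly formed names may carry more than one space around a hyphen.

```python
import re

# Hypothetical names, not part of the original answer.
# Rule 1: hyphen preceded by a non-digit and followed by a letter or digit.
AFTER_NON_DIGIT = re.compile(r"(?<=[^0-9\s])\s*-\s*(?=[0-9a-zA-Z])")
# Rule 2: hyphen preceded by a digit and followed by a non-alphanumeric character.
AFTER_DIGIT = re.compile(r"(?<=[0-9])\s*-\s*(?=[^0-9a-zA-Z\s])")


def normalize_separators(name):
    """Rewrite separator hyphens as ' - ', leaving hyphens such as 2233-007 alone."""
    name = AFTER_NON_DIGIT.sub(" - ", name)
    return AFTER_DIGIT.sub(" - ", name)


if __name__ == "__main__":
    messy = "201308    -    (12345)   -ABC 2233-007-Course Name -  Last, First.txt"
    print(normalize_separators(messy))
```

For the example name above this should give the same result as the one-liner, and a name that is already in the proper format comes back unchanged.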

Upvotes: 1
