stackuser

Reputation: 869

Python - regex on input string to format whitespace in filenames

I'm trying to write a regex in Python that takes incorrectly formatted filenames and fixes them. It works on some of the strings (f0-f5) but not others:

import re, os, sys

f0 = '201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt'
f1 = '201308 - (12345) - ABC 2233L-007-course Name 1 - last, first.txt'
f2 = '@ @ @ @ @ @ 123 abc - a-1 - b-2.txt'
f3 = '201308-(82609)     -MAC 2233-007-Methods of Calculus  -  Klingler, Lee.txt'
f4 = '201308    -    (12345)-ABC 2233L-007-course Name-last, first.txt'
f5 = '201308-(12345)-ABC 2233L-007-course Name-last, first.txt'

term = "201308"
res1 = re.search(term + r'\s-\s\(\d{5}\)\s-\s\w{3}\s\d{4}\w?-\d{3}-[^\.]+\s-\s[^\.]+\.txt', f1)
r1 = re.sub(r"(?<=[0-9]) ?- ?(?=[^0-9a-zA-Z])", " - ", re.sub(r"(?<=[^0-9]) ?- ?(?=[0-9a-zA-Z])", " - ", f4))
r2 = re.sub(r"(?<=[0-9]) *- *(?=[^0-9a-zA-Z])", " - ", re.sub(r"(?<=[^0-9]) *- *(?=[0-9a-zA-Z])", " - ", f4)) # success!!
if r1: print r1
else: print "error on %s"%r1
if r2: print r2
else: print "error on %s"%r2

As you can see from f0-f5 (except f2, which is just an outlier), the goal is to take strings with improper whitespace around the hyphens that separate the elements, and put them back together in the proper format, as attempted in r1 and r2. res1 is the regex that matches a properly formatted string.

Here's an example of the proper filename (improper names have varying whitespace):

201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt

For the same reason, re.split() cannot be used because it will not normalize the whitespace between hyphens to whatever is needed at that point in the string.

f3 can be fixed by r2, but f4 and f5 are really the kinds of errors that I need to correct for. The whitespace before and after hyphens (only) needs to match exactly the proper format.
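
For reference, the "properly formatted" check that res1 performs can be wrapped in a small helper. This is only a sketch: is_well_formed is a hypothetical name, and it assumes Python 3.4+ because it uses re.fullmatch (the whole name must match) where the snippet above uses re.search.

```python
import re

def is_well_formed(name, term):
    """Return True if name already has the proper layout for the given term."""
    # Same pattern as res1, written as a raw string; re.escape guards the term
    # in case it ever contains regex metacharacters.
    pattern = (re.escape(term) +
               r'\s-\s\(\d{5}\)\s-\s\w{3}\s\d{4}\w?-\d{3}-[^.]+\s-\s[^.]+\.txt')
    return re.fullmatch(pattern, name) is not None

names = [
    '201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt',
    '201308-(12345)-ABC 2233L-007-course Name-last, first.txt',
]
for name in names:
    print(is_well_formed(name, '201308'), name)
```

A check like this could be used to skip names that are already fine and to confirm that a candidate fix actually produced the proper layout.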

EDIT:

Thank you to everyone who wrote in with a solution. You definitely taught me a lot today, and rest assured that you have single-handedly improved at least 1 person's ability to program. The icing on the cake is the help with a problem that has frustrated me for a week now. Unfortunately, only 1 answer can be accepted; that's the hard part.

It was a very close call, but I went with the regex, only because time complexity may not be a big issue here with fewer than 500 files in total (divided into directories, so there is not much to slow each pass of the program, which is driven by user input anyhow). Also, I learned so much from the regex; my head is spinning from how much info I'm getting here.

Upvotes: 1

Views: 378

Answers (4)

Jerry

Reputation: 71548

Well, I think you can take the regex you currently have for the valid file name and tweak it a little:

\s*-\s*(\(\d{5}\))\s*-\s*(\w{3}\s\d{4}\w?-\d{3}-(?:[^.\s]|\b\s\b)+)\s*-\s*([^.]+\.txt)$

And use a replace of:

 - \1 - \2 - \3

(there's also a space before the first hyphen)

I added a * to each \s and used (?:[^.\s]|\b\s\b)+ (to allow for single spaces within the course name) instead of [^\.]+ (note that the period in [^\.] need not be escaped).

regex101 demo.

>>> f4 = '201308    -    (12345)-ABC 2233L-007-course Name-last, first.txt'
>>> print(re.sub(r'\s*-\s*(\(\d{5}\))\s*-\s*(\w{3}\s\d{4}\w?-\d{3}-(?:[^.\s]|\b\s\b)+)\s*-\s*([^.]+\.txt)$', r' - \1 - \2 - \3', f4))
201308 - (12345) - ABC 2233L-007-course Name - last, first.txt
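
If you want to run this over many names, the substitution could be wrapped roughly as follows. This is a sketch rather than anything definitive: FIX and normalize are hypothetical names, and the pattern is the same one as above, only split across lines for readability.

```python
import re

# The pattern from above, compiled once. It is anchored at the end of the
# name ($), so the text before the first matched hyphen is left untouched.
FIX = re.compile(
    r'\s*-\s*(\(\d{5}\))'                                 # the 5-digit number in parentheses
    r'\s*-\s*(\w{3}\s\d{4}\w?-\d{3}-(?:[^.\s]|\b\s\b)+)'  # e.g. "ABC 2233L-007-course Name"
    r'\s*-\s*([^.]+\.txt)$'                               # e.g. "last, first.txt"
)

def normalize(name):
    """Return (new_name, matched); names the pattern does not match come back unchanged."""
    fixed, count = FIX.subn(r' - \1 - \2 - \3', name)
    return fixed, count == 1

f2 = '@ @ @ @ @ @ 123 abc - a-1 - b-2.txt'
f4 = '201308    -    (12345)-ABC 2233L-007-course Name-last, first.txt'
for f in (f2, f4):
    print(normalize(f))
```

Using subn() instead of sub() gives you the match count, so outliers like f2 can be reported instead of being passed through silently.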

Upvotes: 3

BartoszKP

Reputation: 35891

For such a complicated task it would be simpler to write a simple parser. Even without writing a parser, something like this still seems easier to manage than regexps:

f0 = '201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt'
f1 = '201308 - (12345) - ABC 2233L-007-course Name 1 - last, first.txt'
f2 = '@ @ @ @ @ @ 123 abc - a-1 - b-2.txt'
f3 = '201308-(82609)     -MAC 2233-007-Methods of Calculus  -  Klingler, Lee.txt'
f4 = '201308    -    (12345)-ABC 2233L-007-course Name-last, first.txt'
f5 = '201308-(12345)-ABC 2233L-007-course Name-last, first.txt'

for f in [f0,f1,f2,f3,f4,f5]:
    parts = f.split('-')
    parts = [p.strip() for p in parts]
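    for i in range(0, len(parts)):
        if i == 0 or i == 4:
            # parts 0 and 4: space only before the following hyphen
            parts[i] = parts[i] + ' '
        elif i == 2:
            # part 2: space only after the preceding hyphen
            parts[i] = ' ' + parts[i]
        elif i != 3:
            # every other part except part 3 (e.g. '007') is padded on both
            # sides, so the last part also ends with a trailing space
            parts[i] = ' ' + parts[i] + ' '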

    result = "-".join(parts)

    print result

A few such ifs should work if the input data is mostly similar to what you've presented.

Result:

201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt 
201308 - (12345) - ABC 2233L-007-course Name 1 - last, first.txt 
@ @ @ @ @ @ 123 abc - a - 1-b-2.txt 
201308 - (82609) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt 
201308 - (12345) - ABC 2233L-007-course Name - last, first.txt 
201308 - (12345) - ABC 2233L-007-course Name - last, first.txt 
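
The same split-and-rejoin idea can also be written with an explicit list of separators instead of index checks. This is only a sketch under the assumption that a proper name always has exactly six hyphen-separated parts; SEPARATORS and normalize_by_split are hypothetical names.

```python
# Separators expected between the six hyphen-delimited parts of a proper name,
# e.g. "201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt"
SEPARATORS = [' - ', ' - ', '-', '-', ' - ']

def normalize_by_split(name):
    """Rebuild name with the expected separators, or return None if the part count is off."""
    parts = [p.strip() for p in name.split('-')]
    if len(parts) != len(SEPARATORS) + 1:
        # e.g. f2, or a title/surname that itself contains a hyphen
        return None
    pieces = [parts[0]]
    for sep, part in zip(SEPARATORS, parts[1:]):
        pieces.append(sep)
        pieces.append(part)
    return ''.join(pieces)

f5 = '201308-(12345)-ABC 2233L-007-course Name-last, first.txt'
print(normalize_by_split(f5))
```

Written this way there is no trailing space on the last part, and names with an unexpected number of hyphens are rejected instead of being partially reformatted.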

Upvotes: 2

Birei

Reputation: 36272

I would use findall() instead of a sub(), like:

re.findall(r'^(\d+)\s*-\s*(\(\d+\))\s*-\s*(.*?)\s*-\s*(\S+,.*)$', string)

A demonstration:

import re, os, sys

f0 = '201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt'
f1 = '201308 - (12345) - ABC 2233L-007-course Name 1 - last, first.txt'
f2 = '@ @ @ @ @ @ 123 abc - a-1 - b-2.txt'
f3 = '201308-(82609)     -MAC 2233-007-Methods of Calculus  -  Klingler, Lee.txt'
f4 = '201308    -    (12345)-ABC 2233L-007-course Name-last, first.txt'
f5 = '201308-(12345)-ABC 2233L-007-course Name-last, first.txt'

for f in [f0, f1, f3, f4, f5]:
    print(' - '.join(str(elem) for elem in re.findall(r'^(\d+)\s*-\s*(\(\d+\))\s*-\s*(.*?)\s*-\s*(\S+,.*)$', f)[0]))

It yields:

201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt
201308 - (12345) - ABC 2233L-007-course Name 1 - last, first.txt
201308 - (82609) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt
201308 - (12345) - ABC 2233L-007-course Name - last, first.txt
201308 - (12345) - ABC 2233L-007-course Name - last, first.txt
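
Note that indexing the findall() result with [0] raises an IndexError for names that do not match, which is why f2 is left out of the loop above. A guarded variant might look like this; it is a sketch, and PARTS and rejoin are hypothetical names wrapped around the same pattern.

```python
import re

# Same pattern as above: term, parenthesised number, middle section, "last, first.txt"
PARTS = re.compile(r'^(\d+)\s*-\s*(\(\d+\))\s*-\s*(.*?)\s*-\s*(\S+,.*)$')

def rejoin(name):
    """Return the name with ' - ' between the four captured groups, or None if it does not match."""
    m = PARTS.match(name)
    if m is None:
        return None
    return ' - '.join(m.groups())

f2 = '@ @ @ @ @ @ 123 abc - a-1 - b-2.txt'
f4 = '201308    -    (12345)-ABC 2233L-007-course Name-last, first.txt'
for f in (f2, f4):
    print(rejoin(f))
```

Keep in mind that the last group relies on the comma in "last, first", so a name without one would be treated as a non-match.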

Upvotes: 1

Ofir Israel

Reputation: 3913

Does something like this work out for you? I wasn't sure whether the hyphens are the only thing bothering you.

' '.join(filter(lambda x: x!='', f4.replace('-',' - ').split(' ')))

Example:

>>> f0 = '201308 - (82608) - MAC 2233-007-Methods of Calculus - Klingler, Lee.txt'
>>> f1 = '201308 - (12345) - ABC 2233L-007-course Name 1 - last, first.txt'
>>> f2 = '@ @ @ @ @ @ 123 abc - a-1 - b-2.txt'
>>> f3 = '201308-(82609)     -MAC 2233-007-Methods of Calculus  -  Klingler, Lee.txt'
>>> f4 = '201308    -    (12345)-ABC 2233L-007-course Name-last, first.txt'
>>> f5 = '201308-(12345)-ABC 2233L-007-course Name-last, first.txt'

>>> for i in [f0,f1,f2,f3,f4,f5]:
...     print ' '.join(filter(lambda x: x!='', i.replace('-',' - ').split(' ')))
201308 - (82608) - MAC 2233 - 007 - Methods of Calculus - Klingler, Lee.txt
201308 - (12345) - ABC 2233L - 007 - course Name 1 - last, first.txt
@ @ @ @ @ @ 123 abc - a - 1 - b - 2.txt
201308 - (82609) - MAC 2233 - 007 - Methods of Calculus - Klingler, Lee.txt
201308 - (12345) - ABC 2233L - 007 - course Name - last, first.txt
201308 - (12345) - ABC 2233L - 007 - course Name - last, first.txt
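
The one-liner can be shortened a little, since str.split() with no argument already drops empty strings. A sketch, with pad_all_hyphens as a hypothetical name; unlike split(' '), it also collapses tabs and other whitespace.

```python
def pad_all_hyphens(name):
    """Put one space on each side of every hyphen and collapse other runs of whitespace."""
    # split() with no argument discards empty strings, so no filter() is needed
    return ' '.join(name.replace('-', ' - ').split())

f5 = '201308-(12345)-ABC 2233L-007-course Name-last, first.txt'
print(pad_all_hyphens(f5))
# 201308 - (12345) - ABC 2233L - 007 - course Name - last, first.txt
```

As the output shows, this pads every hyphen, including the two inside "2233L-007-course", so it only fits if that is acceptable for your filenames.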

Upvotes: 0
