user2003052

Reputation: 119

Python regex to reorder parts of a file name: removing a duplicated hyphen and splitting a group

I'm trying to rename multiple files using a regex in Python (3.8), reordering the file names for consistency. The aim is to move the part number (PNo) to the beginning of the file name where one is present. Not all files contain a PNo.

Some example file name structures I am working with in my testing are shown below. I've tried to capture some of the possible variations on how things may have been entered previously.

Test Document One PNo 6477 Rev 2
Test Document TwoPno5555 - Rev 1
Test Document 3 PNo5343 rev 2
PNo 6478 - Test Document 4 Rev1
Test Document Five Pno 3333

For the most part, my regex works as desired; however, there are two things I'd still like to achieve:

Documents two and four already contain a hyphen, and it becomes duplicated when the groups are combined into the new file name. I've tried adding [-] to the regex, but it breaks the third group, and I couldn't get it to work for files without a hyphen in their name. What is the best way to address this?

Second, when an existing part number has no space between the letters and the digits (e.g. Pno5555), I'd like to add one in the new file name. Can this be done using the existing capture group somehow? I did consider splitting the PNo into two separate groups, but thought that other four-digit runs in file names (e.g. dates) might break this.

I'd be happy for some critique on what I've done here. This is my first attempt at writing a regex so if there's a better way, I'm all ears. Thx

import os
import re

PNoRegex = re.compile(r"""^(.*?)       # all text before the part number
                   (PNo\s\d{4}|PNo\d{4}|Pno\s\d{4}|Pno\d{4})    # part number details
                   \s*             # remove white space after PNo string
                   (.*)$          # all text after Part No
                   """, re.VERBOSE)

for originalFile in os.listdir('.'):
    fileNameText = PNoRegex.search(originalFile)

    # Skip files without a regex match
    if fileNameText is None:
        continue
    # Separate the groups
    beforePNo   = fileNameText.group(1)
    PNo         = fileNameText.group(2)
    afterPNo    = fileNameText.group(3)

    # Form the reordered filename.
    newFileName = PNo + ' - ' + beforePNo + afterPNo

Edit: Screenshots added of the files.

[Screenshot: list of files before the regex operation]

[Screenshot: list of files after performing the operation]

Upvotes: 2

Views: 147

Answers (2)

Ryszard Czech

Reputation: 18631

Use re.sub:

re.sub(r'(?i)^(.*?)\s*(PNo)\s*(\d{4})\s*(?:-\s*)?(.*)$', r'\2 \3 - \1 \4', string)

See proof.
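As a quick check, the substitution can be run against the sample names from the question. This is a minimal sketch: the `reorder` helper and the `names` list are illustrative names, not part of the answer itself.

```python
import re

# Pattern and replacement template copied from the answer above.
PATTERN = r'(?i)^(.*?)\s*(PNo)\s*(\d{4})\s*(?:-\s*)?(.*)$'
REPLACEMENT = r'\2 \3 - \1 \4'


def reorder(name):
    """Return the name with the part number moved to the front (hypothetical helper)."""
    # Names without a part number do not match and are returned unchanged.
    return re.sub(PATTERN, REPLACEMENT, name)


# Sample names from the question.
names = [
    "Test Document One PNo 6477 Rev 2",
    "Test Document TwoPno5555 - Rev 1",
    "Test Document 3 PNo5343 rev 2",
    "PNo 6478 - Test Document 4 Rev1",
    "Test Document Five Pno 3333",
]

for name in names:
    # repr() makes any leftover doubled or trailing spaces visible.
    print(repr(reorder(name)))
```

Because the replacement template always emits the spaces around \1 and \4, a name whose prefix or suffix is empty (the fourth and fifth samples) ends up with a doubled or trailing space, which may need trimming afterwards.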

Explanation:

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  (?i)                     set flags for this block (case-
                           insensitive) (with ^ and $ matching
                           normally) (with . not matching \n)
                           (matching whitespace and # normally)
--------------------------------------------------------------------------------
  ^                        the beginning of the string
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    PNo                      'PNo'
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \3:
--------------------------------------------------------------------------------
    \d{4}                    digits (0-9) (4 times)
--------------------------------------------------------------------------------
  )                        end of \3
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    -                        '-'
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )?                       end of grouping
--------------------------------------------------------------------------------
  (                        group and capture to \4:
--------------------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \4
--------------------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
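
One possible variation, not part of the answer as posted: re.sub also accepts a function as the replacement, which makes it easy to skip empty groups instead of relying on a fixed template. The sketch below reuses the same pattern; `_rebuild` and `tidy_reorder` are hypothetical helper names.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
PNO_RE = re.compile(r'(?i)^(.*?)\s*(PNo)\s*(\d{4})\s*(?:-\s*)?(.*)$')


def _rebuild(match):
    """Build the replacement text from the four groups (hypothetical helper)."""
    before, label, digits, after = match.groups()
    # Join only the non-empty text parts so no stray spaces are produced.
    rest = ' '.join(part for part in (before.strip(), after.strip()) if part)
    head = f'{label} {digits}'
    return f'{head} - {rest}' if rest else head


def tidy_reorder(name):
    """Move the part number to the front; names without one are unchanged."""
    return PNO_RE.sub(_rebuild, name)


if __name__ == '__main__':
    for name in ("PNo 6478 - Test Document 4 Rev1", "Test Document Five Pno 3333"):
        print(repr(tidy_reorder(name)))
```

The function form costs a few more lines than a template string, but it keeps the separator logic in one place if the naming scheme changes later.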

Upvotes: 0

The fourth bird

Reputation: 163477

You can shorten the alternation to (P[nN]o)\s?(\d{4}) using a character class and matching an optional whitespace char.

You could use two capturing groups instead of one, one for the PNo label and one for the digits, so that the space between them can be written out in the new name whether or not the original had it.

To consume the optional hyphen, match any run of whitespace characters or hyphens after the digits with the character class [-\s]*
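
To illustrate the two points above, here is a small sketch comparing the question's alternation with the shortened form and rebuilding the part number with a single space. `normalize_pno` is a hypothetical helper name. Since the digits are only matched directly after the label, a separate four-digit run elsewhere in the name (such as a year) is not picked up.

```python
import re

# The four-way alternation from the question and the shortened form.
ORIGINAL = re.compile(r'PNo\s\d{4}|PNo\d{4}|Pno\s\d{4}|Pno\d{4}')
SHORT = re.compile(r'(P[nN]o)\s?(\d{4})')


def normalize_pno(text):
    """Return the part number as 'label digits', or None (hypothetical helper)."""
    match = SHORT.search(text)
    if match is None:
        return None
    label, digits = match.groups()
    return f'{label} {digits}'  # always exactly one space between the two groups


for sample in ["PNo 6477", "Pno5555", "PNo5343", "Pno 3333"]:
    # Both patterns accept every spelling used in the sample names.
    assert ORIGINAL.fullmatch(sample) and SHORT.fullmatch(sample)
    print(normalize_pno(sample))
```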

This will result in separate groups for the parts in the current example data.

^(.*?)(P[nN]o)\s?(\d{4})[-\s]*(.*)$

Regex demo
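
Putting this together, the pattern could be applied to real files roughly as follows. This is a sketch under a few assumptions not stated in the question: the files have ordinary extensions (so the regex is applied to the stem only), and a dry run that only prints the plan is wanted before anything is renamed. `new_stem` and `plan_renames` are hypothetical helper names.

```python
import re
from pathlib import Path

# Pattern from the answer above: prefix, "PNo"/"Pno", four digits, suffix.
PNO_RE = re.compile(r'^(.*?)(P[nN]o)\s?(\d{4})[-\s]*(.*)$')


def new_stem(stem):
    """Return the reordered stem, or None when no part number is found."""
    match = PNO_RE.match(stem)
    if match is None:
        return None
    before, label, digits, after = (group.strip() for group in match.groups())
    # Skip empty parts so no doubled spaces or dangling separators appear.
    rest = ' '.join(part for part in (before, after) if part)
    head = f'{label} {digits}'
    return f'{head} - {rest}' if rest else head


def plan_renames(folder='.'):
    """Yield (old_path, new_path) pairs without touching the file system."""
    for path in Path(folder).iterdir():
        if not path.is_file():
            continue
        stem = new_stem(path.stem)
        # Only report files whose name would actually change.
        if stem is not None and stem != path.stem:
            yield path, path.with_name(stem + path.suffix)


if __name__ == '__main__':
    for old, new in plan_renames('.'):
        # Dry run only; call old.rename(new) once the plan looks right.
        print(f'{old.name} -> {new.name}')
```

Working on `Path.stem` keeps the extension out of the last group, so a name that ends with the part number does not pick up a space before its extension.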

Upvotes: 1
