Mr.Pebbles1961

Reputation: 21

Regular Expression for finding filenames with certain parameters but not others

First time poster. Not a programmer, but I dabble where it intersects with my work as a composer and sound designer. I apologize if this is unnecessarily detailed, but it's the best way I can explain it.

I've been working on reorganizing my massive collection of music loops and sound effects (nearly 100k files), which includes renaming files more consistently. I've done a lot of the work already using a file search app and a renaming app, but neither accepts regular expressions. I've since found apps that do, and I've been able to improve my turnaround time, but I know it can be improved further.

My current task is to find files that contain a range of strings preceded by, and followed by, a combination of spaces, as well as ones that are preceded by only a single space and are at the end of the filename, before the file extension.

Filenames I am looking for contain:

  • A string containing an uppercase or lowercase letter A-G, possibly followed by a lowercase b or # or m or an uppercase M
  • The string is preceded by one space and followed by one space, but *not* two on both sides
  • Also find results where the string is preceded by two spaces, but followed by only one
  • Also find the reverse: where the string is preceded by one space, but followed by two
  • Also find results where the string is at the end of the filename (before the file extension) and preceded by one space, but not two.

Examples:

Strings can be things like: Am, F#, EM, Db, G
Closer Kit Am 140bpm.wav  (1spaceAm1space)
EXCLUDING Closer Kit  F#  140bpm.wav (2spacesF#2spaces)
Closer Kit  EM 140bpm.wav  (2spacesEM1space)
Closer Kit Db  140bpm.wav  (1spacesDb2spaces)
Closer Kit  140bpm G.wav  (spaceGfileextension)

I started with "search for files containing a space, an uppercase or lowercase letter A-G, possibly followed by a lowercase b or m or # or an uppercase M, and a space":

\s[A-Ga-g][b|#|m|M]\s

This resulted, as I expected, in retrieving files like:

01 Brass Swell  Cm  60bpm.wav  (2spacesCm2spaces)
01 Evolving Pads  Below the City C 60bpm-24b.wav  (1spaceC1space)

But also returned:

125bpm  F# Vocal Chops.wav  (2spacesF#1space)
Brass  140bpm C.wav  (spaceCfileextension)

Where it started to fall apart was when I updated it to indicate one space on either side of the desired string, but not 2:

\s(?!\s{2})[A-G][b|#|m](?!\s{2})\s

This resulted in returning files with any number of spaces on either side of the string, including one and two, as well as some with the desired string at the end, before the file extension:

07 Woodwind—Clarinet  C  60bpm.wav  (2spacesC2spces)
128bpm  Bass Am 1.wav  (1spaceAm1space)
85 Pad Flutter 1 Am.wav  (1spaceAmfileextension)
125bpm  F# Pad.wav  (2spacesF#1space)

I'm assuming my problem has to do with how I'm expressing the EXCLUDE part of the formula? Or am I missing the use of some /'s or \'s? The search app will let me stack search parameters, which might be the best answer, but it only offers "match regex" and not "exclude regex", so I'm unsure how to deal with that.

Suggestions?

(I've looked at the discussions suggested by the algorithm, but they don't seem applicable.)

Upvotes: 1

Views: 70

Answers (2)

The fourth bird

Reputation: 163297

You could match the whole filename without lookarounds, using 3 alternatives.

Note that \s matches more than only a space and could also match a newline.
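To illustrate that note, here is a minimal Python sketch (the variable names are only for illustration) comparing \s with a literal space. If the filenames only ever contain regular spaces, replacing \s with a literal space in the patterns is one way to keep the match strict.

```python
import re

# Characters to test: a regular space, a tab and a newline
candidates = [" ", "\t", "\n"]

# \s is a whitespace class, so it accepts all three characters
matches_class = [ch for ch in candidates if re.fullmatch(r"\s", ch)]

# A literal space in the pattern accepts only the space itself
matches_space = [ch for ch in candidates if re.fullmatch(r" ", ch)]

print(matches_class)
print(matches_space)
```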

^(?:.*\S)?(?:\s[A-Ga-g][b#mM]?\s\s?(?:\S|$)|\s\s?[A-Ga-g][b#mM]?\s(?:\S|$)|\s[A-Ga-g][b#mM]?\.[a-z]).*

The pattern matches:

  • ^ Start of string
  • (?:.*\S)? Optionally match until a non whitespace character
  • (?: Non capture group for the alternatives
    • \s Match a single whitespace char
    • [A-Ga-g][b#mM]? Match a char in the range of A-G or a-g, optionally match one of b # m M
    • \s\s? Match 1 or 2 whitespace chars
    • (?:\S|$) Match either a non whitespace char or assert the end of the string
    • | Or
    • \s\s?[A-Ga-g][b#mM]?\s(?:\S|$) The same pattern as before, but now the amount of whitespace chars is reversed
    • | Or
    • \s[A-Ga-g][b#mM]?\.[a-z] Match a single whitespace char, the allowed characters to match followed by a dot and a char a-z (or you can make that more specific to match the allowed extensions like \.wav\b etc..)
  • ) Close the non capture group
  • .* Match the rest of the line

See a regex demo
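As a rough check of the pattern above, this Python sketch runs it against filenames taken from the question (with the trailing annotations removed). The names PATTERN, FILENAMES and matched are just for this example, and Python's re module stands in for whatever engine the search app uses, so behaviour there may differ.

```python
import re

# The pattern from above, split over several lines for readability
PATTERN = re.compile(
    r"^(?:.*\S)?(?:\s[A-Ga-g][b#mM]?\s\s?(?:\S|$)"
    r"|\s\s?[A-Ga-g][b#mM]?\s(?:\S|$)"
    r"|\s[A-Ga-g][b#mM]?\.[a-z]).*"
)

# Filenames from the question; the spacing is significant
FILENAMES = [
    "Closer Kit Am 140bpm.wav",       # 1 space, Am, 1 space
    "Closer Kit  F#  140bpm.wav",     # 2 spaces, F#, 2 spaces (unwanted)
    "Closer Kit  EM 140bpm.wav",      # 2 spaces, EM, 1 space
    "Closer Kit Db  140bpm.wav",      # 1 space, Db, 2 spaces
    "Closer Kit  140bpm G.wav",       # 1 space, G, file extension
    "01 Brass Swell  Cm  60bpm.wav",  # 2 spaces, Cm, 2 spaces (unwanted)
]

# Keep only the filenames that the pattern accepts
matched = [name for name in FILENAMES if PATTERN.search(name)]

for name in matched:
    print(name)
```

On this sample, the two filenames where the key has two spaces on both sides are the ones left out.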

Or when a lookbehind assertion is also supported:

(?<!\s)(?!\s\s[A-Ga-g][b#mM]?\s\s)(?:\s?\s[A-Ga-g][b#mM]?\s?\s(?!\s)|\s[A-Ga-g][b#mM]?(?=\.[a-z]))
  • (?<!\s) Negative lookbehind, assert not a whitespace char to the left
  • (?!\s\s[A-Ga-g][b#mM]?\s\s) Negative lookahead, assert that what follows is not the allowed characters with 2 whitespace chars on both the left and the right
  • (?: Non capture group for the alternatives
    • \s?\s[A-Ga-g][b#mM]?\s?\s(?!\s) Match the allowed characters between the allowed whitespace char combinations and assert not a whitespace char to the right after it
    • | Or
    • \s[A-Ga-g][b#mM]?(?=\.[a-z]) Match a whitespace char, the allowed characters and assert a dot followed by a char a-z or a more specific pattern for the allowed extensions.
  • ) Close the non capture group

See a regex demo with \s and a regex demo with spaces
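Because the lookaround version matches only the key and its surrounding whitespace rather than the whole filename, it can also be used to pull the key out. A small Python sketch, assuming Python's re module (which supports lookbehind); KEY and find_key are hypothetical names for this example only:

```python
import re

# The lookaround pattern from above, split over two lines
KEY = re.compile(
    r"(?<!\s)(?!\s\s[A-Ga-g][b#mM]?\s\s)"
    r"(?:\s?\s[A-Ga-g][b#mM]?\s?\s(?!\s)|\s[A-Ga-g][b#mM]?(?=\.[a-z]))"
)

def find_key(filename):
    """Return the first matched key without its spaces, or None."""
    m = KEY.search(filename)
    return m.group(0).strip() if m else None

for name in [
    "Closer Kit Am 140bpm.wav",
    "Closer Kit  F#  140bpm.wav",
    "125bpm  F# Vocal Chops.wav",
    "Brass  140bpm C.wav",
]:
    print(repr(name), "->", find_key(name))
```

Filenames for which find_key returns None are the ones the pattern rejects, such as the key with two spaces on both sides.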

Upvotes: 1

user24714692

Reputation: 4919

It's not clear what the expected outputs are. It seems you want to output the entire line and also capture those F#, Am, etc.:

^[^\r\n]*\s+\b((?:[A-Ga-g])[Mmb#])*\s+[\w]+\.(?:wav|mp[3-4])[^\r\n]*$

Code:

import re

s = """
Closer Kit Am 140bpm.wav  (1spaceAm1space)
EXCLUDING Closer Kit  F#  140bpm.wav (2spacesF#2spaces)
EXCLUDING Closer Kit  F#  140bpm.mp4 (2spacesF#2spaces)
EXCLUDING Closer Kit  F#  140bpm.mp3 (2spacesF#2spaces)
Closer Kit  EM 140bpm.wav  (2spacesEM1space)
Closer Kit Db  140bpm.wav  (1spacesDb2spaces)
Closer Kit  140bpm G.wav  (spaceGfileextension)
01 Brass Swell  Cm  60bpm.wav  (2spacesCm2spaces)
01 Evolving Pads  Below the City C 60bpm-24b.wav  (1spaceC1space)
125bpm  F# Vocal Chops.wav  (2spacesF#1space)
Brass  140bpm C.wav  (spaceCfileextension)
"""

p = r"(?m)^([^\r\n]*\s+\b((?:[A-Ga-g])[Mmb#])*\s+[\w]+\.(?:wav|mp[3-4])[^\r\n]*)$"

find_patterns = re.findall(p, s)
print(find_patterns)

Prints

[('Closer Kit Am 140bpm.wav (1spaceAm1space)', 'Am'), ('EXCLUDING Closer Kit F# 140bpm.wav (2spacesF#2spaces)', 'F#'), ('EXCLUDING Closer Kit F# 140bpm.mp4 (2spacesF#2spaces)', 'F#'), ('EXCLUDING Closer Kit F# 140bpm.mp3 (2spacesF#2spaces)', 'F#'), ('Closer Kit EM 140bpm.wav (2spacesEM1space)', 'EM'), ('Closer Kit Db 140bpm.wav (1spacesDb2spaces)', 'Db'), ('01 Brass Swell Cm 60bpm.wav (2spacesCm2spaces)', 'Cm')]

Note

You can change this pattern, based on your requirement:

(?m)^([^\r\n]*\s+\b((?:[A-Ga-g])[Mmb#])*\s+[\w]+\.(?:wav|mp[3-4])[^\r\n]*)$
  • (?m): multiline flag can be removed if you're dealing with single lines.

  • [\w]+\.(?:wav|mp[3-4]) helps to narrow down your search space, can be removed.

  • \s+\b((?:[A-Ga-g])[Mmb#]) seems to pull out the expected outputs.


You can loosen or tighten the pattern if you want. A looser pattern is:

^([^\r\n]*\s+\b((?:[A-Ga-g])[Mmb#]?)*\s+[^\r\n\s]+\.(?:wav|mp[3-4])[^\r\n]*)$

Code

import re

s = """
Closer Kit Am 140bpm.wav  (1spaceAm1space)
EXCLUDING Closer Kit  F#  140bpm.wav (2spacesF#2spaces)
EXCLUDING Closer Kit  F#  140bpm.mp4 (2spacesF#2spaces)
EXCLUDING Closer Kit  F#  140bpm.mp3 (2spacesF#2spaces)
Closer Kit  EM 140bpm.wav  (2spacesEM1space)
Closer Kit Db  140bpm.wav  (1spacesDb2spaces)
Closer Kit  140bpm G.wav  (spaceGfileextension)
01 Brass Swell  Cm  60bpm.wav  (2spacesCm2spaces)
01 Evolving Pads  Below the City C 60bpm-24b.wav  (1spaceC1space)
125bpm  F# Vocal Chops.wav  (2spacesF#1space)
Brass  140bpm C.wav  (spaceCfileextension)
"""

p = r"(?m)^([^\r\n]*\s+\b((?:[A-Ga-g])[Mmb#]?)*\s+[^\r\n\s]+\.(?:wav|mp[3-4])[^\r\n]*)$"

find_patterns = re.findall(p, s)
print(find_patterns)

Prints

[('Closer Kit Am 140bpm.wav (1spaceAm1space)', 'Am'), ('EXCLUDING Closer Kit F# 140bpm.wav (2spacesF#2spaces)', 'F#'), ('EXCLUDING Closer Kit F# 140bpm.mp4 (2spacesF#2spaces)', 'F#'), ('EXCLUDING Closer Kit F# 140bpm.mp3 (2spacesF#2spaces)', 'F#'), ('Closer Kit EM 140bpm.wav (2spacesEM1space)', 'EM'), ('Closer Kit Db 140bpm.wav (1spacesDb2spaces)', 'Db'), ('01 Brass Swell Cm 60bpm.wav (2spacesCm2spaces)', 'Cm'), ('01 Evolving Pads Below the City C 60bpm-24b.wav (1spaceC1space)', 'C')]

Note

  • [Mmb#]? makes this one character optional.
  • [^\r\n\s]+ is a looser pattern for the audio file name.
  • (?:wav|mp[3-4]) is where you can add any other file extensions that you may have.

Upvotes: 1
