patski

Reputation: 329

Searching for specific files using regex

I'm trying to capture two specific words or character sequences from the filenames in a folder. The regex I have so far gives the correct output on https://regex101.com/ but not in the script I'm working with.

These are the kinds of filenames I'm working with:

Bjørn Stallaresvei s 10013.pdf

or

Københavngaten 1 L. 8.pdf

And this is the regex I've come up with so far:

((?<=\s)[a-zA-Z\.]+(?=[\s0-9]+\.pdf))|((?<=\s)[0-9]+(?=.pdf))

In the first filename I'm trying to capture 's' and '10013', where 's' is the identifier and 10013 is the ID.

Likewise, in the second filename 'L.' is the identifier and 8 is the ID.

Here is a minimal example that shows the problem:

import re

string_1 = "Stallaresvei s 10013.pdf"

regexp = r"(((?<=\s)[a-zA-Z\.]+(?=[\s0-9]+\.pdf))|((?<=\s)[0-9]+(?=.pdf)))"
m = re.search(regexp, string_1)

print(m)

The output shows only one match:

<_sre.SRE_Match object; span=(13, 14), match='s'>

Upvotes: 1

Views: 63

Answers (1)

Wiktor Stribiżew

Reputation: 626738

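re.search returns only the first match in the string, which is why you see just 's'. You may remove the capturing parentheses and use your regex with re.findall, which returns all non-overlapping matches (note that the dot in the second lookahead is also escaped here):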

r'(?<=\s)[a-zA-Z.]+(?=[\s0-9]+\.pdf)|(?<=\s)[0-9]+(?=\.pdf)'

See the online Python 3 demo:

import re
string_1 = "Stallaresvei s 10013.pdf"
regexp = r"(?<=\s)[a-zA-Z.]+(?=[\s0-9]+\.pdf)|(?<=\s)[0-9]+(?=\.pdf)"
m = re.findall(regexp, string_1)
print(m) # => ['s', '10013']
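
To make the difference concrete, here is a minimal sketch that reuses the sample string from the question. The variable names (grouped, ungrouped, first, with_groups, without_groups) are only illustrative. It shows that re.search stops at the first match, and that re.findall returns tuples of groups rather than the matched text when the pattern contains capturing groups:

```python
import re

string_1 = "Stallaresvei s 10013.pdf"

# The question's pattern, with its three capturing groups left in place.
grouped = r"(((?<=\s)[a-zA-Z\.]+(?=[\s0-9]+\.pdf))|((?<=\s)[0-9]+(?=.pdf)))"
# The same alternation without capturing groups (dot before "pdf" escaped).
ungrouped = r"(?<=\s)[a-zA-Z.]+(?=[\s0-9]+\.pdf)|(?<=\s)[0-9]+(?=\.pdf)"

# re.search returns only the first match in the string.
first = re.search(ungrouped, string_1)
print(first.group())  # s

# With capturing groups, re.findall returns one tuple of groups per match;
# groups that did not take part in a match show up as empty strings.
with_groups = re.findall(grouped, string_1)
print(with_groups)  # [('s', 's', ''), ('10013', '', '10013')]

# Without capturing groups, it returns the matched substrings themselves.
without_groups = re.findall(ungrouped, string_1)
print(without_groups)  # ['s', '10013']
```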

Another way is to rewrite the pattern so that it captures the identifier and the ID into two separate groups (see another demo):

import re
string_1 = "Stallaresvei s 10013.pdf"
regexp = r"\s([a-zA-Z.]+)\s+([0-9]+)\.pdf"
m = re.search(regexp, string_1)
if m:
    print([m.group(1), m.group(2)])

Here,

  • \s - matches a whitespace
  • ([a-zA-Z.]+) - Capturing group 1 matches 1+ ASCII letters or .
  • \s+ - 1+ whitespaces
  • ([0-9]+) - Capturing group 2 matches 1+ ASCII digits
  • \.pdf - just matches .pdf substring.
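
As a quick check, the two-group pattern can be run against both filenames from the question. This is only a sketch: the helper name split_identifier and the constant PATTERN are hypothetical, and because [a-zA-Z.] is ASCII-only it assumes the identifier itself never contains letters such as ø.

```python
import re

# Two-group pattern from the answer, compiled once for reuse.
PATTERN = re.compile(r"\s([a-zA-Z.]+)\s+([0-9]+)\.pdf")


def split_identifier(filename):
    """Return (identifier, id) from a filename, or None if nothing matches."""
    m = PATTERN.search(filename)
    return (m.group(1), m.group(2)) if m else None


for name in ["Bjørn Stallaresvei s 10013.pdf", "Københavngaten 1 L. 8.pdf"]:
    print(name, "->", split_identifier(name))
```

In the second filename the lone "1" is skipped because the first group only accepts letters and dots, so the search moves on until it finds "L." followed by "8.pdf".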

Upvotes: 2
