Reputation: 329
I'm trying to capture two specific words or character sequences from filenames in a folder. My regex gives the correct output on https://regex101.com/ but not in the script I'm working on.
These are the kinds of filenames I'm working with:
Bjørn Stallaresvei s 10013.pdf
or
Københavngaten 1 L. 8.pdf
And this is the regex I've come up with so far:
((?<=\s)[a-zA-Z\.]+(?=[\s0-9]+\.pdf))|((?<=\s)[0-9]+(?=.pdf))
In the first filename I want to capture 's' and '10013', where 's' is the identifier and 10013 is the ID.
The same goes for the second filename: 'L.' is the identifier and 8 is the ID.
Here is a minimal example:
import re
string_1 = "Stallaresvei s 10013.pdf"
regexp = r"(((?<=\s)[a-zA-Z\.]+(?=[\s0-9]+\.pdf))|((?<=\s)[0-9]+(?=.pdf)))"
m = re.search(regexp, string_1)
print(m)
The output shows only one match:
<_sre.SRE_Match object; span=(13, 14), match='s'>
Upvotes: 1
Views: 63
Reputation: 626738
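re.search stops at the first match, which is why you only see 's'. You may remove the capturing parentheses and use your regex with re.findall, which returns all non-overlapping matches: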
r'(?<=\s)[a-zA-Z.]+(?=[\s0-9]+\.pdf)|(?<=\s)[0-9]+(?=\.pdf)'
See the online Python 3 demo:
import re
string_1 = "Stallaresvei s 10013.pdf"
regexp = r"(?<=\s)[a-zA-Z.]+(?=[\s0-9]+\.pdf)|(?<=\s)[0-9]+(?=\.pdf)"
m = re.findall(regexp, string_1)
print(m) # => ['s', '10013']
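To illustrate why the capturing parentheses are worth removing, here is a small sketch comparing re.search and re.findall on the same filename. The variable names (`plain`, `grouped`, and so on) are only illustrative and do not come from the original post.

```python
import re

filename = "Stallaresvei s 10013.pdf"

# Pattern from the answer above: two alternatives, no capturing groups.
plain = r"(?<=\s)[a-zA-Z.]+(?=[\s0-9]+\.pdf)|(?<=\s)[0-9]+(?=\.pdf)"
# The same two alternatives, each wrapped in its own capturing group.
grouped = r"((?<=\s)[a-zA-Z.]+(?=[\s0-9]+\.pdf))|((?<=\s)[0-9]+(?=\.pdf))"

# re.search returns only the first match in the string.
first_only = re.search(plain, filename).group(0)
# re.findall returns every non-overlapping match as a list of strings.
all_matches = re.findall(plain, filename)
# With two capturing groups, re.findall returns one tuple per match,
# and the group that did not take part in a match is an empty string.
group_tuples = re.findall(grouped, filename)

print(first_only)    # s
print(all_matches)   # ['s', '10013']
print(group_tuples)  # [('s', ''), ('', '10013')]
```

With the groups left in, each match comes back as a tuple padded with empty strings, so the group-free pattern gives the cleaner result.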
Another way is to rewrite the pattern and capture these bits into 2 groups, see another demo:
import re
string_1 = "Stallaresvei s 10013.pdf"
regexp = r"\s([a-zA-Z.]+)\s+([0-9]+)\.pdf"
m = re.search(regexp, string_1)
if m:
    print([m.group(1), m.group(2)])
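As a rough check, the same two-group pattern can be run against both filenames from the question, plus one name that should not match. The helper name `split_identifier` and the third filename are made up for this sketch.

```python
import re

# Two-group pattern from the answer above, compiled once for reuse.
ID_PATTERN = re.compile(r"\s([a-zA-Z.]+)\s+([0-9]+)\.pdf")


def split_identifier(filename):
    """Return (identifier, id) for a filename, or None if nothing matches.

    `split_identifier` is a hypothetical helper, not part of the original post.
    """
    m = ID_PATTERN.search(filename)
    return (m.group(1), m.group(2)) if m else None


results = [
    split_identifier(name)
    for name in (
        "Bjørn Stallaresvei s 10013.pdf",
        "Københavngaten 1 L. 8.pdf",
        "no identifier here.pdf",  # made-up name without an ID
    )
]
print(results)  # [('s', '10013'), ('L.', '8'), None]
```

The pattern is not anchored to the end of the string, so if filenames could contain ".pdf" in the middle, appending `$` would be a reasonable tightening.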
Here,
\s - matches a whitespace
([a-zA-Z.]+) - Capturing group 1 matches 1+ ASCII letters or .
\s+ - 1+ whitespaces
([0-9]+) - Capturing group 2 matches 1+ ASCII digits
\.pdf - just matches the .pdf substring.
Upvotes: 2