Akira

Reputation: 2870

How to modify this regular expression to extract strings with this pattern?

I'm trying to extract the strings that sit between a quotation mark " and .pdf, keeping the .pdf extension. For example, "../matlab/license_admin.pdf" abc "vfv" -> ../matlab/license_admin.pdf and "license_admin.pdf" xyz -> license_admin.pdf. I tried the following code:

import re

base = '"../matlab/license_admin.pdf" abc "vfv"'
base1 = '"license_admin.pdf" xyz'

result = re.findall(r'\b(\S+\.pdf)\b', base)
result1 = re.findall(r'\b(\S+\.pdf)\b', base1) 

print(result)
print(result1)

but it only works with my second example. In the first one, the code drops the leading ../ and returns matlab/license_admin.pdf.

Could you please help me modify the regular expression \b(\S+\.pdf)\b to achieve my goal? Thank you so much!

Upvotes: 1

Views: 41

Answers (1)

Wiktor Stribiżew

Reputation: 626926

Use

import re

bases = ['"../matlab/license_admin.pdf" abc "vfv"', '"license_admin.pdf" xyz']
for base in bases:
    m = re.search(r'"(.*?\.pdf)', base)
    if m:
        print(m.group(1))

Output:

../matlab/license_admin.pdf
license_admin.pdf

The "(.*?\.pdf) pattern matches a ", then captures into Group 1 zero or more characters other than line break characters, as few as possible, followed by .pdf. re.search returns only the first match, and m.group(1) accesses the Group 1 value.
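
If a line can contain several quoted .pdf paths, a re.findall variant may be more convenient. The sketch below is only an illustration, not part of the original question: the "docs/guide.pdf" entry is a made-up extra value, and the stricter "([^"]*\.pdf)" pattern assumes the path never contains a double quote and that .pdf sits right before the closing quote. It also shows why the original \b(\S+\.pdf)\b loses the ../ prefix: \b needs a word character on one side, and neither . nor / is one, so the match can only start at the m of matlab.

```python
import re

# The third quoted value is a hypothetical addition for illustration.
text = '"../matlab/license_admin.pdf" abc "vfv" "docs/guide.pdf"'

# Original pattern: the leading \b cannot match between '"' and '.',
# so the first match starts at "matlab" and the "../" prefix is lost.
word_boundary = re.findall(r'\b(\S+\.pdf)\b', text)

# Quote-anchored variant: [^"]* cannot run past a closing quote, and the
# trailing '"' requires ".pdf" to end the quoted string.
quoted = re.findall(r'"([^"]*\.pdf)"', text)

print(word_boundary)
print(quoted)
```

Because [^"]* stops at the next quote, a quoted string without .pdf, such as "vfv", is skipped rather than merged into a later match.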

Upvotes: 1
