Reputation: 46405
Folks,
I am not an expert in regular expressions. I've searched Google for my problem but haven't found a solution. If anybody knows of another SO post with the same question, please feel free to point me to it.
Question:
I have text files in which much of the content is HTML tags. These files may contain PDF filenames, as shown below. I just want to extract all such filenames with the .pdf extension. Note that these filenames may appear anywhere in the text, not only after the <FILENAME> prefix.
Example Text:
Example 1: <FILENAME>any_valid_characters_filename.pdf
Example 2: hello this is a good file abc-def_xyz-1.pdf
Note that <FILENAME> is a valid (HTML) tag in my text document. I want to extract the filenames any_valid_characters_filename.pdf and abc-def_xyz-1.pdf. The valid characters for a PDF filename are a-z, A-Z, _, -, ., and 0-9, but not special characters like <, >, etc.
What I have tried so far:
r'\b(\w+\.pdf)\b'
r'^\\(.+\\)*(.+)\.(.+)\.pdf$'
r'[^A-Za-z0-9_\.pdf]'
r'[\\/:"*?<>|]+\.pdf'
and a bunch of other regular expressions, but without success.
Any help would be appreciated. Thank you.
Upvotes: 1
Views: 3585
Reputation: 5958
Can this work?
\b[^\s<>]*?\.pdf\b
It works for your examples: https://regexr.com/43b8q
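For anyone who wants to check this in Python, here is a minimal sketch using the standard re module, with the dot before pdf escaped so that it only matches a literal period. The sample string is just the two examples from the question, and the variable names are my own.

```python
import re

# First pattern from this answer, with the dot before "pdf" escaped.
pattern = re.compile(r"\b[^\s<>]*?\.pdf\b")

# The two example lines from the question.
sample = (
    "Example 1: <FILENAME>any_valid_characters_filename.pdf\n"
    "Example 2: hello this is a good file abc-def_xyz-1.pdf\n"
)

# findall returns the full match text because the pattern has no groups.
found = pattern.findall(sample)
print(found)
```

Note that [^\s<>] accepts anything except whitespace and angle brackets, so characters such as : or / would also be picked up as part of a name.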
Update for your new requirement that no space exists between <FILENAME> and whatever.pdf:
Use: \b(?<![<>][\s]|\w)[\w-]*?\.pdf\b
example: https://regex101.com/r/O3kpQ4/2/
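If you are using Python, note that, as far as I know, the re module rejects a lookbehind whose alternatives have different widths, so the pattern above will not compile there as written. A rough equivalent is to split it into two separate negative lookbehinds, since "not preceded by A or B" is the same as "not preceded by A" and "not preceded by B". The function name below is hypothetical, and the dot before pdf is escaped.

```python
import re

# Python-friendly variant: the single lookbehind (?<![<>][\s]|\w) is split
# into two fixed-width lookbehinds that must both hold.
pattern = re.compile(r"\b(?<![<>]\s)(?<!\w)[\w-]*?\.pdf\b")

def find_pdfs_no_space(text):
    """Return .pdf names that are not separated from a tag by whitespace."""
    return pattern.findall(text)

print(find_pdfs_no_space("<FILENAME>whatever.pdf"))    # tag directly before the name
print(find_pdfs_no_space("<FILENAME> whatever.pdf"))   # space after the tag
print(find_pdfs_no_space("a good file abc-def_xyz-1.pdf"))
```

Because the character class here is [\w-], dots inside the name are not included, so a name like a.b.pdf would only yield b.pdf.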
Upvotes: 1
Reputation: 1940
I think the following expression covers everything you mentioned:
r"([\w\d\-.]+\.pdf)"
It matches any run of word characters, digits, hyphens, and dots that is followed by .pdf. (The \d is redundant, since \w already covers digits and the underscore.)
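Here is a short sketch of how that expression could be used with re.findall. The helper name extract_pdf_names and the sample string are only for illustration.

```python
import re

# The expression from this answer, compiled once for reuse.
PDF_NAME = re.compile(r"([\w\d\-.]+\.pdf)")

def extract_pdf_names(text):
    """Return every .pdf filename found in text, in order of appearance."""
    # With a single capture group, findall returns the group contents.
    return PDF_NAME.findall(text)

sample = "<FILENAME>any_valid_characters_filename.pdf and <b>abc-def_xyz-1.pdf</b>"
print(extract_pdf_names(sample))
```

One caveat: the expression has no trailing boundary, so a string like notes.pdfx would still give notes.pdf. Appending \b after pdf should avoid that if it matters for your data.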
Upvotes: 3