Saurabh Gokhale

Reputation: 46405

Extract filename from text file using regex

Folks,

I am not an expert in regular expressions. I have searched Google for my problem but haven't found a solution. If anybody knows of another SO post with the same question, please feel free to point me to it.

Question:

I have text files in which most of the content is HTML tags. These files may contain PDF filenames, as shown below. I want to extract every filename that ends with the .pdf extension. Note that these filenames may appear anywhere in the text, not only after the <FILENAME> prefix.

Example Text:

Example 1: <FILENAME>any_valid_characters_filename.pdf
Example 2: hello this is a good file abc-def_xyz-1.pdf

Note that <FILENAME> is a valid (HTML) tag in my text document. I want to extract the filenames any_valid_characters_filename.pdf and abc-def_xyz-1.pdf. The valid characters for a PDF filename are a-z, A-Z, _, -, ., and 0-9, but not special characters such as < or >.

What I have tried so far:

r'\b(\w+\.pdf)\b'
r'^\\(.+\\)*(.+)\.(.+)\.pdf$'
r'[^A-Za-z0-9_\.pdf]' 
r'[\\/:"*?<>|]+\.pdf'

and a bunch of other regular expressions, but none of them worked.

Any help would be appreciated. Thank you.

Upvotes: 1

Views: 3585

Answers (2)

Rocky Li

Reputation: 5958

Does this work?

\b[^\s<>]*?\.pdf\b

It works for your examples: https://regexr.com/43b8q

Update for your new requirement that there be no space between <FILENAME> and whatever.pdf:

Use: \b(?<![<>][\s]|\w)[\w-]*?\.pdf\b

example: https://regex101.com/r/O3kpQ4/2/
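If you run this from Python, note that the `re` module only accepts fixed-width lookbehinds, so the alternation inside `(?<!...)` is rejected there. A minimal sketch of a port, assuming the intent is "not preceded by a tag bracket plus whitespace, and not preceded by a word character": the single lookbehind is split into two, and the dot before pdf is escaped. The names `TAG_ADJACENT_PDF` and `find_pdfs` are made up for illustration.

```python
import re

# (?<!A|B) is rewritten as (?<!A)(?<!B); both forms reject the same positions,
# but each lookbehind now has a fixed width, which Python's re requires.
TAG_ADJACENT_PDF = re.compile(r"\b(?<![<>]\s)(?<!\w)[\w-]*?\.pdf\b")

def find_pdfs(text):
    """Return PDF names that are not separated from a tag bracket by whitespace."""
    return TAG_ADJACENT_PDF.findall(text)

print(find_pdfs("<FILENAME>abc.pdf"))   # name directly after the tag
print(find_pdfs("<FILENAME> abc.pdf"))  # space after the tag, so no match
```

Since `[\w-]` has no dot in it, this variant does not allow dots inside the name before the extension.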

Upvotes: 1

Aurora Wang

Reputation: 1940

I think the following expression covers everything you mentioned:

r"([\w\d\-.]+\.pdf)"

It matches any run of word characters, digits, hyphens, and dots followed by .pdf. Since \w already covers digits, the \d is redundant but harmless.
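For reference, here is a small sketch of how this expression could be used with `re.findall` on the two example lines from the question. The helper `extract_pdf_names` and the stricter `PDF_NAME_STRICT` variant are my own additions, not part of the answer: the trailing `\b` is an assumed refinement that stops something like report.pdfx from being reported as report.pdf.

```python
import re

# Pattern from the answer; \d is dropped because \w already covers digits.
PDF_NAME = re.compile(r"[\w\-.]+\.pdf")

# Assumed stricter variant: \b requires the name to end right after "pdf".
PDF_NAME_STRICT = re.compile(r"[\w\-.]+\.pdf\b")

sample = (
    "Example 1: <FILENAME>any_valid_characters_filename.pdf\n"
    "Example 2: hello this is a good file abc-def_xyz-1.pdf"
)

def extract_pdf_names(text, pattern=PDF_NAME):
    """Return every PDF filename found in text, in order of appearance."""
    return pattern.findall(text)

print(extract_pdf_names(sample))
# ['any_valid_characters_filename.pdf', 'abc-def_xyz-1.pdf']
```

Because `<` and `>` are not in the character class, the `<FILENAME>` tag is skipped without any special handling.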

Upvotes: 3
