P A N

Reputation: 5922

Grab part of filename with Python

Newbie here.

I've only been working with Python (and coding in general) for a few days, but I want to create a script that grabs the part of each filename that matches a certain pattern and writes it to a text file.

So in my case, let's say I have four .pdf files like this:

aaa_ID_8423.pdf
bbbb_ID_8852.pdf
ccccc_ID_7413.pdf
dddddd_ID_4421.pdf

(Note that the filenames are of variable length.)

I want the script to go through these filenames and grab the string after "ID_" and before the filename extension.

Can you point me toward Python modules, and possibly guides, that could assist me?

Upvotes: 3

Views: 24956

Answers (5)

Paul Rigor

Reputation: 1026

Here's a simple solution using the re module as mentioned in other answers.

# Libraries
import re

# Example filenames. Use glob.glob("*.pdf") as described below to grab your pdf filenames
file_list = ['name_ID_123.pdf', 'name2_ID_456.pdf']

for fname in file_list:
    # Raw string, with the dot escaped so it matches a literal "."
    res = re.findall(r"ID_(\d+)\.pdf", fname)
    if not res:
        continue
    print(res[0])  # You can append the result to a list

And below should be your output. You should be able to adapt this to other patterns.

# Output
123
456
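
Since the question also asks for the results to end up in a text file, here is a minimal sketch of that last step. The helper names (extract_ids, write_ids), the output filename ids.txt, and the sample file list are assumptions made up for illustration; the temporary directory is only there so the example does not touch real files.

```python
import os
import re
import tempfile

# Compiled once; the dot is escaped and the pattern is anchored at the end.
ID_PATTERN = re.compile(r"ID_(\d+)\.pdf$")

def extract_ids(filenames):
    """Return the digits after 'ID_' for every filename that matches."""
    ids = []
    for fname in filenames:
        match = ID_PATTERN.search(fname)
        if match:
            ids.append(match.group(1))
    return ids

def write_ids(ids, out_path):
    """Write one ID per line to out_path."""
    with open(out_path, "w") as handle:
        for item in ids:
            handle.write(item + "\n")

# Hypothetical file list; glob.glob("*.pdf") would supply real names.
file_list = ["aaa_ID_8423.pdf", "bbbb_ID_8852.pdf", "notes.txt"]
ids = extract_ids(file_list)

# A temporary directory keeps the example from touching real files.
out_path = os.path.join(tempfile.mkdtemp(), "ids.txt")
write_ids(ids, out_path)
```

In a real script you would replace out_path with whatever text file you want to produce.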

Good luck!

Upvotes: 8

suripoori

Reputation: 331

You can use the os module in Python and call listdir to get a list of the filenames present in a given path, like so:

import os
filenames = os.listdir(path)

Now you can iterate over the filenames list and look for the pattern which you need using regular expressions:

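import re
for filename in filenames:
    m = re.search(r'(?<=ID_)\w+', filename)
    if m:  # re.search returns None when there is no match
        print(m.group(0))

The above snippet finds the run of word characters following ID_ and prints it. Because \w does not match a period, the match stops before the extension, so for your example it prints 4421, 8423, etc. Note that \w also matches letters and underscores, so use \d+ instead if the ID is always numeric.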
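
As a hedged sketch, here are two ways to capture only the ID with no extension attached: a regex that uses lookarounds to require both the ID_ prefix and the .pdf suffix, and a non-regex variant built on os.path.splitext and str.partition. Both function names are made up for illustration.

```python
import os
import re

def id_with_lookarounds(filename):
    """Match digits that sit between 'ID_' and a trailing '.pdf'."""
    m = re.search(r"(?<=ID_)\d+(?=\.pdf$)", filename)
    return m.group(0) if m else None

def id_with_splitext(filename):
    """Drop the extension first, then take whatever follows 'ID_'."""
    stem, _ext = os.path.splitext(filename)
    _before, sep, after = stem.partition("ID_")
    return after if sep else None

print(id_with_lookarounds("dddddd_ID_4421.pdf"))
print(id_with_splitext("dddddd_ID_4421.pdf"))
```

The second version accepts any extension and any text after ID_, so it is looser than the regex; which one fits depends on how uniform the filenames are.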

Upvotes: 2

KCzar

Reputation: 1044

If the numbers are of variable length, you'll want the regex module, "re":

import re

# create and compile a regex pattern
pattern = re.compile(r"_([0-9]+)\.[^\.]+$")

pattern.search("abc_ID_8423.pdf").group(1)
Out[23]: '8423'

Regex is generally used to match variable strings. The regex I just wrote says:

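Find an underscore ("_"), followed by one or more digits ("[0-9]+"), followed by a period and the extension, i.e. one or more non-period characters running to the end of the string ("\.[^\.]+$"). Anchoring at the end means only the final extension can match.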
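
If it helps to see that breakdown next to the pattern itself, the same regex can be written with re.VERBOSE, which ignores whitespace in the pattern and allows comments. This is only a restatement of the pattern above, not a different technique.

```python
import re

pattern = re.compile(
    r"""
    _            # a literal underscore
    ([0-9]+)     # group 1: one or more digits (the ID)
    \.           # a literal period
    [^.]+        # the extension: one or more non-period characters
    $            # end of the string
    """,
    re.VERBOSE,
)

match = pattern.search("abc_ID_8423.pdf")
file_id = match.group(1) if match else None
print(file_id)
```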

Upvotes: 5

twalberg

Reputation: 62369

Here's another alternative, using re.split(), which is probably closer to the spirit of exactly what you're trying to do (although solutions with re.match() and re.search(), among others, are just as valid, useful, and instructive):

>>> import re
>>> re.split("[_.]", "dddddd_ID_4421.pdf")[-2]
'4421'
>>> 
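
To make it clearer why the index is -2, here is a small sketch that shows the whole list re.split() produces. It assumes, as in the question's examples, that the ID is always the last underscore-separated field before a single extension; a name with extra dots or trailing fields would need a different index.

```python
import re

# Split on every underscore or period.
parts = re.split("[_.]", "dddddd_ID_4421.pdf")
# parts is ['dddddd', 'ID', '4421', 'pdf'], so the ID is second from the end.
file_id = parts[-2]
print(file_id)
```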

Upvotes: 6

Clarus

Reputation: 2338

You probably want to use glob, which is a Python module for file globbing. From the Python documentation, the usage is as follows:

>>> import glob
>>> glob.glob('./[0-9].*')
['./1.gif', './2.txt']
>>> glob.glob('*.gif')
['1.gif', 'card.gif']
>>> glob.glob('?.gif')
['1.gif']
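
Applied to the filenames in the question, glob could be combined with a regex roughly like this. It is a sketch only: the function name ids_in_directory is made up, and the demo creates empty placeholder files in a temporary directory rather than reading real PDFs.

```python
import glob
import os
import re
import tempfile

def ids_in_directory(directory):
    """Return sorted IDs found in '*ID_<digits>.pdf' names inside directory."""
    found = []
    for path in glob.glob(os.path.join(directory, "*.pdf")):
        match = re.search(r"ID_(\d+)\.pdf$", os.path.basename(path))
        if match:
            found.append(match.group(1))
    return sorted(found)

# Demo on a throwaway directory with empty placeholder files.
demo_dir = tempfile.mkdtemp()
for name in ["aaa_ID_8423.pdf", "bbbb_ID_8852.pdf", "readme.txt"]:
    open(os.path.join(demo_dir, name), "w").close()

result = ids_in_directory(demo_dir)
print(result)
```

glob only narrows the list to .pdf files; the regex is still what pulls the ID out of each name.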

Upvotes: 0
