Reputation: 517
I have thousands of PDF files in my computers which names are from a0001.pdf
to a3621.pdf
, and inside of each there is a title; e.g. "aluminum carbonate" for a0001.pdf
, "aluminum nitrate" in a0002.pdf
, etc., which I'd like to extract to rename my files.
I use this program to rename a file:
path=r"C:\Users\YANN\Desktop\..."
old='string 1'
new='string 2'
def rename(path,old,new):
for f in os.listdir(path):
os.rename(os.path.join(path, f), os.path.join(path, f.replace(old, new)))
rename(path,old,new)
I would like to know if there is/are solution(s) to extract the title embedded in the PDF file to rename the file?
Upvotes: 19
Views: 34010
Reputation: 1842
has some issues with defined solutions, here is my recipe
from pathlib import Path
from pdfrw import PdfReader
import re
path_to_files = Path(r"C:\Users\Malac\Desktop\articles\Downloaded")
# Exclude windows forbidden chars for name <>:"/\|?*
# Newlines \n and backslashes will be removed anyway
exclude_chars = '[<>:"/|?*]'
for i in path_to_files.glob("*.pdf"):
try:
title = PdfReader(i).Info.Title
except Exception:
# print(f"File {i} not renamed.")
pass
# Some names was just ()
if not title:
continue
# For some reason, titles are returned in brackets - remove brackets if around titles
if title.startswith("("):
title = title[1:]
if title.endswith(")"):
title = title[:-1]
title = re.sub(exclude_chars, "", title)
title = re.sub(r"\\", "", title)
title = re.sub("\n", "", title)
# Some names are just ()
if not title:
continue
try:
final_path = (path_to_files / title).with_suffix(".pdf")
if final_path.exists():
continue
i.rename(final_path)
except Exception:
# print(f"Name {i} incorrect.")
pass
Upvotes: 0
Reputation: 499
Building on Ciprian Tomoiagă's suggestion of using pdfrw, I've uploaded a script which also:
As TextGeek mentioned, unfortunately not all files have the title metadata, so some files won't be renamed.
Repository: https://github.com/favict/pdf_renamefy
After downloading the files, install the dependencies by running pip:
$pip install -r requirements.txt
and then to run the script:
$python -m renamefy <directory> <filename maximum length>
...in which directory is the full path you would like to look for PDF files, and filename maximum length is the length at which the filename will be truncated in case the title is too long or was incorrectly set in the file.
Both parameters are optional. If none is provided, the directory is set to the current directory and filename maximum length is set to 120 characters.
Example:
$python -m renamefy C:\Users\John\Downloads 120
I used it on Windows, but it should work on Linux too.
Feel free to copy, fork and edit as you see fit.
Upvotes: 0
Reputation: 2859
This cannot be solved with plain Python. You will need an external package such as pdfrw
, which allows you to read PDF metadata. The installation is quite easy using the standard Python package manager pip
.
On Windows, first make sure you have a recent version of pip
using the shell command:
python -m pip install -U pip
On Linux:
pip install -U pip
On both platforms, install then the pdfrw
package using
pip install pdfrw
I combined the ansatzes of zeebonk and user2125722 to write something very compact and readable which is close to your original code:
import os
from pdfrw import PdfReader
path = r'C:\Users\YANN\Desktop'
def renameFileToPDFTitle(path, fileName):
fullName = os.path.join(path, fileName)
# Extract pdf title from pdf file
newName = PdfReader(fullName).Info.Title
# Remove surrounding brackets that some pdf titles have
newName = newName.strip('()') + '.pdf'
newFullName = os.path.join(path, newName)
os.rename(fullName, newFullName)
for fileName in os.listdir(path):
# Rename only pdf files
fullName = os.path.join(path, fileName)
if (not os.path.isfile(fullName) or fileName[-4:] != '.pdf'):
continue
renameFileToPDFTitle(path, fileName)
Upvotes: 22
Reputation: 67
Once you have installed it, open the app and go to the Download folder. You will see your downloaded files there. Just long press the file you wish to rename and the Rename option will appear at the bottom.
Upvotes: -6
Reputation: 1328
You can use pdfminer library to parse the PDFs. The info property contains the Title of the PDF. Here is what a sample info looks like :
[{'CreationDate': "D:20170110095753+05'30'", 'Producer': 'PDF-XChange Printer `V6 (6.0 build 317.1) [Windows 10 Enterprise x64 (Build 10586)]', 'Creator': 'PDF-XChange Office Addin', 'Title': 'Python Basics'}]`
Then we can extract the Title using the properties of a dictionary. Here is the whole code (including iterating all the files and renaming them):
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
import os
start = "0000"
def convert(var):
while len(var) < 4:
var = "0" + var
return var
for i in range(1,3622):
var = str(i)
var = convert(var)
file_name = "a" + var + ".pdf"
fp = open(file_name, 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
fp.close()
metadata = doc.info # The "Info" metadata
print metadata
metadata = metadata[0]
for x in metadata:
if x == "Title":
new_name = metadata[x] + ".pdf"
os.rename(file_name,new_name)
Upvotes: 4