Reputation: 39
My script searches for all the PDF files in a specific directory, extracts an id from each PDF, and organises the PDFs into folders named after that id. For example, I have:
C:\Users\user\Downloads\aa\1.pdf, with id = 3,
C:\Users\user\Downloads\aa\2.pdf, with id = 5,
C:\Users\user\Downloads\aa\3.pdf, with id = 10
and I want to organize them like this:
C:\Users\user\Downloads\aa\3\1.pdf
C:\Users\user\Downloads\aa\5\2.pdf
C:\Users\user\Downloads\aa\10\3.pdf
The following script does the job, but it raises the following error, I think only for the last file:
Traceback (most recent call last):
  File "C:\Users\user\Downloads\aa\project.py", line 74, in <module>
    os.rename(source, dest)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\user\Downloads\aa\3.pdf' -> 'C:\Users\user\Downloads\aa\10\3.pdf'
import PyPDF2
import re
import glob, os
import shutil
import sys
from collections import Counter
from collections import defaultdict
class DictList(dict):
    def __setitem__(self, key, value):
        try:
            self[key].append(value)
        except KeyError:
            super(DictList, self).__setitem__(key, value)
        except AttributeError:
            super(DictList, self).__setitem__(key, [self[key], value])
files = glob.glob(r'C:\Users\user\Downloads\aa\*.pdf')
gesi_id=[]
dic = DictList()
c = 0
for i in files:
    pdfFileObj = open(files[c],'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    num_pages = pdfReader.numPages
    count = 0
    text = ""
    while count < num_pages:
        pageObj = pdfReader.getPage(count)
        count +=1
        text += pageObj.extractText()
    keywords = []
    keywords = re.findall(r'[0-9]\w+', text);
    gesi_id.append(keywords[0])
    key = str(gesi_id[c])
    value = files[c]
    dic[key] = value
    c=c+1
gesi_id_unique = []
for x in gesi_id:
    if x not in gesi_id_unique:
        gesi_id_unique.append(x)
c=0
if not gesi_id_unique:
    sys.exit()
for i in gesi_id_unique:
    dirName = os.path.join('C:\\Users\\user\\Downloads\\aa\\',
                           str(gesi_id_unique[c]))
    c=c+1
    if not os.path.exists(dirName):
        os.mkdir(dirName)
keys = list(dic)
values = list(dic.values())
k = 0
v = 0
for i in keys:
    for val in values[k]:
        source = val
        dest = os.path.join('C:\\Users\\user\\Downloads\\aa\\',
                            gesi_id_unique[k], val.rsplit('\\', 1)[-1])
        print(gesi_id_unique[k])
        print(val.rsplit('\\', 1)[-1])
        print("Source: %s" % source)
        print("Dest: %s" % dest)
        os.rename(source, dest)
    k = k+1
Upvotes: 1
Views: 3438
Reputation: 21
First of all, I think some indentation was lost when the code was copied and pasted. In fact, there is a part that should be:
for i in files:
    pdfFileObj = open(files[c],'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    num_pages = pdfReader.numPages
    count = 0
    text = ""
    while count < num_pages:
        pageObj = pdfReader.getPage(count)
        count +=1
        text += pageObj.extractText()
    keywords = []
    keywords = re.findall(r'[0-9]\w+', text);
    gesi_id.append(keywords[0])
    key = str(gesi_id[c])
    value = files[c]
    dic[key] = value
    c=c+1
To solve the problem, you just need to close the file that is currently open. On Windows, a file that is still open in your own process cannot be renamed, and the last file read is never closed, which would explain why only the last file fails. Add pdfFileObj.close() at the end of the loop body, so that it becomes:
for i in files:
    pdfFileObj = open(files[c],'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    num_pages = pdfReader.numPages
    count = 0
    text = ""
    while count < num_pages:
        pageObj = pdfReader.getPage(count)
        count +=1
        text += pageObj.extractText()
    keywords = []
    keywords = re.findall(r'[0-9]\w+', text);
    gesi_id.append(keywords[0])
    key = str(gesi_id[c])
    value = files[c]
    dic[key] = value
    c=c+1
    pdfFileObj.close()
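If you would rather not call close() by hand, a with block is another way to get the same effect: the handle is closed as soon as the block ends, even if the extraction raises. The following is only a minimal sketch of that pattern, not your script. The names move_by_id, read_id and demo are made up for illustration, and read_id is assumed to be any function that takes the open file object and returns the id as a string (in your case it would wrap the PyPDF2 reading code). The demo uses small text files as stand-ins for real PDFs so that it runs without PyPDF2.

```python
import os
import tempfile
from pathlib import Path


def move_by_id(folder, read_id):
    """Move each *.pdf in `folder` into a subfolder named after its id.

    `read_id` is a hypothetical callable: it receives an open binary
    file object and returns the id as a string.
    """
    moved = []
    # sorted() materialises the listing before any file is moved.
    for path in sorted(Path(folder).glob("*.pdf")):
        # The with-block closes the handle when it ends, even if
        # read_id raises, so the file is no longer open at rename time.
        with open(path, "rb") as fh:
            file_id = read_id(fh)
        target_dir = path.parent / file_id
        target_dir.mkdir(exist_ok=True)
        dest = target_dir / path.name
        os.rename(path, dest)
        moved.append(dest)
    return moved


def demo():
    # Stand-ins for real PDFs: each file simply contains its id as text.
    with tempfile.TemporaryDirectory() as tmp:
        for name, file_id in [("1.pdf", "3"), ("2.pdf", "5"), ("3.pdf", "10")]:
            Path(tmp, name).write_text(file_id)
        moved = move_by_id(tmp, lambda fh: fh.read().decode().strip())
        # Report paths relative to the temp folder, with forward slashes.
        return [str(p.relative_to(tmp)).replace(os.sep, "/") for p in moved]


if __name__ == "__main__":
    print(demo())
```

The point of the design is that the rename happens only after the with block has finished, so the ordering "read, close, then move" is enforced by the structure of the code rather than by remembering to call close().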
Upvotes: 2