Reputation: 319
I'm trying to consider the best way of running through a directory tree and identifying specific excel files, then moving on to manipulating them in pandas.
I tried to identify the files I want by scanning for the names of the files, (data) but I realised it would be far more effective if I was able to identify files by their authors. How would I be able to redo the example below from searching 'data' to searching for the author of the file?
I added file.lower() in my example as some files might contain Data or DATA in the file name. If there is a better way of doing this, and if there is a good resource for learning more about manipulating files as described in my post, I would be grateful to hear about it.
import os
import shutil
for folderName, subfolders, filenames in os.walk(r'dir\Documents'):
for file in filenames:
file.lower()
if 'data' in file:
try: shutil.copy(os.path.join(folderName, file), 'C:\\dir\ALL DATA')
except:
print(folderName, file)
Upvotes: 1
Views: 814
Reputation: 911
This will search the directory for excel files that have the creator_name in the excel file metadata:
import os
import zipfile, lxml.etree
import shutil
def find_excel_files(directory, creator_name):
for folderName, subfolders, filenames in os.walk(directory):
for f in filenames:
if f.endswith('.xlsx'):
f_path = os.path.join(folderName, f)
f_creators = get_xlsx_creators(f_path)
if creator_name in f_creators:
# One of the creators of the excel file matches creator_name
# Do something like copy the file somewhere...
print('Found a match: {}'.format(f_path))
def get_xlsx_creators(xlsx_path):
# open excel file (xlsx files are just zipfiles)
zf = zipfile.ZipFile(xlsx_path)
# use lxml to parse the xml file we are interested in
doc = lxml.etree.fromstring(zf.read('docProps/core.xml'))
# retrieve creator
ns={'dc': 'http://purl.org/dc/elements/1.1/'}
creators = doc.xpath('//dc:creator', namespaces=ns)
return [c.text for c in creators]
find_excel_files(r'C:\Users\ayb\Desktop\tt', 'Tim')
The code for the get_xlsx_creators function was taken from this SO answer: How to retrieve author of a office file in python?
Upvotes: 4
Reputation: 109536
from pwd import getpwuid
for file in filenames:
author = getpwuid(os.stat(file).st_uid).pw_name
if author == '...':
...
Upvotes: 1