Iwan
Iwan

Reputation: 319

Using the author of files to utilize the files in Python

I'm trying to consider the best way of running through a directory tree and identifying specific excel files, then moving on to manipulating them in pandas.

I tried to identify the files I want by scanning for the names of the files, (data) but I realised it would be far more effective if I was able to identify files by their authors. How would I be able to redo the example below from searching 'data' to searching for the author of the file?

I added file.lower() in my example as some files might contain Data or DATA in the file name. If there is a better way of doing this, and if there is a good resource for learning more about manipulating files as described in my post, I would be grateful to hear about it.

import os
import shutil
for folderName, subfolders, filenames in os.walk(r'dir\Documents'):

        for file in filenames:
                file.lower()
                if 'data' in file:
                        try: shutil.copy(os.path.join(folderName, file), 'C:\\dir\ALL DATA')
                        except: 
                                print(folderName, file)

Upvotes: 1

Views: 814

Answers (2)

LeopoldVonBuschLight
LeopoldVonBuschLight

Reputation: 911

This will search the directory for excel files that have the creator_name in the excel file metadata:

import os
import zipfile, lxml.etree
import shutil

def find_excel_files(directory, creator_name):
    for folderName, subfolders, filenames in os.walk(directory):
        for f in filenames:
            if f.endswith('.xlsx'):
                f_path = os.path.join(folderName, f)
                f_creators = get_xlsx_creators(f_path)
                if creator_name in f_creators:
                    # One of the creators of the excel file matches creator_name
                    # Do something like copy the file somewhere...
                    print('Found a match: {}'.format(f_path))



def get_xlsx_creators(xlsx_path):
    # open excel file (xlsx files are just zipfiles)
    zf = zipfile.ZipFile(xlsx_path)
    # use lxml to parse the xml file we are interested in
    doc = lxml.etree.fromstring(zf.read('docProps/core.xml'))
    # retrieve creator
    ns={'dc': 'http://purl.org/dc/elements/1.1/'}
    creators = doc.xpath('//dc:creator', namespaces=ns)
    return [c.text for c in creators]



find_excel_files(r'C:\Users\ayb\Desktop\tt', 'Tim')

The code for the get_xlsx_creators function was taken from this SO answer: How to retrieve author of a office file in python?

Upvotes: 4

Alexander
Alexander

Reputation: 109536

from pwd import getpwuid

for file in filenames:
    author = getpwuid(os.stat(file).st_uid).pw_name
    if author == '...':
        ...

Upvotes: 1

Related Questions