Morgan Nilsen
Morgan Nilsen

Reputation: 11

Open all underlying files from tree structure

I have loads of html files that I want to merge to one single file. File path is /Desktop/Username/My_files/. This folder contains 1300 different folders, and all of these folders has message.html file.

Instead of copying them one by one I want to solve this using Python. My code works if the message.html is in the folder, but is unable to read the contents of the underlying folder structure. The bolded part of the code needs changing, but how can it be corrected most easily?

import re, sys, glob
out = open("cleaned.txt", 'r')
**path = '/Home/Username/Desktop/My_files/*.html'**
files = glob.glob(path)
for file in files:
        f = open(file, 'r')
        data = f.read().replace("\n", ' ')
        cleaner = re.compile('<.*?>')
        cleantext = re.sub(cleaner, "\n", data)
        out.write(cleantext)

Upvotes: 1

Views: 536

Answers (1)

MisterMiyagi
MisterMiyagi

Reputation: 50106

If all your files are just one folder level deep, you have simply misplaced your placeholder. Use a placeholder for the unknown folder, not the file name:

# match "message.html" in all direct subfolders of "/Home/Username/Desktop/My_files/"
path = '/Home/Username/Desktop/My_files/*/message.html'

Note that if the filename is not constant either, glob also takes several placeholders:

# match any html file in all direct subfolders of "/Home/Username/Desktop/My_files/"
path = '/Home/Username/Desktop/My_files/*/*.html'

Things are trickier if not only direct subfolders are needed. Since Python3.5, glob.glob supports a recursive placeholder:

If recursive is true, the pattern "**" will match any files and zero or more directories and subdirectories.

In your case, this would look like this:

# match "message.html" in all subfolders of "/Home/Username/Desktop/My_files/"
path = '/Home/Username/Desktop/My_files/**/message.html'
files = glob.glob(path, recursive=True)

On an older Python version, you should walk the directories yourself. The os.walk function allows you to recursively inspect all files in subdirectories.

The following provides the full path to each file with a fixed name from a base directory:

def find_files(base_path, file_name):
    """Yield all paths to any file named ``file_name`` in a subdirectory of ``base_path``"""
    # recursively walk through all subdirectories
    for dirpath, dirnames, filenames in os.walk(base_path):
        # test if the file name occurs in the current subdirectory
        if file_name in filenames:
            yield os.path.join(base_path, dirpath, file_name)

You can use it in place of your glob result:

files = find_files('/Home/Username/Desktop/My_files/', 'message.html')
for file in files:
   ...

Upvotes: 2

Related Questions