vanziro

Reputation: 23

Recursive Regex with Joined Results

My input files are HTML files with no extension. The desired output is every URL matched by my regex across all files under root_dir, with the results joined in a single file. My regex works, and I can output the results from a single file:

import re
with open('/Users/files/filename') as f:
    for line in f:
        urls = re.findall (r"([\w%~\+-=]*\.mp3)", line);
        print (*urls)

I could use glob, but I am unsure how to combine it with the regex:

import glob
import re
root_dir = '/Users/files/'
for filename in glob.iglob(root_dir + '**/*.*', recursive=True):
        urls = re.findall (r"([\w%~\+-=]*\.mp3)", line);
        print (*urls)

Upvotes: 2

Views: 67

Answers (1)

Ryszard Czech

Reputation: 18611

Use glob to collect the files in root_dir, then write every match to a single output file:

import re, glob                                 # Import the libraries

root_dir = r'/Users/files'                      # Set root directory
save_to_file = r'/Users/urls_extracted.txt'     # File path to save results to
all_files = glob.glob("{}/*".format(root_dir))  # List the paths directly inside root_dir (not recursive)

with open(save_to_file, 'w') as fw:             # Open stream to write to
  for filename in all_files:                    # Iterate over the files
    with open(filename, 'r') as fr:             # Open file to read from  
      for url in re.findall(r"[\w%~+\-=]*\.mp3", fr.read()): # Get all matches and iterate over them
        fw.write("{}\n".format(url))            # Write each URL to write stream
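
The snippet above only looks at entries directly inside root_dir. Since the question asks for a recursive search, here is a minimal sketch of one way to extend it, not a definitive implementation. The function name collect_mp3_urls is hypothetical, and reading the files as UTF-8 while ignoring decode errors is an assumption about the input. It uses the pattern '**/*' rather than '**/*.*', because '*.*' only matches names that contain a dot and the files here have no extension.

```python
import glob
import os
import re

# Same pattern as in the answer above (dash escaped inside the character class).
MP3_PATTERN = re.compile(r"[\w%~+\-=]*\.mp3")


def collect_mp3_urls(root_dir, save_to_file):
    """Write every match found under root_dir to save_to_file, one per line.

    Returns the number of matches written. (Hypothetical helper for illustration.)
    """
    count = 0
    # '**' with recursive=True matches root_dir itself and all subdirectories.
    pattern = os.path.join(glob.escape(root_dir), "**", "*")
    out_path = os.path.abspath(save_to_file)
    with open(save_to_file, "w", encoding="utf-8") as fw:
        for filename in sorted(glob.iglob(pattern, recursive=True)):
            # Skip directories, and skip the output file if it lives under root_dir.
            if not os.path.isfile(filename) or os.path.abspath(filename) == out_path:
                continue
            # Assumption: files are readable as UTF-8; undecodable bytes are dropped.
            with open(filename, "r", encoding="utf-8", errors="ignore") as fr:
                for url in MP3_PATTERN.findall(fr.read()):
                    fw.write(url + "\n")
                    count += 1
    return count
```

With the paths from the answer, this would be called as collect_mp3_urls('/Users/files', '/Users/urls_extracted.txt'). By default glob does not match names starting with a dot, so hidden files are skipped.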

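Note that a dash inside a character class must be escaped (or placed first or last) if you mean a literal - character and not a range. In [\w%~\+-=], the unescaped +-= is read as the range of characters from + to =.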
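
To see what the range does in practice, here is a small sketch that runs both patterns on one made-up sample string (the example.com URL is purely illustrative). The range from + to = covers the characters + , - . / 0-9 : ; < =, so the unescaped pattern also accepts ':', '/' and '.', while the escaped pattern accepts only the three literal characters +, - and =.

```python
import re

# Made-up sample line; the URL is illustrative only.
sample = 'href="http://example.com/my-song.mp3"'

# Unescaped dash: "+-=" is a range from "+" (0x2B) to "=" (0x3D).
unescaped = re.findall(r"[\w%~\+-=]*\.mp3", sample)

# Escaped dash: the class contains the literal characters +, - and =.
escaped = re.findall(r"[\w%~+\-=]*\.mp3", sample)

print(unescaped)  # ['http://example.com/my-song.mp3']
print(escaped)    # ['my-song.mp3']
```

Which result is wanted depends on the goal: the escaped pattern returns only the file name, whereas the range version happens to capture the scheme and host as well. If full URLs are needed, it is clearer to list ':', '/' and '.' in the class explicitly than to rely on the accidental range.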

Upvotes: 1
