My input files are HTML files with no extension. The desired output is the regex-matched URLs from all files under root_dir, joined into a single results file. My regex works and I can output results from a single file:
import re

with open('/Users/files/filename') as f:
    for line in f:
        urls = re.findall(r"([\w%~\+-=]*\.mp3)", line)
        print(*urls)
I could use glob, but I am unsure how to:
import glob
import re

root_dir = '/Users/files/'
for filename in glob.iglob(root_dir + '**/*.*', recursive=True):
    urls = re.findall(r"([\w%~\+-=]*\.mp3)", line)
    print(*urls)
Use:
import re, glob  # Import the libraries
root_dir = r'/Users/files'  # Set root directory
save_to_file = r'/Users/urls_extracted.txt'  # File path to save results to
all_files = glob.glob("{}/*".format(root_dir))  # Get a list of file paths
with open(save_to_file, 'w') as fw:  # Open stream to write to
    for filename in all_files:  # Iterate over the files
        with open(filename, 'r') as fr:  # Open file to read from
            for url in re.findall(r"[\w%~+\-=]*\.mp3", fr.read()):  # Get all matches and iterate over them
                fw.write("{}\n".format(url))  # Write each URL to the write stream
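
The question also asks for a recursive '**' search. A minimal sketch of that variant, assuming Python 3.5+ and the same root_dir and output path as above, could look like this; os.path.isfile skips the directories that the glob also yields, and errors='ignore' is an optional guard against undecodable bytes:

import glob
import os
import re

root_dir = r'/Users/files'  # Same assumed root directory as above
save_to_file = r'/Users/urls_extracted.txt'  # Same assumed output path as above

with open(save_to_file, 'w') as fw:
    # '**' with recursive=True descends into subdirectories (Python 3.5+);
    # '*' (rather than '*.*') also matches the extensionless files from the question
    for filename in glob.iglob(os.path.join(root_dir, '**', '*'), recursive=True):
        if not os.path.isfile(filename):  # the glob also yields directories; skip them
            continue
        with open(filename, 'r', errors='ignore') as fr:
            for url in re.findall(r"[\w%~+\-=]*\.mp3", fr.read()):
                fw.write("{}\n".format(url))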
Note that the dash must be escaped in the regular expression if you meant a literal - character and not a range.
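
To illustrate the difference on a hypothetical input line: the unescaped +-= forms a character range (U+002B to U+003D) that silently matches digits, '.', '/' and ':', while the escaped form only adds the three literal characters:

import re

line = "see track/01.mp3 here"  # Hypothetical input

# Unescaped: "+-=" is a range, so '/' and '.' are matched by the class as well
print(re.findall(r"[\w%~\+-=]*\.mp3", line))  # ['track/01.mp3']

# Escaped: only the literal characters +, - and = are added to the class
print(re.findall(r"[\w%~+\-=]*\.mp3", line))  # ['01.mp3']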