Reputation: 11
I have some HTML files that contain links to files whose names include spaces. For example,
The rain in spain ...
<a href="/path/filename with space.xls">Filename</a>
falls mainly on the plain.
<a href="/path/2nd filename with space.doc">2nd Filename</a>
There are often multiple links like this within a file. I would like to replace the spaces with underscores in just the filename itself, without touching spaces elsewhere in the file. For example:
<a href="/path/filename_with_space.xls">Filename</a>
I have tried sed, but I can't seem to restrict the substitution to the text between two regex patterns (sed seems to work line by line).
Any assistance would be appreciated.
Upvotes: 1
Views: 96
Reputation: 27247
Do not use regex for this problem; use an HTML parser. Here is a solution in Python with BeautifulSoup:
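from bs4 import BeautifulSoup  # BeautifulSoup 4; version 3 used "from BeautifulSoup import BeautifulSoup"

with open('Path/to/file', 'r') as content_file:
    content = content_file.read()

soup = BeautifulSoup(content, 'html.parser')

# Rewrite only the href attribute of each link; text elsewhere is untouched.
# href=True skips anchors without an href, which would otherwise raise KeyError.
for a in soup.find_all('a', href=True):
    a['href'] = a['href'].replace(" ", "_")

with open('Path/to/file.modified', 'w') as output_file:
    output_file.write(str(soup))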
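Note that the loop above replaces every space in the href, including any in directory names. If only the filename component should change, a small helper along these lines might do. This is a minimal sketch using only the standard library: underscore_filename is a hypothetical name, and it assumes the href is a plain path with no query string or fragment.

```python
import posixpath


def underscore_filename(href: str) -> str:
    """Replace spaces with underscores in the last path component only.

    Hypothetical helper: assumes href is a plain '/'-separated path
    without a query string or fragment.
    """
    # Split into everything before the last '/' and the final component.
    directory, filename = posixpath.split(href)
    # Only the final component (the filename) is modified.
    return posixpath.join(directory, filename.replace(" ", "_"))
```

Inside the loop, the assignment would then become a['href'] = underscore_filename(a['href']).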
Upvotes: 3