user2241818

Reputation: 11

Replace spaces in substrings in an HTML file

I have some HTML files that contain links to files whose names include spaces. For example:

The rain in spain ... 
<a href="/path/filename with space.xls">Filename</a>
falls mainly on the plain.

<a href="/path/2nd filename with space.doc">2nd Filename</a>

There are often multiple links like this within the file. I would like to replace the spaces within just the filename itself but not touch spaces elsewhere in the file. For example:

<a href="/path/filename_with_space.xls">Filename</a>

I have tried sed, but I can't seem to restrict the substitution to the text between two regex patterns (sed seems to work line by line).

Any assistance would be appreciated.

Upvotes: 1

Views: 96

Answers (1)

000

Reputation: 27247

Do not use regex for this problem. Use an HTML parser. Here is a solution in Python with BeautifulSoup:

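# BeautifulSoup 4 (package "beautifulsoup4"); BeautifulSoup 3 used
# "from BeautifulSoup import BeautifulSoup" and findAll instead.
from bs4 import BeautifulSoup

# Read the whole HTML file into memory
with open('Path/to/file', 'r') as content_file:
    content = content_file.read()

soup = BeautifulSoup(content, 'html.parser')
# href=True skips <a> tags that have no href attribute
for a in soup.find_all('a', href=True):
    a['href'] = a['href'].replace(" ", "_")

# Write the modified document to a new file
with open('Path/to/file.modified', 'w') as output_file:
    output_file.write(str(soup))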
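
If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser. This is only a rough sketch, not a drop-in replacement: the names AnchorHrefCollector and underscore_hrefs are made up for illustration, and it assumes each href value appears literally in the tag (no character references such as &amp; inside it). The parser is used only to find the <a> start tags, and the rest of the file is left as it was.

```python
from html.parser import HTMLParser


class AnchorHrefCollector(HTMLParser):
    """Record every <a> start tag whose href contains a space (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        # Maps the raw start-tag text to the same text with the href fixed.
        self.replacements = {}

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and " " in href:
            raw = self.get_starttag_text()
            # Assumption: the href appears literally in the tag text
            # (no character references such as &amp; inside it).
            self.replacements[raw] = raw.replace(href, href.replace(" ", "_"))


def underscore_hrefs(html):
    """Return html with spaces in <a href="..."> values replaced by underscores."""
    collector = AnchorHrefCollector()
    collector.feed(html)
    collector.close()
    # Swap each recorded start tag for its fixed version; other text is untouched.
    for raw, fixed in collector.replacements.items():
        html = html.replace(raw, fixed)
    return html


sample = 'The rain in spain ...\n<a href="/path/filename with space.xls">Filename</a>\nfalls mainly on the plain.'
print(underscore_hrefs(sample))
```

Because only the matched start tags are rewritten, spaces in the link text and in the surrounding prose stay as they are, which is what the question asks for.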

Upvotes: 3
