thisiscrazy4

Reputation: 1965

Remove/replace html tags in bash

I have a file with lines that contain:

<li><b> Some Text:</b> More Text </li>

I want to remove the html tags and replace the </b> tag with a dash so it becomes like this:

Some Text:- More Text

I'm trying to use sed, but I can't find the proper regex combination.

Upvotes: 11

Views: 17096

Answers (3)

Jacopo Pace

Reputation: 492

In Bash you can also use a CLI browser or PHP to parse and clean HTML. They are probably more effective than sed in this context, but you have to install them first:

W3M:

echo '<li><b> Some Text:</b> More Text </li>' | w3m -dump -T text/html

Lynx:

echo '<li><b> Some Text:</b> More Text </li>' | lynx -dump -stdin

Links:

echo '<li><b> Some Text:</b> More Text </li>' | links -dump

(Credits here)

PHP:

echo '<li>My html</li>' | php -r 'echo strip_tags(file_get_contents("php://stdin"));'

Remember:

  • The CLI browser solutions literally render the HTML into text, removing any kind of code (inline JS and CSS as well) and formatting the result as closely as they can to how the HTML would be displayed.

  • The PHP solution, using strip_tags, just removes the HTML tags and keeps all the "non-HTML" content (spaces, tabs, inline CSS and JS code...), more or less as most sed/regex solutions would.
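
The second point can be seen with sed alone. Here is a minimal sketch, assuming a made-up input line that contains an inline script (the sample HTML and the variable names are illustrative and not part of the question):

```shell
# Hypothetical sample line with an inline <script> element.
html='<p>Hi</p><script>var x = 1;</script>'

# Deleting everything between < and > removes the tags themselves,
# but the JavaScript between <script> and </script> is plain text and stays.
stripped=$(printf '%s\n' "$html" | sed 's/<[^>]*>//g')

printf '%s\n' "$stripped"
```

The tags are gone, but the script body is still in the output, which is the kind of leftover a rendering tool such as a CLI browser would drop.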

Upvotes: 0

newfurniturey

Reputation: 38416

If you strictly want to strip all HTML tags, but at the same time only replace the </b> tag with a -, you can chain two simple sed commands with a pipe:

cat your_file | sed 's|</b>|-|g' | sed 's|<[^>]*>||g' > stripped_file

This passes the file's contents to the first sed command, which replaces each </b> with a -. That output is then piped to a second sed that replaces all remaining HTML tags with empty strings. The final output is saved to the new file stripped_file.

Using a method similar to the other answer from @Steve, you could also use sed's -e option to chain the expressions into a single (non-piped) command; by adding -i, you can also edit the original file in place, without the need for cat or a new file:

sed -i -e 's|</b>|-|g' -e 's|<[^>]*>||g' your_file

This does the same replacement as the piped commands above, but this time it directly replaces the contents of the input file. To save to a new file instead, remove the -i and add > stripped_file to the end (or whatever file name you choose).
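
To compare the two variants safely, here is a rough sketch that runs both against a throwaway file. It assumes GNU sed (BSD/macOS sed expects a suffix argument after -i), and the temporary file and the variable names to_stdout and in_place are only for illustration:

```shell
# Work on a throwaway file so nothing real is overwritten (path comes from mktemp).
your_file=$(mktemp)
printf '%s\n' '<li><b> Some Text:</b> More Text </li>' > "$your_file"

# Without -i: the file is left untouched and the result goes to stdout.
to_stdout=$(sed -e 's|</b>|-|g' -e 's|<[^>]*>||g' "$your_file")

# With -i (GNU sed syntax): the file itself is rewritten in place.
sed -i -e 's|</b>|-|g' -e 's|<[^>]*>||g' "$your_file"
in_place=$(cat "$your_file")

printf '[%s]\n[%s]\n' "$to_stdout" "$in_place"
rm -f "$your_file"
```

Both variables should hold the same text; the brackets in the final printf just make the leading and trailing spaces visible.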

Upvotes: 18

Steve

Reputation: 54392

One way using GNU sed:

sed -e 's/<\/b>/-/g' -e 's/<[^>]*>//g' file.txt

Example:

echo "<li><b> Some Text:</b> More Text </li>" | sed -e 's/<\/b>/-/g' -e 's/<[^>]*>//g'

Result:

 Some Text:- More Text
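
The result above keeps the spaces that surrounded the tags, while the question asks for output with no leading space. One possible extension, sketched here with two extra expressions that trim whitespace at both ends of each line (clean_line is a hypothetical helper name, not something from the answers):

```shell
# clean_line is a hypothetical helper; it reads HTML lines on stdin.
clean_line() {
  sed -e 's/<\/b>/-/g' \
      -e 's/<[^>]*>//g' \
      -e 's/^[[:space:]]*//' \
      -e 's/[[:space:]]*$//'
}

result=$(echo '<li><b> Some Text:</b> More Text </li>' | clean_line)
printf '%s\n' "$result"
```

The order matters: the </b> replacement has to run before the generic tag removal, and the trimming has to come last so it sees the spaces the removed tags leave behind.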

Upvotes: 0
