Reputation: 1965
I have a file with lines that contain:
<li><b> Some Text:</b> More Text </li>
I want to remove the HTML tags and replace the </b> tag with a dash, so it becomes like this:
Some Text:- More Text
I'm trying to use sed, but I can't find the proper regex combination.
Upvotes: 11
Views: 17096
Reputation: 492
In Bash you can also use a CLI browser or PHP to parse and clean any HTML. They are probably much more effective than sed in this context, but of course you have to install them first:
W3M:
echo '<li><b> Some Text:</b> More Text </li>' | w3m -dump -T text/html
Lynx:
echo '<li><b> Some Text:</b> More Text </li>' | lynx -stdin -dump
Links:
echo '<li><b> Some Text:</b> More Text </li>' | links -dump
PHP:
echo '<li>My html</li>' | php -r 'echo strip_tags(file_get_contents("php://stdin"));'
Remember:
The CLI browser solutions literally render the HTML into text, removing any kind of code (inline JS and CSS as well) and formatting the result as best they can according to HTML rules.
The PHP solution, using strip_tags, just removes all HTML tags and keeps all "non-HTML" content (spaces, tabs, inline CSS and JS code...), more or less like most sed/regex solutions would.
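To make that last point concrete, here is a minimal sketch of what a plain tag-stripping regex leaves behind. The sample input, including the inline script, is made up for illustration:

```shell
# Hypothetical sample input: one list item followed by an inline script.
html='<li>My html</li><script>alert(1);</script>'

# Remove anything that looks like a tag; the text between tags is left untouched,
# so the JavaScript source survives as plain text.
stripped=$(printf '%s\n' "$html" | sed 's/<[^>]*>//g')

printf '%s\n' "$stripped"
# prints: My htmlalert(1);
```

So if the input may contain inline scripts or styles, a tag-stripping regex alone is probably not enough.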
Upvotes: 0
Reputation: 38416
If you strictly want to strip all HTML tags, but at the same time replace only the </b> tag with a -, you can chain two simple sed commands with a pipe:
cat your_file | sed 's|</b>|-|g' | sed 's|<[^>]*>||g' > stripped_file
This passes the file's contents to the first sed command, which replaces each </b> with a -. Its output is then piped to a second sed that replaces all remaining HTML tags with empty strings. The final output is saved to the new file stripped_file.
Using a method similar to the other answer from @Steve, you could also use sed's -e option to chain expressions into a single (non-piped) command; by adding -i, you can also read in and replace the contents of your original file without the need for cat or a new file:
sed -i -e 's|</b>|-|g' -e 's|<[^>]*>||g' your_file
This does the same replacement as the chained command above, but this time it directly replaces the contents of the input file. To save to a new file instead, remove the -i and add > stripped_file to the end (or whatever file name you choose).
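If you want the in-place edit but would rather keep a copy of the original, one option is to give -i a backup suffix. This is only a sketch: the temporary file stands in for your_file, and the .bak suffix is an arbitrary choice. As far as I know, the attached form -i.bak is accepted by both GNU and BSD sed.

```shell
# Hypothetical file created just for this demonstration.
your_file=$(mktemp)
printf '%s\n' '<li><b> Some Text:</b> More Text </li>' > "$your_file"

# -i.bak edits the file in place and keeps the original as "$your_file.bak".
sed -i.bak -e 's|</b>|-|g' -e 's|<[^>]*>||g' "$your_file"

cat "$your_file"      # the stripped text
cat "$your_file.bak"  # the untouched original
```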
Upvotes: 18
Reputation: 54392
One way using GNU sed:
sed -e 's/<\/b>/-/g' -e 's/<[^>]*>//g' file.txt
Example:
echo "<li><b> Some Text:</b> More Text </li>" | sed -e 's/<\/b>/-/g' -e 's/<[^>]*>//g'
Result:
Some Text:- More Text
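Strictly speaking, the two expressions leave the space that followed <b> and the one before </li> in place. If that matters, a possible variation (just a sketch, with two extra expressions that are my own addition) trims leading and trailing whitespace as well:

```shell
line='<li><b> Some Text:</b> More Text </li>'

# 1) </b> becomes "-"   2) remaining tags are dropped
# 3) leading whitespace is trimmed   4) trailing whitespace is trimmed
clean=$(printf '%s\n' "$line" | sed -e 's/<\/b>/-/g' -e 's/<[^>]*>//g' \
    -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')

# Brackets make the absence of surrounding spaces visible.
printf '[%s]\n' "$clean"
# prints: [Some Text:- More Text]
```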
Upvotes: 0