
awk sed grep

Lacer

Reputation: 5958

sed or other - remove specific html tag text from file

I am trying to remove specific HTML closing tags from a file.

Question:

How do I get the desired result?
Is sed the right tool for this?

file: test1.txt

Hello World
</body>
</html>

The sed command I tried:

sed -e 's/<\/body>\\n<\/html>\\n//' test1.txt > test2.txt

Desired result in test2.txt

Hello World

Actual result in test2.txt

Hello World
</body>
</html>

Upvotes: 1

Answers (5)

RavinderSingh13

Reputation: 133458

With your shown samples, you could try the following in awk (if awk is OK for you). Setting RS to ^$ makes GNU awk read the whole file as a single record, so $0 contains the newlines. The match function then finds the </body> line followed by the </html> line, and the code prints everything before and after that match.

awk -v RS="^$" '
match($0,/(^|\n)<\/body>\n<\/html>/){
  print substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)
}
'  Input_file
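The RS="^$" idiom relies on awk treating RS as a regular expression, as GNU awk does. As a rough alternative sketch, assuming only POSIX awk features, the same pair of lines can be dropped by holding back a </body> line until the next line is known. The temporary file and the variable names (tmp, result, held) are made up for illustration.

```shell
#!/bin/sh
# Sample input from the question, written to a temporary file (hypothetical path).
tmp=$(mktemp)
printf 'Hello World\n</body>\n</html>\n' > "$tmp"

# Hold back a "</body>" line; drop it only if the next line is "</html>".
result=$(awk '
held {                         # a "</body>" line is pending
  held = 0
  if ($0 == "</html>") next    # pair found: drop both lines
  print "</body>"              # not a pair: release the held line
}
$0 == "</body>" { held = 1; next }
{ print }
END { if (held) print "</body>" }  # input ended right after "</body>"
' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Unlike the whole-file approach, this reads one line at a time, so it does not need to keep the entire file in memory.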

Upvotes: 4

anubhava

Reputation: 784998

Should I be using the sed command for desired results?

Actually, grep is a better fit here:

grep -Ev '</(body|html)>' file

Hello World
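One caveat: this grep filter drops every line that contains either closing tag, not only lines that consist of the tag alone. A minimal sketch comparing it with a whole-line match via -x; the extra "see the </body> tag" line is made up for illustration and is not in the original sample.

```shell
#!/bin/sh
# Sample input plus one made-up line that mentions </body> inline.
sample='Hello World
see the </body> tag
</body>
</html>'

# Substring match: every line containing either tag is dropped.
loose=$(printf '%s\n' "$sample" | grep -Ev '</(body|html)>')

# -x requires the whole line to match, so the inline mention survives.
strict=$(printf '%s\n' "$sample" | grep -Evx '</(body|html)>')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

For the file in the question both forms give the same result; the difference only shows up when a tag shares a line with other text.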

If you want to remove only the specific </body>\n</html>\n string, then use this sed, which should work with any version of sed:

sed '/<\/body>/{N; /<\/html>/ {N; s~</body>\n</html>\n~~;};}' file

Hello World

Upvotes: 3

The fourth bird

Reputation: 163217

Another variant using sed:

sed '/<\/body>/{N;/\n<\/html>/d}' test1.txt > test2.txt

Match </body> and pull the next line into the pattern space using N. Then match on a newline followed by </html>.

If that matches, use d to delete what is in the pattern space.
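To see what N actually puts into the pattern space, the l command can print it in an unambiguous form. This is only an inspection sketch, not part of the solution; the variable names are made up.

```shell
#!/bin/sh
sample='Hello World
</body>
</html>'

# -n suppresses normal output. After N, "l" prints the pattern space
# with the embedded newline shown as \n and the end marked with $.
shown=$(printf '%s\n' "$sample" | sed -n '/<\/body>/{N;l;}')

printf '%s\n' "$shown"
# prints: </body>\n</html>$
```

The \n between the two tags is the newline that the second address, /\n<\/html>/, matches against.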

The content of test2.txt:

Hello World

Upvotes: 3

sseLtaH

Reputation: 11207

Using sed

$ sed -E '\~^</(body|html)>~d' input_file
Hello World

Upvotes: 3

Wiktor Stribiżew

Reputation: 626738

With GNU sed, you can use the -z option to match newlines:

sed -z -i 's#</body>\n</html>##g' file

Note that # is chosen as the regex delimiter to avoid having to escape /. Also, -i edits the input file in place.
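A small sketch of why -z matters, assuming GNU sed: without it, sed matches one line at a time, so a pattern containing \n never matches. This variant also consumes the newline after </html> so that no empty line is left behind; the variable names are made up.

```shell
#!/bin/sh
# Assumes GNU sed (-z is a GNU extension).
s='Hello World
</body>
</html>'

# Without -z each line is matched separately, so \n never matches
# and all three lines pass through unchanged.
line_mode=$(printf '%s\n' "$s" | sed 's#</body>\n</html>\n##' | wc -l)

# With -z the whole input is one record; the trailing \n in the
# pattern removes the final newline along with the two tags.
whole_mode=$(printf '%s\n' "$s" | sed -z 's#</body>\n</html>\n##' | wc -l)

echo "line mode: $line_mode lines, -z mode: $whole_mode lines"
```

Because -z reads the whole input as a single record, it is best suited to files that fit comfortably in memory.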

See an online demo:

#!/bin/bash
s='Hello World
</body>
</html>'
sed -z 's#</body>\n</html>##g' <<< "$s"

Output:

Hello World

Upvotes: 3
