fairman

Reputation: 45

How to use sed to remove some characters from file?

I have this content in a file:

<pre class="bbCodeCode" dir="ltr" data-xf-init="code-block" data-lang=""><code>-Fix numcer one/Two
-EMM Support
-Fix update &lt; broken
-Add support patch</code></pre>
</div>
</div><b><br />

I need to remove the surrounding markup and keep only these lines:

-Fix numcer one/Two
-EMM Support
-Fix update &lt; broken
-Add support patch

I have tried this script:

#!/bin/bash
sed -n '/>-/,/</p' /home/Desktop/1 > /home/Desktop/2
sed -n '/^-*code>/p' /home/raed/Desktop/2  > /home/Desktop/3
sed -i 's#</code></pre>##' /home/Desktop/3
exit

But this script removes the first line:

-Fix numcer one/Two

Upvotes: 2

Views: 41

Answers (2)

RavinderSingh13

Reputation: 133760

1st solution: With GNU awk, and with your shown samples, try the following code.

awk -v RS="^$" '
match($0,/(^|\n)<pre class="[^"]*".*<code>(-.*)<\/code>/,arr){
  print arr[2]
}
'  Input_file

Explanation: Setting RS to ^$ makes GNU awk read the whole file as a single record. The match function then applies the regex (^|\n)<pre class="[^"]*".*<code>(-.*)<\/code>, which has 2 capturing groups, and stores the captured values in the array named arr. If the regex matches, arr[2] holds the text between <code> and </code>, leading dash included, so printing it gives the desired lines.



2nd solution: With GNU sed, using the -z and -E options, try the following code.

sed -zE 's/(^|\n)<pre class="[^"]*".*<code>(-.*)<\/code>.*/\2/' Input_file

Or, if your sed supports \n in the replacement, this slight variation also ends the output with a newline:

sed -zE 's/(^|\n)<pre class="[^"]*".*<code>(-.*)<\/code>.*/\2\n/' Input_file
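To try the -z approach on the sample from the question, here is a minimal sketch. It assumes GNU sed; the temporary file and the variable names sample and extracted are only illustrative. The leading dash sits inside the second capture group so that the first line keeps it.

```shell
#!/bin/bash
# Hypothetical sample file reproducing the input from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<pre class="bbCodeCode" dir="ltr" data-xf-init="code-block" data-lang=""><code>-Fix numcer one/Two
-EMM Support
-Fix update &lt; broken
-Add support patch</code></pre>
</div>
</div><b><br />
EOF

# -z: read the whole file as one record (GNU sed); -E: extended regex.
# Group 2 captures everything from the first dash up to </code>.
extracted=$(sed -zE 's/(^|\n)<pre class="[^"]*".*<code>(-.*)<\/code>.*/\2\n/' "$sample")
printf '%s\n' "$extracted"
rm -f "$sample"
```

Because the whole file is one record here, the trailing .* also swallows the </div> lines that follow </code>.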


3rd solution: With GNU grep, try the following code:

grep -zoP '(^|\n)<pre class="[^"]*".*?<code>\K-(.*?\n[^\n]+)+(?=</code>)'  Input_file


4th solution: If you want to stay close to your own approach (it looks like you may not have the GNU version of sed), here is a plain line-by-line sed. It validates the data less strictly than my previous solutions, but it will do the job as long as your Input_file always has the same layout as the sample.

sed -En '/^<pre class/s/^<pre class="[^"]*".*<code>(-.*)$/\1/; /^-/{s/<\/code>.*//; p;}'  Input_file
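As a rough check of the line-by-line idea, the sketch below runs it on the sample from the question. The temporary file and the names sample and lines are illustrative only, and the command is written so that the first line is reduced to its dash-prefixed text and then printed once by the /^-/ block.

```shell
#!/bin/bash
# Hypothetical sample file reproducing the input from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<pre class="bbCodeCode" dir="ltr" data-xf-init="code-block" data-lang=""><code>-Fix numcer one/Two
-EMM Support
-Fix update &lt; broken
-Add support patch</code></pre>
</div>
</div><b><br />
EOF

# Step 1: on the <pre ...> line, keep only the text starting at the dash.
# Step 2: on every line that now starts with a dash, drop a trailing
#         </code>... and print the line. All other lines are suppressed by -n.
lines=$(sed -En '/^<pre class/s/^<pre class="[^"]*".*<code>(-.*)$/\1/; /^-/{s/<\/code>.*//; p;}' "$sample")
printf '%s\n' "$lines"
rm -f "$sample"
```

This only works while every wanted line starts with a dash, which matches the sample but is an assumption about the real file.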

Upvotes: 1

steffen

Reputation: 17058

Try this

sed 's/<[^>]*>//g' <file

It removes everything from each < up to the next > on the same line (sed works line by line).
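On the sample from the question this leaves two empty lines behind, because the lines that hold only </div> tags become blank. A minimal sketch that also drops those blank lines is shown below; the temporary file, the variable names sample and cleaned, and the extra /^$/d expression are additions for illustration and not part of the command above.

```shell
#!/bin/bash
# Hypothetical sample file reproducing the input from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<pre class="bbCodeCode" dir="ltr" data-xf-init="code-block" data-lang=""><code>-Fix numcer one/Two
-EMM Support
-Fix update &lt; broken
-Add support patch</code></pre>
</div>
</div><b><br />
EOF

# First expression: strip every <...> tag on each line.
# Second expression: delete lines that are empty after the tags are gone.
cleaned=$(sed -e 's/<[^>]*>//g' -e '/^$/d' "$sample")
printf '%s\n' "$cleaned"
rm -f "$sample"
```

The entity &lt; survives because it contains no literal <. A literal < in the text itself could be mistaken for the start of a tag and removed together with everything up to the next > on that line.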

Upvotes: 1
