alvas

Reputation: 122142

Combining remove tags regex and remove empty lines in sed - Unix

Given a markup file like this:

<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="2">India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.</seg>
<seg id="3">Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
<seg id="5">India is also reportedly hoping for a deal on defence collaboration between the two nations.</seg>
</p>
</doc>
<doc sysid="ref" docid="1018-lenta.ru" genre="news" origlang="ru">
<p>
<seg id="1">FANO Russia will hold a final Expert Session</seg>
<seg id="2">The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.</seg>
<seg id="3">The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.</seg>
<seg id="4">At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.</seg>
<seg id="5">In addition, participants at the event plan to discuss the rules for forming an expert panel, which is responsible for evaluating the work of scientific groups, as well as the criteria for carrying out evaluations.</seg>
<seg id="6">The third Expert Session will be the final meeting in a series of events on the formation of a unified approach for all three academies to the evaluation of the effectiveness of activities of scientific organizations.</seg>
<seg id="7">Over the past five months, we were able to achieve this, and the final version of the regulatory documents is undergoing approval.</seg>
<seg id="8">According to the plans for the upcoming session, we should complete the development of procedures for scientometric and expert analysis, and come to an agreement on the stages and timeframes for the evaluation process”, said the Head of FANO’s Expert-Analytical Department, Elena Aksenova.</seg>
<seg id="9">Representatives from more than one hundred Russian scientific institutes will take part in the event.</seg>
<seg id="10">It is expected that a resolution will be adopted based on its results.</seg>
<seg id="11">The meeting will begin at 10 am, Moscow time, on September 16, 2014, at the following address: 14 Solyanka Street, Moscow.</seg>
</p>
</doc>
</srcset>

I can remove the markup tags with sed, following "Sed remove tags from html file":

sed -e 's/<[^>]*>//g' file.txt 

That leaves empty lines in the output, so I also have to apply "Delete empty lines using SED":

sed -e 's/<[^>]*>//g' file.txt  | sed '/^\s*$/d'

How can I combine the tag-removal and empty-line-removal expressions into a single sed command?

Upvotes: 2

Views: 90

Answers (1)

midori

Reputation: 4837

What about deleting the empty lines right away, in the same sed invocation?

sed -e 's/<[^>]*>//g;/^\s*$/d' file.txt
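This works because sed applies its commands in order to each line: the substitution strips the tags first, and the `d` command then tests what is left of the line. Note that `\s` is a GNU sed extension, so a more portable spelling is probably `[[:space:]]`. A minimal sketch of that variant follows; the temporary file and the shortened sample input are assumptions for illustration, not the asker's real data:

```shell
# Hypothetical sample input, shortened from the question's file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
</p>
</doc>
EOF

# Two commands in one sed invocation:
#   1. s/<[^>]*>//g       strips every tag on the line
#   2. /^[[:space:]]*$/d  deletes the line if only whitespace (or nothing) remains
# [[:space:]] is the POSIX character class; \s is GNU-specific.
out=$(sed -e 's/<[^>]*>//g' -e '/^[[:space:]]*$/d' "$tmp")
printf '%s\n' "$out"

rm -f "$tmp"
```

Passing two `-e` expressions is equivalent to joining the commands with `;` as above. Keep in mind that this only handles tags that open and close on the same line.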

Upvotes: 2
