ZexalDaBeast

Reputation: 152

Use sed to remove multiple lines between 2 sets of characters

I'm using sed on macOS.

I have a set of very large financial 10-K files and I want to keep only the text.

Right now, I'm trying to remove all of the information between

<TYPE>XML

and

<DOCUMENT>

Usually there is a lot of information between the two, but here is what a sample looks like:

#Other things I want to keep
<TYPE>XML
<SEQUENCE>10
<FILENAME>rht-10qq3fy19_htm.xml
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
<XML>
<?xml version="1.0" encoding="utf-8"?>
<xbrl
...
<DOCUMENT>
#Some other text I need to keep

I've been trying to use sed without much success. I can only get it to remove matches that sit entirely on a single line, like

<TYPE>XML SOME WORDS SOME WORDS <DOCUMENT>

I used this code to get that to work:

sed -i '' 's/<TYPE>XML.*<DOCUMENT>//g' filename.txt

What should I change to get the result I want?

Once I can solve this, the other things I need to clean should also be easier. The solution doesn't have to use sed.

I'm using -i and '' at the beginning of the sed command because I'm on a Mac (BSD) and I'm modifying data in place.

Upvotes: 1

Views: 491

Answers (1)

frangaren

Reputation: 498

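If I've understood you correctly, this should work. It deletes every line from the one matching <TYPE>XML through the next one matching <DOCUMENT>, including both of those lines: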

sed '/<TYPE>XML/,/<DOCUMENT>/d' filename.txt
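
As written, that command prints the result to standard output and leaves the file unchanged. Here is a minimal sketch of an in-place variant, assuming a backup copy is acceptable. The file name filing_sample.txt and its contents are made up for illustration. The -i.bak form (suffix attached directly to -i) should be accepted by both BSD sed on macOS and GNU sed, whereas the -i '' form from the question is specific to BSD sed.

```shell
# Hypothetical sample file standing in for a real filing.
printf '%s\n' 'keep 1' '<TYPE>XML' 'drop me' '<DOCUMENT>' 'keep 2' > filing_sample.txt

# Edit in place; the untouched original is saved as filing_sample.txt.bak.
sed -i.bak '/<TYPE>XML/,/<DOCUMENT>/d' filing_sample.txt

cat filing_sample.txt
```

Once the result looks right, the .bak file can be deleted.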

For anyone else looking for how to delete text between two patterns, use:

sed '/START_PATTERN/,/END_PATTERN/d' filename.txt
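
Note that the range is inclusive, so the line matching END_PATTERN is deleted as well. If that closing line should stay (for example, if <DOCUMENT> opens the next section you want to keep), one possible variation is to exempt it from the delete. This is only a sketch: sample.txt and its contents are a shortened, made-up version of the sample in the question.

```shell
# Build a small sample file (hypothetical name) mirroring the question.
cat > sample.txt <<'EOF'
#Other things I want to keep
<TYPE>XML
<SEQUENCE>10
<TEXT>
<DOCUMENT>
#Some other text I need to keep
EOF

# Inclusive range delete: removes <TYPE>XML through <DOCUMENT>, both included.
sed '/<TYPE>XML/,/<DOCUMENT>/d' sample.txt > deleted_inclusive.txt

# Variation: inside the range, delete every line except the one matching <DOCUMENT>.
# The ';' before '}' keeps the command acceptable to BSD sed as well as GNU sed.
sed '/<TYPE>XML/,/<DOCUMENT>/{/<DOCUMENT>/!d;}' sample.txt > kept_end.txt
```

The first output file contains only the two "keep" lines; the second also retains the <DOCUMENT> line between them.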

Upvotes: 2
