Reputation: 152
I'm using sed on a macOS computer.
I have a set of very large financial 10-K files and I want to keep only the text.
Right now, I'm trying to remove all of the information between <TYPE>XML and <DOCUMENT>.
Usually there is a lot of information between the two but here is how a sample would look:
#Other things I want to keep
<TYPE>XML
<SEQUENCE>10
<FILENAME>rht-10qq3fy19_htm.xml
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
<XML>
<?xml version="1.0" encoding="utf-8"?>
<xbrl
...
<DOCUMENT>
#Some other text I need to keep
I've been trying to use sed without much success. I can only get it to remove single-line entries like
<TYPE>XML SOME WORDS SOME WORDS <DOCUMENT>
I used this code to get that to work:
sed -i '' 's/<TYPE>XML.*<DOCUMENT>//g' filename.txt
What should I change to get the result I want?
Once I can solve this, the other things I need to clean should also be easier. The solution doesn't have to use sed.
I'm using -i and '' at the beginning of the sed command because I'm on a Mac (BSD) and I'm modifying data in place.
Upvotes: 1
Views: 491
Reputation: 498
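If I've understood the question correctly, this should work for you. It deletes every line from the one matching <TYPE>XML through the next line matching <DOCUMENT>, both included, and prints the result to standard output: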
sed '/<TYPE>XML/,/<DOCUMENT>/d' filename.txt
For anyone else looking for how to delete text between two patterns, use:
sed '/START_PATTERN/,/END_PATTERN/d' filename.txt
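To see what the range delete does on the question's layout, here is a minimal sketch. The file names sample.txt, inclusive.txt and keep_end.txt are made up for illustration, and the sample is a shortened version of the one in the question. The second command is a variant, not part of the answer above, for the case where the closing <DOCUMENT> line should be kept.

```shell
# Build a small sample file (hypothetical name) shaped like the question's example.
printf '%s\n' \
  '#Other things I want to keep' \
  '<TYPE>XML' \
  '<SEQUENCE>10' \
  '<TEXT>' \
  '<DOCUMENT>' \
  '#Some other text I need to keep' > sample.txt

# Range delete: removes the <TYPE>XML line, the <DOCUMENT> line,
# and everything between them.
sed '/<TYPE>XML/,/<DOCUMENT>/d' sample.txt > inclusive.txt

# Variant: within the same range, delete every line except the one
# matching <DOCUMENT>, so the closing marker survives.
sed '/<TYPE>XML/,/<DOCUMENT>/{/<DOCUMENT>/!d;}' sample.txt > keep_end.txt
```

Both commands write to a new file rather than editing in place. To modify the original on macOS, the -i '' option from the question can be added to either one.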
Upvotes: 2