Reputation: 13
I have a large XML file, approximately 2GB in size. To make things interesting, all of the data is on a single line.
I am trying to insert a newline character at the end of specific tags in this file to make it a multiline file which will allow me to split it and do more with it.
root@server:~# sed -i -e 's/\<\/Dummy\>/\<\/Dummy\>\\\n/g' file_name
I've tried sed, vi and joe with no luck. The length of each node in the XML is different so I cannot split the file based on number of characters.
Is there a way for me to make this large single line file into a multiline file via the command line?
Upvotes: 1
Views: 566
Reputation: 388
Try the stream option:
xmllint --stream --format file_name > lintout.xml
Upvotes: 0
Reputation: 52112
I'm blatantly stealing my input from ghoti's answer:
$ cat file_name
<a><b></b><b></b></a><a><c></c></a>
There are a few things wrong with your try, modified to a shorter tag here:
sed -i -e 's/\<\/a\>/\<\/a\>\\\n/g' file_name
No need for -e in this case:
sed -i 's/\<\/a\>/\<\/a\>\\\n/g' file_name
To avoid having to escape /, we can use a different delimiter:
sed -i -e 's|\</a\>|\</a\>\\\n|g' file_name
If you escape < > with \< \>, sed [1] thinks you mean "word boundaries", but in this case, you mean the literal < > and shouldn't escape them:
sed -i -e 's|</a>|</a>\\\n|g' file_name
This already does something:
$ sed -e 's|</a>|</a>\\\n|g' file_name
<a><b></b><b></b></a>\
<a><c></c></a>\
[empty line here]
So if you actually wanted the \ at the end of each line, we're almost there. (If not, you can just replace \\\n by \n.)
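For the plain-newline case, which is what the question seems to be after, a minimal sketch might look like the following. It assumes GNU sed; the sample file and its contents are made up for illustration, with the question's Dummy tag standing in for the real data. It leaves out -i so the original file is not modified.

```shell
# Hypothetical sample file standing in for the real 2GB input.
sample=$(mktemp)
printf '<r><Dummy>1</Dummy><Dummy>2</Dummy></r>\n' > "$sample"

# GNU sed: \n in the replacement is a newline. No -i, so the input
# file is left untouched and the result goes to standard output.
split_output=$(sed 's|</Dummy>|</Dummy>\n|g' "$sample")
printf '%s\n' "$split_output"

rm -f "$sample"
```

Redirecting the output to a new file rather than using -i keeps the original around in case the result is not what you expected.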
Cosmetics: no need to write out everything we've matched in the substitution, we can just use &:
sed -i -e 's|</a>|&\\\n|g' file_name
And finally, if our file happens to end with </a> (which the example input does), we might want to remove the backslash (and newline!) from the end of our output:
$ sed -e 's|</a>|&\\\n|g;s/\\\n$//' file_name
<a><b></b><b></b></a>\
<a><c></c></a>
Of course everything said about manipulating XML with non-XML tools still applies: you shouldn't do it, and if you do it, expect your solution to break easily.
[1] At least GNU sed does, but this is tagged "Linux" and I assume you are using GNU sed.
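The word-boundary behaviour mentioned in the footnote can be checked with a quick experiment. This is a sketch assuming GNU sed; the one-tag input and the replacement text X are made up for the demonstration.

```shell
# Escaped: \< must match at the start of a word, but the next character
# is "/", so the pattern can never match and the input passes through.
escaped=$(printf '</a>\n' | sed 's/\<\/a\>/X/')
printf '%s\n' "$escaped"

# Unescaped: < and > are ordinary characters and match literally.
literal=$(printf '</a>\n' | sed 's|</a>|X|')
printf '%s\n' "$literal"
```

With GNU sed, the first command should print the input unchanged and the second should print X. That would explain why the command in the question ran without error but changed nothing.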
Upvotes: 0
Reputation: 46826
I think I would actually do this with gawk rather than sed.
You haven't included input data, so I'll make some up.
$ printf '<a><b></b><b></b></a><a><c></c></a>' | gawk -vRS='</a>' '{print $0 RS}'
<a><b></b><b></b></a>
<a><c></c></a>
Normally, awk (or gawk) will consider each line to be a unique record, with each line split into fields delimited by whitespace.
If instead you split records by some XML tag, you can rely on the fact that print will append a newline as the default ORS (output record separator) after printing each "input record".
Unlike a sed solution which will attempt to read one entire "record" (line) into memory in order to perform actions on it, I suspect that this solution would step through your file only using enough memory to "remember" the space between record separators. (This addresses the "large file" concern.)
Three other things to note.
First, a record separator is NOT a concept native to XML, so any solution using sed, awk, or anything that does not natively interpret XML is a hack. You will always get better results using tools which natively support your data format.
Second, since in my example I've specified a record separator that is the close of an XML tag, the input data could be thought to have THREE RECORDS, the third of which is null. If you have a newline after your final "record separator", that third record may be terminated with yet another RS in your output. Be warned. This is the result of thing #1.
Third, this is a gawk solution, not an awk solution, because other awk implementations generally do not support multiple characters as record separators.
YMMV. This is not a great solution, but it may be sufficient for your needs.
Upvotes: 1
Reputation: 2031
What you can do is reformat the file into indented XML with xmllint:
xmllint --format pathtofile.xml
and then pipe that to sed.
Upvotes: 2