bhowmik

Reputation: 13

Search and replace in a large single-line file (~2GB) in Linux

I have a large XML file which is approximately 2GB in size. To make things interesting the entire data is in a single line.

I am trying to insert a newline character at the end of specific tags in this file to make it a multiline file which will allow me to split it and do more with it.

root@server:~# sed -i -e 's/\<\/Dummy\>/\<\/Dummy\>\\\n/g' file_name

I've tried sed, vi and joe with no luck. The length of each node in the XML is different so I cannot split the file based on number of characters.

Is there a way for me to make this large single line file into a multiline file via the command line?

Upvotes: 1

Views: 566

Answers (4)

D. Lohrsträter

Reputation: 388

Try the --stream option:

xmllint --stream --format file_name > lintout.xml

Upvotes: 0

Benjamin W.

Reputation: 52112

I'm blatantly stealing my input from ghoti's answer:

$ cat file_name
<a><b></b><b></b></a><a><c></c></a>

There are a few things wrong with your try, modified to a shorter tag here:

sed -i -e 's/\<\/a\>/\<\/a\>\\\n/g' file_name
  • No need for -e in this case:

    sed -i 's/\<\/a\>/\<\/a\>\\\n/g' file_name
    
  • To avoid having to escape /, we can use a different delimiter:

    sed -i -e 's|\</a\>|\</a\>\\\n|g' file_name
    
  • If you escape < > with \< \>, sed¹ thinks you mean "word boundaries", but in this case, you mean the literal < > and shouldn't escape them:

    sed -i -e 's|</a>|</a>\\\n|g' file_name
    

    This already does something:

    $ sed -e 's|</a>|</a>\\\n|g' file_name
    <a><b></b><b></b></a>\
    <a><c></c></a>\
    [empty line here]
    

So if you actually wanted the \ at the end of each line, we're almost there. (If not, you can just replace \\\n by \n.)

  • Cosmetics: no need to write out everything we've matched in the substitution, we can just use &:

    sed -i -e 's|</a>|&\\\n|g' file_name
    
  • And finally, if our file happens to end with </a> (which the example input does), we might want to remove the backslash (and newline!) from the end of our output:

    $ sed -e 's|</a>|&\\\n|g;s/\\\n$//' file_name
    <a><b></b><b></b></a>\
    <a><c></c></a>
    
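If the trailing backslash is not wanted, the steps above could be combined roughly as follows. This is only a sketch: it assumes GNU sed (for \n in the replacement), reuses the sample input from this answer, and writes to a new file instead of editing in place. The names workdir, sample.xml and sample_multiline.xml are made up for illustration.

```shell
# Scratch directory and file names are hypothetical, chosen for this demo.
workdir=$(mktemp -d)

# Single-line sample input, same content as in the answer above.
printf '<a><b></b><b></b></a><a><c></c></a>\n' > "$workdir/sample.xml"

# 1st command: append a newline after every closing </a> (& is the match).
# 2nd command: drop the newline left dangling at the end of the pattern
#              space, so the output does not end with an empty line.
sed 's|</a>|&\n|g; s/\n$//' "$workdir/sample.xml" > "$workdir/sample_multiline.xml"

cat "$workdir/sample_multiline.xml"
```

Writing to a second file keeps the original intact while experimenting. sed still has to hold the whole input line in memory, so this does not by itself address the size of the real file.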

Of course everything said about manipulating XML with non-XML tools still applies: you shouldn't do it, and if you do it, expect your solution to break easily.


¹ At least GNU sed does, but this is tagged "Linux" and I assume you are using GNU sed.
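A quick way to check the word-boundary behaviour on a given system is to run both forms against a tiny input. This is a sketch that assumes GNU sed; the replacement text X is arbitrary.

```shell
# Escaped: \< and \> are word-boundary anchors in GNU sed, so this pattern
# cannot match the literal text </a> and the input comes back unchanged.
printf '<a></a>\n' | sed 's|\</a\>|X|g'

# Unescaped: < and > are ordinary characters, so the closing tag is replaced.
printf '<a></a>\n' | sed 's|</a>|X|g'
```

If the first command prints the input unmodified and the second prints it with the closing tag replaced, the escaping was the reason the original command appeared to do nothing.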

Upvotes: 0

ghoti

Reputation: 46826

I think I would actually do this with gawk rather than sed.

You haven't included input data, so I'll make some up.

$ printf '<a><b></b><b></b></a><a><c></c></a>' | gawk -vRS='</a>' '{print $0 RS}'
<a><b></b><b></b></a>
<a><c></c></a>

Normally, awk (or gawk) will consider each line to be a unique record, with each line split into fields delimited by whitespace.

If instead you split records by some XML tag, you can rely on the fact that print will append a newline as the default ORS (output record separator) after printing each "input record".

Unlike a sed solution which will attempt to read one entire "record" (line) into memory in order to perform actions on it, I suspect that this solution would step through your file only using enough memory to "remember" the space between record separators. (This addresses the "large file" concern.)

Three other things to note.

First, a record separator is NOT a concept native to XML, so any solution using sed, awk, or anything that does not natively interpret XML is a hack. You will always get better results using tools which natively support your data format.

Second, since in my example I've specified a record separator that is the close of an XML tag, the input data could be thought to have THREE RECORDS, the third of which is null. If you have a newline after your final "record separator", that third record may be terminated with yet another RS in your output. Be warned. This is the result of thing #1.

Third, this is a gawk solution, not an awk solution, because other awk implementations generally do not support multiple characters as record separators.
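One possible way to deal with the stray final record from the second note is gawk's RT variable, which holds the text that actually matched RS for the current record and is empty for a last record that was not followed by a separator. The following is a sketch only, assuming gawk is installed; the function name split_after_tag is made up for illustration.

```shell
# split_after_tag is a hypothetical helper: it reads stdin and prints one
# record per line, with the separator re-attached to each record.
# $1 is the closing tag; gawk treats a multi-character RS as a regex.
split_after_tag() {
  gawk -v RS="$1" '
    # Record was terminated by the separator: print it with the separator.
    RT != "" { print $0 RT; next }
    # Leftover after the last separator: strip one trailing newline and
    # print it only if something remains.
    { sub(/\n$/, ""); if ($0 != "") print }
  '
}

printf '<a><b></b><b></b></a><a><c></c></a>\n' | split_after_tag '</a>'
```

With this variant a newline after the final closing tag no longer produces an extra separator or blank line, while any real content after the last separator (a closing root element, say) is still printed.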

YMMV. This is not a great solution, but it may be sufficient for your needs.

Upvotes: 1

tumdum

Reputation: 2031

What you can do is format the file into canonical XML with xmllint (xmllint --format pathtofile.xml) and then pipe the result to sed.

Upvotes: 2
