Jason

Reputation: 373

Out of memory processing large files with Perl, Sed, AWK

I'm extracting the content between XML tags using the following:

perl -lne 'BEGIN{undef $/} while (/<tagname>(.*?)<\/tagname>/sg){print $1}' input.txt > output.txt

Unfortunately I'm getting out-of-memory errors. I know I can split the file, process each part, and then concatenate the results, but I wondered if there was another way, be it a modification to the above or using the likes of awk or sed?

The input.txt file size varies between 17GB and 70GB.

EDIT:

The input file can be any XML file; a point to note is that it contains no newlines, e.g.:

<body><a></a><b></b><c></c></body><foo></foo><bar><z></z></bar>

Upvotes: 1

Views: 2573

Answers (6)

higuita

Reputation: 2315

You can also use awk to break up a big one-line file. Sed will blow up with an out-of-memory error as it tries to load the full line, but in awk (as in Perl) you can define what you want to use as the "line break", bypassing the problem.

For Perl you already have an example above; here is the awk one:

cat big-one-line-file |  awk 'BEGIN { RS=">" } ; {print $0">"}'

Please note that at the end of the file, one extra > will show up if the file doesn't end with a ">". You can remove it any way you like (for example with a post-cleaning sed: sed '$ s/>$//') or tune the script.

As I also had this problem, and to help others, I will add more examples to help with testing.

You can test the script using dd to extract a small part of the file and try bigger record separators, like words or tags. Example:

dd if=big-one-line-file.xml bs=8192 count=10  | awk ' BEGIN { RS="<tag 123>" } ; NR>1 {print "<tag 123>"$0}  ; NR==1 {print $0}  ' 

This extracts the first 80 kB of big-one-line-file.xml and breaks the file on "<tag 123>". To avoid the extra (and wrong) "<tag 123>" at the start of the file, the first record is treated differently (i.e. it is not touched).

Use the dd option skip={# of blocks to reach near the end of the file} to extract the end of the file instead of the top (tail will fail because it's all just one line). I used skip=100000000 and started removing zeros until something showed up, then tuned the block number.
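For example, to sample from near the end of the file instead (the skip value here is only a guess; tune it as described above):

dd if=big-one-line-file.xml bs=8192 skip=100000 count=10 | awk 'BEGIN { RS="<tag 123>" } ; NR>1 {print "<tag 123>"$0} ; NR==1 {print $0}'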

Upvotes: 1

mirod

Reputation: 16171

It is not clear whether your input file is well-formed XML or not. The example you give is not XML (no root element). If the data is XML though, you can use xml_grep, a tool that comes with XML::Twig:

xml_grep -r tagname --text_only mybig.xml

This will work on files of any size, provided each matched element can fit in memory.

If this is too slow, you can probably gain some speed by using XML::Parser directly; the code would not be really complicated to write. It's easier not to have to write it though ;--)
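For the curious, here is a rough, untested sketch of that stream-parsing idea with XML::Parser; the element name tagname and the file name are assumptions, not part of the answer above:

#!/usr/bin/perl
use warnings;
use strict;
use XML::Parser;

my $depth = 0;     # how many <tagname> elements we are currently inside
my $text  = '';    # accumulated character data for the current element

my $parser = XML::Parser->new(Handlers => {
    Start => sub {
        my (undef, $element) = @_;
        $depth++ if $element eq 'tagname';
    },
    Char => sub {
        my (undef, $string) = @_;
        $text .= $string if $depth;
    },
    End => sub {
        my (undef, $element) = @_;
        if ($element eq 'tagname' and --$depth == 0) {
            print "$text\n";
            $text = '';
        }
    },
});

$parser->parsefile('mybig.xml');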

Upvotes: 0

TLP

Reputation: 67910

In order to read smaller sized chunks from your file, you can set your input record separator to the closing tag:

BEGIN { $/ = "</tagname>"; }

Here's an example:

Code:

perl -lnwe 'BEGIN { $/ = "</tagname>"; } print;'

Input:

<tagname>foo</tagname><tagname>bar</tagname><tagname>baz</tagname><tagname>baf</tagname>

Output:

<tagname>foo
<tagname>bar
<tagname>baz
<tagname>baf

You'll note that the closing tag is missing, and that is because the -l option that you use also includes a chomp, which removes the input record separator. If you do not want this behaviour, simply remove the -l option and insert a newline in your print statement.
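For example, something like this (the same one-liner without -l, as a sketch) keeps the closing tag:

perl -nwe 'BEGIN { $/ = "</tagname>"; } print "$_\n";'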

Note:

I would say this is somewhat of a hack, but it does match what you are already using, namely case-sensitive matching of exact tags.

What you can do to compensate is use your regex inside of this:

perl -lnwe 'BEGIN { $/ = "</tagname>"; } 
    while (/<tagname>(.*?)<\/tagname>/sg) { print $1 }' input.txt > output.txt

Or, possibly, use an XML parser to parse the chunk.

If the XML parser suggested by others does not work for such huge files, this can be a way to read smaller chunks of data without risking cutting tags in half.

Upvotes: 3

choroba

Reputation: 242038

Parsing huge files should be possible with a pull-parser like XML::LibXML::Reader. Here is an example:

#!/usr/bin/perl
use warnings;
use strict;

use XML::LibXML::Reader;

my $reader = XML::LibXML::Reader->new(location => 'input.txt') or die;

while ($reader->read) {
    if ($reader->nodePath =~ m{/tagname$}                    # We are at <tagname> or </tagname>.
        and $reader->nodeType == XML_READER_TYPE_ELEMENT) {  # Only the start tag is interesting.
        print $reader->readInnerXml;
    }
}

Upvotes: 3

Stephane Rouberol

Reputation: 4384

I would apply a filter to your input file to introduce newlines, maybe after each </tagname>. Then you will be able to get rid of BEGIN{undef $/} in your perl command and avoid memory problems by dealing with "reasonable" records.
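For instance, a rough, untested sketch of that filtering idea (assuming <tagname> is the element of interest); the first perl reads </tagname>-terminated records, so it never holds the whole file in memory:

perl -pe 'BEGIN { $/ = "</tagname>" } $_ .= "\n"' input.txt | perl -lne 'while (/<tagname>(.*?)<\/tagname>/sg) { print $1 }' > output.txt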

Upvotes: 0

Oleg V. Volkov

Reputation: 22461

This one-liner reads the entire file into memory as one gigantic "line". Of course you'll have memory problems when stuffing 17GB and more into it! Read and process the file line by line, or use read to get chunks of a suitable size instead.

In this case, search for <tagname>, note its position in the line and search for the closing tag starting from there. If you don't find it, stuff the current line/chunk into a buffer and repeat until you've found it on some other line further into the file. When found, print out this buffer and empty it. Repeat until the end of the file.

Note that if you use arbitrarily sized chunks, you'll have to account for the possibility of a tag being split across the boundary, by cutting the incomplete tag from the end of the chunk and stuffing it into the "to process" buffer.
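A rough, untested sketch of that chunked approach (the chunk size, file names and tag name are assumptions, not part of the answer above):

#!/usr/bin/perl
use warnings;
use strict;

open my $in,  '<', 'input.txt'  or die $!;
open my $out, '>', 'output.txt' or die $!;

my $buf = '';
while (read($in, my $chunk, 8 * 1024 * 1024)) {    # 8 MB at a time (assumed size)
    $buf .= $chunk;
    # Print every complete <tagname>...</tagname> pair collected so far.
    while ($buf =~ s/.*?<tagname>(.*?)<\/tagname>//s) {
        print {$out} "$1\n";
    }
    # Keep only the tail that may still hold an open element or a tag split
    # across the chunk boundary.
    my $start = rindex($buf, '<tagname>');
    if ($start >= 0) {
        $buf = substr($buf, $start);
    } elsif (length($buf) > 8) {
        $buf = substr($buf, -8);    # enough bytes to catch a partial "<tagname"
    }
}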

Upvotes: 3
