Reputation: 311
I have an awk-based splitter that splits a huge file based on a regex. The problem is that I am getting a "makes too many files" error, even though I have a conditional close. If you could help me figure out what I am doing wrong, I would be very grateful.
awk 'BEGIN { system("mkdir -p splitted/sub"++j) }
/<doc/{x="F"++i".xml";}{
if (i%5==0 ){
++i;
close("splitted/sub"j"/"x);
system("mkdir -p splitted/sub"++j"/");
}
else{
print > ("splitted/sub"j"/"x);
}
}' wiki_parsed.xml
Upvotes: 1
Views: 3686
Reputation: 311
This is what I ended up with; it works perfectly:
awk 'BEGIN { system("mkdir -p splitted/sub"++j) }
/<doc/{x="F"++i".xml";}{
if (i%1995==0 ){
++i;
system("mkdir -p splitted/sub"++j"/");
}
else{
print >> ("splitted/sub"j"/"x);
close("splitted/sub"j"/"x);
}
}' wiki_parsed.xml
Upvotes: 1
Reputation: 2514
The simple answer is that close isn't being called often enough. Here's an illustrative example of why:
Using an input file like:
<doc somestuff
another line
yet another line
<doc the second
still more data
<doc the third
<doc the fourth
<doc the fifth
I can make an executable awk file based on your script like:
#!/usr/bin/awk -f
BEGIN { system_(++j) }
/<doc/{x=++i}
{
if (i%5==0 ){ ++i; close_(j"/"x); system_(++j) }
else{ open_(j"/"x) }
}
function call_f(funcname, arg) { print funcname"("arg")" }
function system_(cnt) { call_f( "system", cnt ) }
function open_(f) { if( !(f in a) ) { call_f( "open", f ); a[f]++ } }
function close_(f) { call_f( "close", f ) }
If I put that into an executable file called awko, it can be run as awko data to produce the following:
system(1)
open(1/1)
open(1/2)
open(1/3)
open(1/4)
close(1/5)
system(2)
The script just shows how many times you're calling each function, by shadowing each real call with a local function whose name has a trailing _. Notice how many times open() is printed compared to close() for the same arguments: four files are opened, and the only close() is on 1/5, which was never opened. Also, I renamed print > to open_ just to illustrate that it's what opens the files (once per file name).
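To make the "once per file name" point concrete, here is a small sketch of how print > behaves with and without close(). The file names and the three-line input are made up for illustration; the behaviour shown is standard awk redirection semantics.

```shell
tmp=$(mktemp -d)

# Case 1: never closed. The file is opened (and truncated) once, then reused.
printf 'a\nb\nc\n' | awk -v out="$tmp/kept_open.txt" '{ print > out }'

# Case 2: closed after every line. Each print reopens and truncates the file.
printf 'a\nb\nc\n' | awk -v out="$tmp/reopened.txt" '{ print > out; close(out) }'

# Case 3: closed after every line, but appending with >> instead.
printf 'a\nb\nc\n' | awk -v out="$tmp/appended.txt" '{ print >> out; close(out) }'

cat "$tmp/kept_open.txt"   # a, b, c
cat "$tmp/reopened.txt"    # c only
cat "$tmp/appended.txt"    # a, b, c
```

Every file name that is printed to and never closed stays open until awk exits, which is how a splitter runs into the open-file limit. Case 2 also suggests why a script that closes after every line would need >> rather than >.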
If I change the executable awk file to the following, you can see close() being called often enough:
#!/usr/bin/awk -f
BEGIN { system_(++j) }
/<doc/{ close_(j"/"x); x=++i } # close_() call is moved to here.
{
if (i%5==0 ){ ++i; system_(++j) }
else{ open_(j"/"x) }
}
function call_f(funcname, arg) { print funcname"("arg")" }
function system_(cnt) { call_f( "system", cnt ) }
function open_(f) { if( !(f in a) ) { call_f( "open", f ); a[f]++ } }
function close_(f) { call_f( "close", f ) }
which gives the following output:
system(1)
close(1/)
open(1/1)
close(1/1)
open(1/2)
close(1/2)
open(1/3)
close(1/3)
open(1/4)
close(1/4)
system(2)
Here close() is called once more than strictly necessary: the first call is on a file that was never opened. With a real close() call that is harmless, because closing a file that has never been printed to is simply ignored and no actual close is attempted. In every other case, each open() is matched by a close() call.
Moving the close() call in your script, as in the second example script, should fix your error.
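Applied to the real splitter, that could look roughly like the sketch below. It is my reading of the intent rather than a drop-in replacement: the per_dir variable, the sample.xml input, and the scratch directory are assumptions made so the example can be run on its own. It also writes every line, whereas in the original the <doc line that triggers the directory switch is never printed and one file number is skipped.

```shell
# Work in a scratch directory so the sketch does not touch real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample input: seven tiny documents of two lines each.
for n in 1 2 3 4 5 6 7; do
    printf '<doc id="%s">\nbody of %s\n' "$n" "$n"
done > sample.xml

awk -v per_dir=5 '
BEGIN { dir = "splitted/sub" (++j); system("mkdir -p " dir) }
/<doc/ {
    if (out != "") close(out)          # finish the previous document first
    if (i > 0 && i % per_dir == 0) {   # directory is full: start the next one
        dir = "splitted/sub" (++j)
        system("mkdir -p " dir)
    }
    out = dir "/F" (++i) ".xml"
}
out != "" { print > out }              # at most one output file is open
' sample.xml

ls splitted/sub1 splitted/sub2
```

Because close() runs once per document instead of once per line, plain > is enough here, and only one output file is open at any time.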
Upvotes: 4