Reputation: 39
I have a shell script for splitting XML files, but there are one million XML files in the customer environment and the script runs slowly. Can it run in multithreaded mode?
Thanks!
My shell script:
#!/bin/bash
File=/home/spark/PktLog
count=0
startLine=(`sed -n -e '/?xml version="1.0" encoding/=' $File`)
fileEnd=`sed -n '$=' $File`
endLine=(`echo ${startLine[*]} | awk -v a=$fileEnd '{for(i=2;i<=NF;i++) printf("%d ",$i-1);print a}'`)
let maxIndex=${#startLine[@]}-1
for n in `seq 0 $maxIndex`
do
sed -n "${startLine[$n]},${endLine[$n]}p" $File >result_${n}.xml
done
echo "${startLine[@]}"
Upvotes: 0
Views: 30
Reputation: 123680
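Your method is very slow because it reads the input file many times: each sed call inside the loop scans the whole file again, so the file is read once per output document.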
Instead of trying to make it faster with multithreading, you should rewrite the script to only read the input file one time.
Here is an example input file:
$ cat testfile
<?xml version="1.0" encoding="UTF-8"?>
<test>
<some data />
</test>
<?xml version="1.0" encoding="UTF-8"?>
<test>
<more />
<data />
</test>
<?xml version="1.0" encoding="UTF-8"?>
<test>
<more type="data" />
</test>
Here is an awk command that reads the file one time and writes each document to a separate file:
$ awk 'BEGIN { file="/dev/null"; n=0; }
/xml version="1.0" encoding/ {
close(file);
file="file" ++n ".xml";
}
{print > file;}' testfile
Here is the result:
$ cat file1.xml
<?xml version="1.0" encoding="UTF-8"?>
<test>
<some data />
</test>
$ cat file2.xml
<?xml version="1.0" encoding="UTF-8"?>
<test>
<more />
<data />
</test>
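If you want to reuse the command above on other files, it could be wrapped in a small shell function. This is only a sketch: the function name split_xml, the prefix argument and its default of "file" are my own choices for illustration, and it matches on the start of the XML declaration rather than the exact pattern used above.

```shell
#!/bin/sh
# split_xml INPUT [PREFIX]
# Hypothetical helper (not part of the original script): writes PREFIX1.xml,
# PREFIX2.xml, ... into the current directory, one file per XML declaration
# found in INPUT. The input is read exactly once.
split_xml() {
    input=$1
    prefix=${2:-file}   # assumed default, matching the file names shown above
    awk -v prefix="$prefix" '
        BEGIN { out = "/dev/null" }   # lines before the first declaration are discarded
        /<\?xml version=/ {           # a new document starts on this line
            close(out)                # close the previous output file
            n++
            out = prefix n ".xml"
        }
        { print > out }               # append the current line to the current document
    ' "$input"
}
```

For the script in the question, a call such as split_xml /home/spark/PktLog result_ would produce result_1.xml, result_2.xml and so on. Note that the numbering starts at 1 here, while the original loop starts at 0.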
This is much faster:
$ grep -c 'xml version' PktLog
3000
$ time ./yourscript
real 0m9.791s
user 0m6.849s
sys 0m2.660s
$ time ./thisscript
real 0m0.248s
user 0m0.130s
sys 0m0.107s
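To try a comparison like this on your own machine, you could generate a test input of similar size first. This is a rough sketch: the function name make_test_log and the contents of the generated documents are made up for illustration, and your timings will differ from the ones above.

```shell
#!/bin/sh
# make_test_log COUNT OUTPUT
# Hypothetical helper: writes COUNT small XML documents, back to back,
# into OUTPUT. The document contents are invented test data.
make_test_log() {
    count=$1
    output=$2
    awk -v count="$count" 'BEGIN {
        for (i = 1; i <= count; i++) {
            print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            print "<test>"
            print "<data id=\"" i "\" />"
            print "</test>"
        }
    }' > "$output"
}
```

For example, make_test_log 3000 PktLog creates a file with 3000 documents, which you can then feed to both scripts under time.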
Upvotes: 1