Gang Yan

Reputation: 39

Running a shell script with multithreading

I have a shell script that splits XML files, but there are one million XML files in the customer environment and the script runs slowly. Could it run in multithreaded mode?

Thanks!

My shell script:

#!/bin/bash
File=/home/spark/PktLog
count=0
startLine=(`sed -n -e '/?xml version="1.0" encoding/=' $File`)
fileEnd=`sed -n '$=' $File`
endLine=(`echo ${startLine[*]} | awk -v a=$fileEnd '{for(i=2;i<=NF;i++) printf("%d ",$i-1);print a}'`)

let maxIndex=${#startLine[@]}-1

for n in `seq 0 $maxIndex`

do
    sed -n "${startLine[$n]},${endLine[$n]}p" $File >result_${n}.xml
done

echo "${startLine[@]}"

Upvotes: 0

Views: 30

Answers (1)

that other guy

Reputation: 123680

Your method is very slow because it reads the input file many times: each sed call in the loop scans the whole file again, once per output document.

Instead of trying to make it faster with multithreading, you should rewrite the script to only read the input file one time.

Here is an example input file:

$ cat testfile
<?xml version="1.0" encoding="UTF-8"?>
<test>
  <some data />
</test>
<?xml version="1.0" encoding="UTF-8"?>
<test>
  <more />
  <data />
</test>
<?xml version="1.0" encoding="UTF-8"?>
<test>
  <more type="data" />
</test>

Here is an awk command that reads the file one time, and writes each document to a separate file:

$ awk 'BEGIN { file="/dev/null"; n=0; }
       /xml version="1.0" encoding/ {
          close(file); 
          file="file" ++n ".xml"; 
       }
       {print > file;}' testfile

Here is the result:

$ cat file1.xml
<?xml version="1.0" encoding="UTF-8"?>
<test>
  <some data />
</test>

$ cat file2.xml
<?xml version="1.0" encoding="UTF-8"?>
<test>
  <more />
  <data />
</test>
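
If you want to convince yourself that nothing is lost in the split, one option is to glue the pieces back together in numeric order and compare them with the input. This is only a sketch: it recreates the sample testfile in a temporary directory, and the names workdir and rebuilt are made up for the illustration.

```shell
#!/bin/sh
# Sanity check (sketch): split a small sample, then confirm that the pieces,
# concatenated in numeric order, are byte-identical to the input.
# workdir and rebuilt are hypothetical names used only for this demo.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

cat > testfile <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<test>
  <some data />
</test>
<?xml version="1.0" encoding="UTF-8"?>
<test>
  <more />
  <data />
</test>
<?xml version="1.0" encoding="UTF-8"?>
<test>
  <more type="data" />
</test>
EOF

# Same single-pass split as above.
awk 'BEGIN { file="/dev/null"; n=0; }
     /xml version="1.0" encoding/ {
        close(file);
        file="file" ++n ".xml";
     }
     {print > file;}' testfile

# Count the documents, then rebuild the input from file1.xml, file2.xml, ...
# A counting loop is used instead of a glob so that file10.xml sorts after file9.xml.
count=$(grep -c 'xml version="1.0" encoding' testfile)
: > rebuilt
i=1
while [ "$i" -le "$count" ]; do
    cat "file$i.xml" >> rebuilt
    i=$((i + 1))
done

if cmp -s testfile rebuilt; then
    status="identical"
else
    status="different"
fi
echo "$count documents, rebuilt input is $status"
```

On the sample above this should report 3 documents and an identical rebuild.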

This is much faster:

$ grep -c 'xml version' PktLog
3000

$ time ./yourscript    
real    0m9.791s
user    0m6.849s
sys     0m2.660s

$ time ./thisscript
real    0m0.248s
user    0m0.130s
sys     0m0.107s
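
With a million output files you may also want them in their own directory with zero-padded names, so they list in document order. Here is one way that could look, as a rough sketch: the split_xml function, the doc_ prefix and the 7-digit padding are my own assumptions, not something from your script.

```shell
#!/bin/sh
# split_xml INPUT OUTDIR (hypothetical helper, not part of the original script)
# Splits INPUT in a single pass and writes OUTDIR/doc_0000001.xml, doc_0000002.xml, ...
# Prints the number of documents written.
split_xml() {
    input=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    # Note: awk -v interprets backslash escapes, so OUTDIR should not contain backslashes.
    awk -v dir="$outdir" '
        BEGIN { file = "/dev/null"; n = 0 }
        /xml version="1.0" encoding/ {
            close(file)                                   # keep only one output file open
            file = sprintf("%s/doc_%07d.xml", dir, ++n)   # zero-padded, sortable name
        }
        { print > file }
        END { print n }
    ' "$input"
}

# Small demo on a two-document sample in a temporary directory.
workdir=$(mktemp -d) || exit 1
printf '%s\n' \
    '<?xml version="1.0" encoding="UTF-8"?>' '<a/>' \
    '<?xml version="1.0" encoding="UTF-8"?>' '<b/>' > "$workdir/sample"

written=$(split_xml "$workdir/sample" "$workdir/out")
echo "wrote $written files"
ls "$workdir/out"
```

The close(file) call matters at this scale: without it awk would keep every output file open and eventually run out of file descriptors.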

Upvotes: 1
