Reputation: 21
I have a fixed-width file as follows:
aaaaa003aaaaaaaaaaaaaaa
bbbbb002aaaaaaaaaa
ccccc004cccccccccccccccccccc
I need to get it in the form
aaaaa003aaaaa
aaaaa003aaaaa
aaaaa003aaaaa
bbbbb002aaaaa
bbbbb002aaaaa
ccccc004ccccc
ccccc004ccccc
ccccc004ccccc
ccccc004ccccc
My current script is inefficient for 11 million lines. How can I optimise it?
#!/bin/sh
# My first Script
echo "Unbulking"
IN=$1
OUT=$2
while IFS= read -r line;do
HEAD=${line:0:8}
BODY=$(echo $line | sed -r 's/.{8}//')
BODYVAR=$(echo $BODY |fold -w 5)
for i in ${BODYVAR}
do
echo $HEAD$i >> $OUT
done
done < $IN
echo "Completed"
My logic needs to be along the lines:
#take the first 8 characters of a line and assign to str1
#take the last 3 characters of str1, cast to an integer and assign to num1
#multiply num1 by 5 and assign to num2
#return the substring from char 8 to num2 and assign to str2
#cut str2 into chunks of 5 and assign to an array arr1
#concatenate str1 with each element of arr1
#return the arr1 as a set of new lines
#repeat for every line in the file
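The steps above can be sketched in a single awk call roughly as follows. This is only a sketch under stated assumptions: it assumes the 3 digits in columns 6-8 give the number of 5-character chunks that follow the header, and the function name `unbulk` is made up for illustration. It uses only `substr()`, so it should not depend on any particular awk implementation.

```shell
#!/bin/sh
# Sketch only: assumes columns 1-8 are the header and columns 6-8 hold
# the number of 5-character chunks that follow. "unbulk" is a
# hypothetical name, not something from the original script.
unbulk() {
    awk '{
        head  = substr($0, 1, 8)          # str1: first 8 characters
        count = substr(head, 6, 3) + 0    # num1: last 3 characters as a number
        body  = substr($0, 9, count * 5)  # str2: count * 5 characters after the header
        for (i = 0; i < count; i++) {
            chunk = substr(body, i * 5 + 1, 5)
            if (chunk == "") break        # line is shorter than the count claims
            print head chunk              # str1 joined with each chunk
        }
    }' "$@"
}

# Sample input from above, fed through stdin
printf '%s\n' 'aaaaa003aaaaaaaaaaaaaaa' 'bbbbb002aaaaaaaaaa' 'ccccc004cccccccccccccccccccc' | unbulk
```

Called as `unbulk "$IN" > "$OUT"`, the function reads the whole file in one process instead of starting `sed` and `fold` once per line.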
Upvotes: 2
Views: 76
Reputation: 50775
Your entire script can be translated into gawk like this:
gawk 'BEGIN {
FPAT=".{1,5}"
OFS=""
}
{ head = substr($0,1,8)
$0 = substr($0,9)
for (i=1; i<=NF; i++)
print head, $i
}' file
Upvotes: 1
Reputation: 203712
Don't try to manipulate text with a shell loop. The extreme slowness you've already noticed is just one of the issues you'll have; see why-is-using-a-shell-loop-to-process-text-considered-bad-practice for that one. For some of the other issues in the script you posted, see https://mywiki.wooledge.org/Quotes, https://mywiki.wooledge.org/DontReadLinesWithFor, and "Correct Bash and shell script variable capitalization".
Using any awk in any shell on every UNIX box:
$ cat tst.awk
{
head = substr($0,1,8)
tail = substr($0,9)
while ( tail != "" ) {
print head substr(tail,1,5)
tail = substr(tail,6)
}
}
$ awk -f tst.awk file
aaaaa003aaaaa
aaaaa003aaaaa
aaaaa003aaaaa
bbbbb002aaaaa
bbbbb002aaaaa
ccccc004ccccc
ccccc004ccccc
ccccc004ccccc
ccccc004ccccc
Upvotes: 2
Reputation: 37414
One for GNU awk. It splits the record on the string of digits and prints the part before the digits (a[1]), the digits themselves (seps[1]), and the part after them (a[2]) in 5-character chunks:
$ gawk '{
split($0,a,/[0-9]+/,seps)
while(length(a[2])) {
print a[1] seps[1] substr(a[2],1,5)
a[2]=substr(a[2],6)
}
}' file
Output:
aaaaa003aaaaa
aaaaa003aaaaa
aaaaa003aaaaa
bbbbb002aaaaa
bbbbb002aaaaa
ccccc004ccccc
ccccc004ccccc
ccccc004ccccc
ccccc004ccccc
GNU awk only, as it uses the fourth parameter of split(), seps.
Update: Another version:
$ awk '{
while(p=substr($0,9,5)) {
print substr($0,1,8) p
$0=substr($0,1,8) substr($0,14)
}
}' file
Upvotes: 0