Reputation: 17
I'm busy with a school project. I use tesseract to extract numbers from a picture. The output I get can look something like this:
7586630342033088866
What I need is to extract every 4-digit number beginning with 63 or 62.
So in this example it should be 6303. If I get a longer number like:
7586630342033088866234
the output should be 6303 6234
I would like to do this in a terminal script, since I download my pictures, pre-process them and run tesseract with a single script in the terminal.
I tried some things with sed and awk, but with no success.
Here is the end of the script I'm already using:
printf '\n run tesseract\n'
cd /media/nummer/tramnummerNummer
x=0 # counter, starts at 0
keyword='tramnummer' # base name used when renaming the files
extention='*.JPG' # glob for the files to process
for i in $extention # loop over the files matching the glob
do
x=$((x + 1)) # increase counter
tesseract "tramnummer$x.JPG" "tramnummer$x" -l bet -psm 6 # run tesseract; it writes tramnummer$x.txt
tr -d '[:space:]' < "tramnummer$x.txt" > "tramnummer$x" # remove whitespace from the file tesseract generated
# sed 's/\(.\)/\1\n/g' -i tramnummer$x # something I tried: it puts every digit on a separate line
# sed 's/[^6]*\(6.*\)/\1/' -i tramnummer$x # another thing I tried: it deletes every character before the first 6
done
Can anyone help me with this or put me on the right track? Thanks in advance.
Upvotes: 1
Views: 65
Reputation: 67301
Use this (it prints all matches on one line, separated by spaces):
s='7586630342033088866234'
echo "$s" | perl -lne 'push @a,/6[23]\d\d/g;print "@a";undef @a'
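If Perl is not available, the same scan can be sketched in the shell itself. This is only a sketch: it assumes bash (it relies on `[[ =~ ]]` and `BASH_REMATCH`, which plain sh does not have), and `extract_codes` is a hypothetical helper name, not something from the question.

```shell
#!/usr/bin/env bash
# extract_codes is a hypothetical helper name used for illustration.
# It prints every 4-digit group starting with 62 or 63, space separated.
extract_codes() {
    local rest=$1
    local re='6[23][0-9]{2}'   # 62 or 63 followed by two digits
    local out=()
    # Take the leftmost match, then keep searching in the text after it.
    while [[ $rest =~ $re ]]; do
        out+=("${BASH_REMATCH[0]}")
        rest=${rest#*"${BASH_REMATCH[0]}"}
    done
    echo "${out[*]}"
}

extract_codes '7586630342033088866234'
```

Like the Perl version, this continues after the end of each match, so groups that overlap an earlier match are not reported.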
Upvotes: 0
Reputation: 785701
Using egrep -o:
s='7586630342033088866234'
echo "$s" | egrep -o '6[23][0-9]{2}'
6303
6234
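To get the matches on a single line, as asked in the question, the output can be joined afterwards. A minimal sketch, assuming `grep -E` (the current spelling of `egrep`, which recent GNU grep releases flag as obsolescent) and the standard `paste` utility; the variable name `codes` is only for illustration:

```shell
s='7586630342033088866234'
# grep -o prints each match on its own line;
# paste -s joins those lines, using a space as the delimiter.
codes=$(printf '%s\n' "$s" | grep -Eo '6[23][0-9]{2}' | paste -sd ' ' -)
echo "$codes"
```

Inside the loop from the question, the same pipeline could read the cleaned file instead of a variable, for example `grep -Eo '6[23][0-9]{2}' "tramnummer$x" | paste -sd ' ' -`. Note that `grep -o` reports non-overlapping matches only, so for an input such as 626303 it finds 6263 but not 6303.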
Upvotes: 2