DenK

Reputation: 17

isolating a group of characters

I'm working on a school project. I use tesseract to extract numbers from a picture. The output can look something like this:

7586630342033088866

What I need is to extract every 4-digit number beginning with 63 or 62.

So in this example the result should be 6303. If I get a longer number like:

7586630342033088866234

the output should be 6303 6234.

I would like to do this in a terminal script, since I already download my pictures, preprocess them and run tesseract from a single script in the terminal.

I tried some things with sed and awk, but with no success.

Here is the end of the script I'm already using:

printf '\n run tesseract\n'
        cd /media/nummer/tramnummerNummer || exit 1
        x=0                             # file counter, starts at 0
        keyword='tramnummer'            # base name of the files
        extension='*.JPG'               # glob for the files to process
        for i in $extension             # loop once per matching file (no need to parse ls)
        do
        x=$((x + 1))                    # increase counter

        tesseract "$keyword$x.JPG" "$keyword$x" -l bet -psm 6    # run tesseract; writes tramnummer$x.txt
        tr -d '[:space:]' < "$keyword$x.txt" > "$keyword$x"      # remove whitespace from the tesseract output
#       sed 's/\(.\)/\1\n/g' -i tramnummer$x            # something I tried; it puts every character on a separate line
#       sed 's/[^6]*\(6.*\)/\1/' -i tramnummer$x        # another thing I tried; it deletes everything before the first 6
        done

Can anyone help me with this or put me on the right track? Thanks in advance.

Upvotes: 1

Views: 65

Answers (2)

Vijay

Reputation: 67301

Use this Perl one-liner, which prints all matches for a line separated by spaces:

s='7586630342033088866234'
echo "$s" |perl -lne 'push @a,/6[23]../g;print "@a";undef @a'

test

Upvotes: 0

anubhava

Reputation: 785701

Using egrep -o:

s='7586630342033088866234'
echo "$s" | egrep -o '6[23][0-9]{2}'
6303
6234
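
If the matches are wanted on a single line, as in the question (6303 6234), one possible sketch is to join the grep output with paste. The function name extract_codes is made up for illustration and is not part of the asker's script; grep -Eo is used here as the non-deprecated spelling of egrep -o.

```shell
#!/bin/sh
# extract_codes: hypothetical helper name, chosen only for this example.
# Prints every 4-digit group starting with 62 or 63, separated by spaces.
extract_codes() {
    # grep -Eo prints each non-overlapping match on its own line;
    # paste -sd' ' - joins those lines into one, using a space as delimiter.
    printf '%s\n' "$1" | grep -Eo '6[23][0-9]{2}' | paste -sd' ' -
}

extract_codes '7586630342033088866234'
```

Inside the loop from the question, the same pipeline could presumably read the cleaned file directly, e.g. grep -Eo '6[23][0-9]{2}' tramnummer$x | paste -sd' ' -. Note that grep -o scans left to right and does not report overlapping matches.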

Upvotes: 2
