CyberSamurai

Reputation: 179

Re-order columns in a text file by a specific pattern

I'm very new to awk and have been banging my head trying to get this to work. I'm trying to take a list of files in "image.list" and create an "info" file from it. I need to grab the string matching a regex (a number 8-11 digits long) from the middle of each filename and print just that match into the designated spot in my "info" file. That last part is what I'm having trouble with, and I would love some help fixing it.

Here is my test file list:

SURGERY0001275678image1.jpg
SURGERY11134900211image2.jpg
SURGERY19257012image3.jpg
SURGERY273142590image4.jpg

Here is my current code:

awk 'BEGIN {print "-----TEST TAG FILE\tENCOUNTERS-----";}
> {print "FILE:  /tmp/imagetest/"$1,"\t","ENCOUNTER: ",($1~/^[0-9]{8,11}$/);}
> END{print "END REPORT";
> }' image.list > upload.tag

And here is my current output:

-----TEST TAG FILE      ENCOUNTERS-----
FILE:  /tmp/imagetest/SURGERY0001275678image1.jpg        ENCOUNTER:  0
FILE:  /tmp/imagetest/SURGERY11134900211image2.jpg       ENCOUNTER:  0
FILE:  /tmp/imagetest/SURGERY19257012image3.jpg          ENCOUNTER:  0
FILE:  /tmp/imagetest/SURGERY273142590image4.jpg         ENCOUNTER:  0
END REPORT

What I need it to display after "ENCOUNTER:" is the 8-11 digit number from the middle of the file name. So far everything I've tried outputs either the whole filename or "0".

I'm probably way off course so I'd love to get some help from you experts!

Upvotes: 2

Views: 301

Answers (7)

jaypal singh

Reputation: 77095

Re-using your existing code:

$ awk '
BEGIN {
    print "-----TEST TAG FILE\tENCOUNTERS-----";
}
match($0,/[^0-9]+([0-9]+)[^0-9]+/,ary) {
    print "FILE:  /tmp/imagetest/"$1,"\t","ENCOUNTER:"ary[1]
}
END { 
    print "END REPORT";
}' testfile

Test:

$ cat testfile
SURGERY0001275678image1.jpg
SURGERY11134900211image2.jpg
SURGERY19257012image3.jpg
SURGERY273142590image4.jpg

$ awk '
> BEGIN {
>     print "-----TEST TAG FILE\tENCOUNTERS-----";
> }
> match($0,/[^0-9]+([0-9]+)[^0-9]+/,ary) {
>     print "FILE:  /tmp/imagetest/"$1,"\t","ENCOUNTER:"ary[1]
> }
> END { 
>     print "END REPORT";
> }' testfile
-----TEST TAG FILE      ENCOUNTERS-----
FILE:  /tmp/imagetest/SURGERY0001275678image1.jpg        ENCOUNTER:0001275678
FILE:  /tmp/imagetest/SURGERY11134900211image2.jpg       ENCOUNTER:11134900211
FILE:  /tmp/imagetest/SURGERY19257012image3.jpg          ENCOUNTER:19257012
FILE:  /tmp/imagetest/SURGERY273142590image4.jpg         ENCOUNTER:273142590
END REPORT

As Ed Morton pointed out in the comments, this solution is GNU awk only, because it relies on the array argument to match().
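
As for why the original attempt printed 0: the ~ operator only reports whether the regex matched (1 or 0), never the matched text, and the ^ and $ anchors required the whole field to consist of digits. Here is a minimal sketch of the difference. It uses [0-9]+ instead of {8,11} purely to keep it short and portable, and the variable names are my own:

```shell
name='SURGERY0001275678image1.jpg'

# Anchored test: 0, because the whole field is not made of digits.
anchored=$(printf '%s\n' "$name" | awk '{ print ($1 ~ /^[0-9]+$/) }')

# Unanchored test: 1, but still only a true/false result.
unanchored=$(printf '%s\n' "$name" | awk '{ print ($1 ~ /[0-9]+/) }')

# match() sets RSTART and RLENGTH, so substr() can return the text itself.
matched=$(printf '%s\n' "$name" | awk '{ if (match($1, /[0-9]+/)) print substr($1, RSTART, RLENGTH) }')

printf '%s %s %s\n' "$anchored" "$unanchored" "$matched"
```

So to get the digits themselves you need match() (or a substitution), not the ~ operator.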

Upvotes: 5

Ed Morton

Reputation: 203368

Here's the commonly-written awk function "extract()" to extract a string that matches an RE:

awk -v re='<whatever>' '
function extract(str,regexp)
{ RMATCH = (match(str,regexp) ? substr(str,RSTART,RLENGTH) : "")
  return RSTART
}
extract($0,re) { print RMATCH }
'

Just set "re" to whatever you want to match, e.g.:

$ cat file
SURGERY0001275678image1.jpg
SURGERY11134900211image2.jpg
SURGERY19257012image3.jpg
SURGERY273142590image4.jpg

$ awk -v re='[[:digit:]]{8,11}' '
function extract(str,regexp)
{ RMATCH = (match(str,regexp) ? substr(str,RSTART,RLENGTH) : "")
  return RSTART
}
extract($0,re) { print RMATCH }
' file
0001275678
11134900211
19257012
273142590

or if you prefer a more specific solution using the same match()+substr() approach:

$ awk '
BEGIN{ print "-----TEST TAG FILE\tENCOUNTERS-----" }
{ printf "FILE:  %s\tENCOUNTER: %s\n", $0, (match($0,/[[:digit:]]{8,11}/) ? substr($0,RSTART,RLENGTH) : 0) }
END{ print "END REPORT" }
' file
-----TEST TAG FILE      ENCOUNTERS-----
FILE:  SURGERY0001275678image1.jpg      ENCOUNTER: 0001275678
FILE:  SURGERY11134900211image2.jpg     ENCOUNTER: 11134900211
FILE:  SURGERY19257012image3.jpg        ENCOUNTER: 19257012
FILE:  SURGERY273142590image4.jpg       ENCOUNTER: 273142590
END REPORT

Note that if all of your file names follow the same pattern and don't have other digits before the stream of 8-11 digits you care about, you could just use [[:digit:]]+ as the matching RE instead of explicitly specifying the range [[:digit:]]{8,11} if you like.
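
To illustrate that caveat, here is a small sketch with a made-up file name, OR2SURGERY0001275678image1.jpg (not from the question), which has a stray digit before the encounter number. It avoids the {8,11} interval syntax, which some older awks do not support, by walking through each run of digits and checking its length. The names first_run, find_encounter and encounter are my own:

```shell
# first_run: what a bare [0-9]+ match returns (the first run of digits).
first_run() {
    awk '{ print (match($0, /[0-9]+/) ? substr($0, RSTART, RLENGTH) : "") }'
}

# find_encounter: the first run of digits that is 8 to 11 digits long.
find_encounter() {
    awk '
    function encounter(str,    rest) {
        rest = str
        while (match(rest, /[0-9]+/)) {
            if (RLENGTH >= 8 && RLENGTH <= 11)
                return substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)   # skip this run and keep looking
        }
        return ""
    }
    { print encounter($0) }'
}

printf '%s\n' 'OR2SURGERY0001275678image1.jpg' | first_run        # prints 2
printf '%s\n' 'OR2SURGERY0001275678image1.jpg' | find_encounter   # prints 0001275678
```

One difference from [[:digit:]]{8,11}: this sketch skips a run longer than 11 digits entirely, whereas the interval expression would match the first 11 digits of it. Pick whichever behaviour suits your file names.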

Upvotes: 2

captcha

Reputation: 3756

GNU sed

sed -r -e 's#(.*)#FILE:\t/tmp/imagetest/\1#;s/([0-9]*)(i[^i]*)$/\1\2\tENCOUNTER:\1/;1i -----TEST TAG FILE      ENCOUNTERS-----' -e '$aEND REPORT' file
-----TEST TAG FILE      ENCOUNTERS-----
FILE:   /tmp/imagetest/SURGERY0001275678image1.jpg      ENCOUNTER:0001275678
FILE:   /tmp/imagetest/SURGERY11134900211image2.jpg     ENCOUNTER:11134900211
FILE:   /tmp/imagetest/SURGERY19257012image3.jpg        ENCOUNTER:19257012
FILE:   /tmp/imagetest/SURGERY273142590image4.jpg       ENCOUNTER:273142590
END REPORT

Upvotes: 3

Andrew Clark

Reputation: 208465

Try the following (note that gensub() is specific to GNU awk):

awk 'BEGIN {print "-----TEST TAG FILE\tENCOUNTERS-----";}
{print "FILE:  /tmp/imagetest/"$1,"\t","ENCOUNTER: ",gensub(/[^0-9]*([0-9]*).*/, "\\1", 1, $1);}
END{print "END REPORT";
}' image.list > upload.tag

Upvotes: 0

bartimar

Reputation: 3534

This

awk 'BEGIN {print "-----TEST TAG FILE\tENCOUNTERS-----";}
{printf "FILE:  /tmp/imagetest/"$1"\tENCOUNTER: ";if($1~/[0-9]{8,11}/){sub(/[0-9]+\.jpg$/,"",$1); gsub(/[a-zA-Z]/,"",$1);print $1}}
END{print "END REPORT";
}' image.list

will print

-----TEST TAG FILE      ENCOUNTERS-----
FILE:  /tmp/imagetest/SURGERY0001275678image1.jpg        ENCOUNTER: 0001275678
FILE:  /tmp/imagetest/SURGERY11134900211image2.jpg       ENCOUNTER: 11134900211
FILE:  /tmp/imagetest/SURGERY19257012image3.jpg          ENCOUNTER: 19257012
FILE:  /tmp/imagetest/SURGERY273142590image4.jpg         ENCOUNTER: 273142590
END REPORT

Upvotes: 0

Barmar

Reputation: 780899

awk '{# sub() cannot use \1 backreferences (only gensub() can), so use match() + substr()
      encounter = (match($1, /[0-9]{8,11}/) ? substr($1, RSTART, RLENGTH) : "");
      print "FILE:  /tmp/imagetest/"$1,"\t","ENCOUNTER: ",encounter;}'

Upvotes: 0

Fredrik Pihl

Reputation: 45652

Try this:

$ cat input
SURGERY0001275678image1.jpg
SURGERY11134900211image2.jpg
SURGERY19257012image3.jpg
SURGERY273142590image4.jpg

$ awk '{split($1,a,/[[:alpha:]]+/);print a[2]}' input
0001275678
11134900211
19257012
273142590

Upvotes: 0
