Vandit Goel

Reputation: 803

How to extract multiple parts of a string using awk/sed/perl?

I search for log files containing errors using egrep, and it outputs a list of file names. What I want to do is manipulate those strings and present them in a different way.

/abcd/efgh/ijkl/logs/fac_unet_abp99507.log.20220708111219.26476752.0
/abcd/efgh/ijkl/logs/fac_oxf_abp3506.log.20220708111219.26476752.0
/abcd/efgh/ijkl/logs/cirrus_abp4296EI_20220824.log
/abcd/efgh/ijkl/mcr/logs/prof_cmcr_abp4296MR.log.20220824150526.15728964.0

The output should look like:

ABP99507,UNET
ABP3506,OXF
ABP4296EI,CIRRUS
ABP4296MR,CMCR

I tried awk and sed and couldn't figure out a way to do this. I want to be able to make it dynamic and do it via regular expressions.

What I have tried so far is:

egrep -li "^error" /abcd/efgh/ijkl/logs/*202207* | awk '/unet|cirrus|oxf|csp|cmcd|cmcr|nice/ {print}'
egrep -li "^error" /abcd/efgh/ijkl/logs/*202207* | sed -n "s/.*\(cirrus|unet|cmcr|csp|cmcd|oxf|nice\)\(abp[0-9]*[A-ZA-Za-za-z]*\).*/\1,\2/p"

Sed doesn't work because the "|" operator is taken as a literal; I am not using the GNU version. Even escaping it doesn't work. Also, I can't seem to make use of capture groups.

Upvotes: -1

Views: 258

Answers (3)

Ed Morton

Reputation: 204259

Throw away egrep (which is deprecated in favor of grep -E, by the way) and just use awk, e.g. using an awk that supports nextfile such as GNU awk (some other awks already support it, and POSIX will soon require it):

awk -v OFS=',' '
    tolower($0) ~ /^error/ {
        split(toupper(FILENAME),a,/[_.]/)
        print a[3], a[2]
        nextfile
    }
' /abcd/efgh/ijkl/logs/*202207*

or using any awk:

awk -v OFS=',' '
    FNR==1 { searching=1 }
    searching && (tolower($0) ~ /^error/) {
        split(toupper(FILENAME),a,/[_.]/)
        print a[3], a[2]
        searching=0
    }
' /abcd/efgh/ijkl/logs/*202207*
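
To check which array elements the split() call above produces for a given path, the same call can be run on the path as plain text instead of as a file name. This is only a minimal sketch: show_parts is a hypothetical helper name, not part of the answer.

```shell
# Hypothetical helper: print every piece that split() yields for one path.
show_parts() {
    printf '%s\n' "$1" | awk '{
        # Same call as in the scripts above, applied to the text of the path.
        n = split(toupper($0), a, /[_.]/)
        for (i = 1; i <= n; i++) printf "a[%d]=%s\n", i, a[i]
    }'
}

show_parts /abcd/efgh/ijkl/logs/fac_unet_abp99507.log.20220708111219.26476752.0
```

For this path a[2] is UNET and a[3] is ABP99507, which is what the scripts print. A name with only one underscore before the ID, such as cirrus_abp4296EI_20220824.log, splits differently, so it is worth running the helper on each naming pattern you expect.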

If you really want to implement what you were apparently trying to do with /unet|cirrus|oxf|csp|cmcd|cmcr|nice/ to restrict which files the script examines then change this:

awk -v OFS=',' '
    ...
}' /abcd/efgh/ijkl/logs/*202207*

to this:

shopt -s extglob
awk -v OFS=',' '
    ...
}' /abcd/efgh/ijkl/logs/*@(unet|cirrus|oxf|csp|cmcd|cmcr|nice)*202207*

Upvotes: 1

Daweo

Reputation: 36680

Also I can't seem to make use of capture groups.

You did not escape |, so each one is treated as a literal |. In GNU sed's basic regular expressions it has to be written \| to mean alternation, just as \( and \) delimit a group while plain ( and ) are literal. After doing that and repairing some minor issues I got it working. Let the content of file.txt be

/abcd/efgh/ijkl/logs/fac_unet_abp99507.log.20220708111219.26476752.0
/abcd/efgh/ijkl/logs/fac_oxf_abp3506.log.20220708111219.26476752.0

then

sed -e 's/.*\(cirrus\|unet\|cmcr\|csp\|cmcd\|oxf\|nice\)_\(abp[0-9]*[A-ZA-Za-za-z]*\).*/\2,\1/' -e 's/[a-z]/\U&/g' file.txt

gives output

ABP99507,UNET
ABP3506,OXF

Explanation: I made the following changes: escaped |, added _ between the groups, changed the order in the replacement (the 2nd group comes first), and dropped the p flag because it doubled the output. After that I added a second action: uppercasing, done the standard GNU sed way. As there are now 2 actions, I use -e to register each of them.

(tested in GNU sed 4.2.2)
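
Since the question says the sed in use is not GNU sed, here is a hedged sketch that avoids both \| and \U. It assumes the system name is the run of lowercase letters directly before _abp, so no alternation is needed, and it leaves the uppercasing to tr. The helper name extract_with_sed is hypothetical, and this has not been checked against the asker's sed.

```shell
# Hypothetical helper: reads paths on stdin, writes ID,SYSTEM lines.
extract_with_sed() {
    # '#' is the s-command delimiter so '/' can sit inside the bracket expression.
    # Group 1: lowercase name just before "_abp"; group 2: the abp ID itself.
    sed -n 's#.*[_/]\([a-z]*\)_\(abp[0-9]*[A-Za-z]*\).*#\2,\1#p' |
        tr '[:lower:]' '[:upper:]'
}

printf '%s\n' \
    /abcd/efgh/ijkl/logs/fac_unet_abp99507.log.20220708111219.26476752.0 \
    /abcd/efgh/ijkl/logs/cirrus_abp4296EI_20220824.log |
    extract_with_sed
```

With these two sample paths it prints ABP99507,UNET and ABP4296EI,CIRRUS. Lines that do not match are suppressed by -n, so unrelated file names are dropped rather than echoed unchanged.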

Upvotes: 2

RavinderSingh13

Reputation: 133680

1st solution: The simplest option is to use awk's field separator. With your shown samples, please try the following awk code.

awk -F'/|\\.|_' '{print toupper($8","$7)}' Input_file


2nd solution: If you want to use a regular expression in awk, try the following. Written and tested in GNU awk.

awk 'match($0,/logs\/[^_]*_([^_]*)_([^.]*)\.log/,arr){print toupper(arr[2]","arr[1])}'  Input_file


3rd solution: With GNU sed, enable ERE with the -E option and try the following code.

sed -E 's/.*logs\/[^_]*_([^_]*)_([^.]*)\.log\..*/\U\2,\U\1/' Input_file


4th solution: A non-GNU awk solution using the match function.

awk '
match($0,/logs\/[^_]*_([^_]*)_([^.]*)\.log/){
  val=substr($0,RSTART+5,RLENGTH-5)
  sub(/\.log/,"",val)
  split(val,arr,"_")
  print toupper(arr[3]","arr[2])
}
'  Input_file
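
The solutions above rely on the ID being in a fixed position in the file name. As a hedged variation for names such as cirrus_abp4296EI_20220824.log, where the position differs, the sketch below looks for the first piece of the base name that starts with abp followed by digits and pairs it with the piece before it. It uses only split() and toupper(), so it should not need GNU awk; extract_with_awk is a hypothetical helper name.

```shell
# Hypothetical helper: reads paths on stdin, writes ID,SYSTEM lines.
extract_with_awk() {
    awk '{
        n = split($0, path, "/")           # path[n] is the base name
        m = split(path[n], part, /[_.]/)   # pieces between "_" and "."
        for (i = 2; i <= m; i++)
            if (part[i] ~ /^abp[0-9]+/) {  # first piece that looks like an ID
                print toupper(part[i] "," part[i-1])
                break
            }
    }'
}

printf '%s\n' \
    /abcd/efgh/ijkl/logs/cirrus_abp4296EI_20220824.log \
    /abcd/efgh/ijkl/mcr/logs/prof_cmcr_abp4296MR.log.20220824150526.15728964.0 |
    extract_with_awk
```

With these two sample paths it prints ABP4296EI,CIRRUS and ABP4296MR,CMCR. The loop starts at 2 because the ID needs a preceding piece to pair with.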

Upvotes: 2
