user10254032

Reputation: 193

concatenating string with multiple array

I'm trying to rearrange the values from specific strings into their respective columns. Here is the input:

String 1:  47/13528 
String 2:  55(s) 
String 3:   
String 4:  114(n) 
String 5:  225(s), 26/10533-10541 
String 6:  103/13519 
String 7:  10(s), 162(n) 
String 8:  152/12345,12346
(d=dead, n=null, s=strike) 

The letter in each value is the flag (d=dead, n=null, s=strike). The number of the string is appended to the value after a "c", so the value from "String 1" becomes 47c1, and so on:

String 1:  47/13528 
A value without any flag is sorted into the null column, along with values tagged (n).
String 1 (the string number is concatenated with the value 47/13528)


Sorted : 
null
47c1@SP13528;114c4;103c6@SP13519;162c7


Str#2:  55(s)
Values flagged with (s) are sorted into the strike column.

Sorted :
strike
55c2;225c5;26c5@SP10533-10541;162c7

I tried to parse it by modifying some earlier code, but had no luck:

{
    for (i=1; i<=NF; i++) {
        num  = $i+0
        abbr = $i
        gsub(/[^[:alpha:]]/,"",abbr)
        list[abbr] = list[abbr] num " c " val ORS
    }
}
END {
    n = split("dead null strike",types)
    for (i=1; i<=n; i++) {
        name = types[i]
        abbr = substr(name,1,1)
        printf "name,list[abbr]\n" 
    }
}

Expected output (sorted into CSV):

dead,null,strike
,47c1@SP13528;114c4; 26c5@SP10533-10541;103c6@SP13519;162c7, 152c8@SP12345;152c8@SP12346,55c2;225c5;162c7;10c7

Breakdown for crosscheck purpose:

dead
none 

null
47c1@SP13528;114c4;103c6@SP13519;162c7;152c8@SP12345;152c8@SP12346;26c5@SP10533-10541;;162c7

strike
55c2;225c5;10c7

Upvotes: 1

Views: 138

Answers (2)

thanasisp

Reputation: 5965

Here is an awk script for parsing your file.

BEGIN {
    types["d"]; types["n"]; types["s"]
    deft = "n"; OFS = ","; sep = ";"
}

$1=="String" {
    gsub(/[)(]/,""); gsub(",", " ")    # general line subs
    for (i=3;i<=NF;i++) {
        if (!gsub("/","c"$2+0"@SP", $i)) $i = $i"c"$2+0    # make all subs on items
        for (t in types) { if (gsub(t, "", $i)) { x=t; break }; x=deft } #find type
        items[x] = items[x]? items[x] sep $i: $i    # append for type found
    }
}

END {
    print "dead" OFS "null" OFS "strike"
    print items["d"] OFS items["n"] OFS items["s"]
}

Input:

String 1:  47/13528 
String 2:  55(s) 
String 3:   
String 4:  114(n) 
String 5:  225(s), 26/10533-10541 
String 6:  103/13519 
String 7:  10(s), 162(n) 
String 8:  152/12345,12346
(d=dead, n=null, s=strike) 

Output:

> awk -f tst.awk file
dead,null,strike
,47c1@SP13528;114c4;26c5@SP10533-10541;103c6@SP13519;162c7;152c8@SP12345;12346c8,55c2;225c5;10c7

Your description kept changing on important details, such as how the type of an item is decided or how items are separated, and so far your input and expected outputs are not consistent with it. In general, though, I think you can easily follow what this script does. Keep in mind that gsub() returns the number of substitutions made, in addition to performing them, so it is often convenient to use it as a condition.
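To illustrate the point about gsub() as a condition, here is a minimal sketch that is separate from the script above. The function name tag_item and the sample values are made up for illustration; it only shows the "replace if possible, otherwise append" pattern on a single item.

```shell
# Hypothetical helper: rewrite one item ($1) for a given string number ($2).
# gsub() returns the number of replacements it made, so it doubles as the test.
tag_item() {
    printf '%s\n' "$1" | awk -v n="$2" '{
        if (gsub("/", "c" n "@SP")) {   # a "/" was found and replaced
            print
        } else {                        # no "/": just append c<n>
            print $0 "c" n
        }
    }'
}

tag_item 47/13528 1    # prints 47c1@SP13528
tag_item 114 4         # prints 114c4
```

The same idea is what the `if (!gsub(...))` line in the script relies on: the substitution and the check happen in one call.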

Upvotes: 1

KamilCuk

Reputation: 140940

My usual approach is:

  1. First, preprocess the data so that each line holds one piece of information.
  2. Then preprocess it again so that each piece of information sits in its own column.
  3. After that it's easy: just accumulate the columns in an array in awk and print them.

The following code:

cat <<EOF |
String 1:  47/13528 
String 2:  55(s) 
String 3:   
String 4:  114(n) 
String 5:  225(s), 26/10533-10541 
String 6:  103/13519 
String 7:  10(s), 162(n) 
String 8:  152/12345,12346
(d=dead, n=null, s=strike) 
EOF
sed '
    # filter only lines with String
    /^String \([0-9]*\): */!d;
    # Remove the String
    # Remove the : and spaces
    s//\1 /
    # remove trailing spaces
    s/ *$//
    # Remove lines with nothing
    /^[0-9]* *$/d
    # remove the commas and split lines on comma
    # by moving them to separate lines
    # repeat that as long as a comma is found
    : a
    /\([0-9]*\) \(.*\), *\(.*\)/{
        s//\1 \2\n\1 \3/
        ba
    }
' | sed '
    # we should be having two fields here
    # separated by a single space
    /^[^ ]* [^ ]*$/!{
        s/.*/ERROR: "&"/
        q1
    }
    # Move the flag in parentheses to a separate column
    /(\(.\))$/{
        s// \1/
        b not
    } ; {
        # default is n
        s/$/ n/
    } ; : not
    # shuffle first and second field
    # to that <num>c<num>(@SP<something>)? format
    # if second field has a "/"
    \~^\([0-9]*\) \([0-9]*\)/\([^ ]*\)~{
        # then add a SP
        s//\2c\1@SP\3/
        b not2
    } ; {
        # otherwise just do a "c" between
        s/\([0-9]*\) \([0-9]*\)/\2c\1/
    } ; : not2
' |
sort -n -k1 |
# now it's trivial
awk '
{ 
    out[$2] = out[$2] (!length(out[$2])?"":";") $1
}

function outputit(name, idx) {
    print name
    if (length(out[idx]) == 0) {
        print "none"
    } else {
        print out[idx]
    }
    printf "\n"
}

END{
    outputit("dead", "d")
    outputit("null", "n")
    outputit("strike", "s")
}
'

outputs on repl:

dead
none

null
26c5@SP10533-10541;47c1@SP13528;103c6@SP13519;114c4;152c8@SP12345;162c7;12346c8

strike
10c7;55c2;225c5

I believe the output matches yours except for the order within each ;-separated list: you seem to sort by the first column and then the second, whereas I just sorted with sort.
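If the intended order is by string number (the digits after the c), which is only a guess from the expected output, one option is to change the sort key. A minimal sketch, assuming the lines reaching sort look like `47c1@SP13528 n`; the function name sort_by_string is made up, and -s (stable sort) is an extension found in GNU and BSD sort rather than POSIX.

```shell
# Hypothetical replacement for the "sort -n -k1" stage of the pipeline.
sort_by_string() {
    # Split fields on "c": field 2 then starts with the string number.
    # -k2,2n compares that number numerically; -s keeps input order on ties.
    sort -s -t c -k2,2n
}

printf '%s\n' '162c7 n' '47c1@SP13528 n' '10c7 s' '114c4 n' | sort_by_string
# 47c1@SP13528 n
# 114c4 n
# 162c7 n
# 10c7 s
```

This relies on the flags (d, n, s) and the @SP suffix never containing a lowercase c; if they could, the separator trick would need rethinking.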

Upvotes: 1
