Arthis
Arthis

Reputation: 77

Awk: Stop 'print' adding newline and how to loop/automate

As it stands, I have tab delimited data laid out like this (headers added here for clarity):

EntryID    GroupID    Result
039848     00100      Description 1  
088345     00200      Description 2
748572     00435      Description 3
884938     00200      Description 2
000392     00200      Description 3
008429     00100      Description 4

What I am trying to do is condense my data into groups. I wish to output a table with column A being groupIDs (with no duplication) and column B being a combination of all descriptions associated with that group. An example output would be:

00100      Description 1 | Description 4
00200      Description 2 | Description 2| Description 3
00435      Description 3

I've tried to write an awk command to produce one line at a time, given a Group ID as a parameter:

$ awk -F '\t' '/00100/ { print $2 '\t' $3 }' table.txt > output.txt

This works, however each hit is printed on a newline, like this

00100    Description 1
00100    Description 2

etc

I gather that this can be solved by specifying the ORS to an alternate character, or using printf rather than print, but when I try either of these

$ awk -F '\t' 'BEGIN {ORS = '\t'} /00100/ { print $2 '\t' $3 }' table.txt > output.txt
or
$ awk -F '\t' '/00100/ { printf $2 '\t' $3 }' table.txt > output.txt

Nothing actually changed in the output.

Once I get that solved, the other problem I have is that I have thousands of groups to repeat this with. I have a list of every group ID present in the data, stored in a different file, and I'd like to automate feeding that to awk for each ID.

I've tried modifying a command I've seen used to feed IDs to grep in a similar fashion, but I haven't had any luck with that either, as it just hangs:

$ for i in `$ cat groupIDs.txt`; do awk -F '\t' '/$i/ { print $2 '\t' $3 }' table.txt' >> test_results.txt ; done;

Any ideas how I can solve these issues?

Upvotes: 3

Views: 1325

Answers (4)

Digital Trauma
Digital Trauma

Reputation: 15986

I'm not much on awk, but you can do this with bash, sort, grep, cut and paste:

#!/bin/bash

groups=$(cut -f2 "$1" | sort -u)
for group in $groups ; do
    echo -n "$group "
    cut -f2- "$1" | grep "^$group" | cut -f2 | paste -d"|" -s -
done

This produces the following output:

00100   Description 1|Description 4
00200   Description 2|Description 2|Description 3
00435   Description 3

Not sure if the output delimiter has to be " | " or if "|" will do.

Upvotes: 3

konsolebox
konsolebox

Reputation: 75458

Code:

#!/usr/bin/awk -f

BEGIN {
    FS = OFS = "\t"
    getline
}
{
    if ($2 in a) {
        a[$2] = a[$2] " | " $3
    } else {
        a[$2] = $3
        b[i++] = $2
    }
}
END {
    for (j = 0; j < i; ++j) {
        k = b[j]
        print k, a[k]
    }
}

Input:

EntryID GroupID Result
039848  00100   Description 1
088345  00200   Description 2
748572  00435   Description 3
884938  00200   Description 2
000392  00200   Description 3
008429  00100   Description 4

Output:

00100   Description 1 | Description 4
00200   Description 2 | Description 2 | Description 3
00435   Description 3

Upvotes: 0

Ed Morton
Ed Morton

Reputation: 203209

$ cat tst.awk
BEGIN {
    FS=OFS="\t"
    split(tgtS,tmpA,/,/)
    for (i in tmpA)
        tgtA[tmpA[i]]
}

(!tgtS) || ($2 in tgtA) {
    descs[$2] = descs[$2] sep[$2] $3
    sep[$2]=" | "
}

END {
    for (gid in descs)
        print gid, descs[gid]
}
$ 
$ gawk -f tst.awk file
00435   Description 3
00100   Description 1 | Description 4
00200   Description 2 | Description 2 | Description 3
$ 
$ gawk -v tgtS="00100" -f tst.awk file
00100   Description 1 | Description 4
$ 
$ gawk -v tgtS="00100,00200" -f tst.awk file
00100   Description 1 | Description 4
00200   Description 2 | Description 2 | Description 3

Upvotes: 0

user000001
user000001

Reputation: 33317

You can try this awk command:

$ awk '{i=$2;$1=""; $2="";a[i]=a[i]?a[i]" |"$0:$0}END{for (i in a) print i, a[i]} ' file
00435   Description 3
00100   Description 1 |  Description 4
00200   Description 2 |  Description 2 |  Description 3

Or since the file is tab delimited, you can simplify it to

$  awk -F'\t' '{a[$2]=a[$2]?a[$2]" | "$3:$3}END{for (i in a) print i"\t"a[i]} ' file
00435   Description 3
00100   Description 1 | Description 4
00200   Description 2 | Description 2 | Description 3

Upvotes: 2

Related Questions