EMMA_Y
EMMA_Y

Reputation: 45

Split the column and attach the new column to the table separately

I have a table called baitmap.txt. In $5, different gene names are separated by ",".

baitmap.txt

1   831895  848168  218 RP11-54O7.1-001
1   848169  850618  219 RP11-54O7.2-001
1   850619  874081  220 SAMD11-011,SAMD11-003,SAMD11-010,SAMD11-001,SAMD11-004
1   889424  903640  223 NOC2L-001
1   903641  927394  224 C1orf170-001,C1orf170-201
1   927395  936954  225 HES4-002,HES4-001,HES4-004
1   943677  957199  228 RP11-54O7.11-001
1   1005127 1034268 234 RNF223-201
1   1049052 1062659 239 C1orf159-002,C1orf159-001,C1orf159-004,C1orf159-009,C1orf159-011,C1orf159-017,C1orf159-016,C1orf159-203,C1orf159-201,C1orf159-204,C1orf159-202
1   1096739 1107115 246 MIR200B-201

I would like to split the names in $5 then merge the new column to the previous 4 columns separately. So it should look like this:

1   831895  848168  218 RP11-54O7.1-001
1   848169  850618  219 RP11-54O7.2-001
1   850619  874081  220 SAMD11-011
1   850619  874081  220 SAMD11-003
1   850619  874081  220 SAMD11-010
1   850619  874081  220 SAMD11-001
1   850619  874081  220 SAMD11-004
1   889424  903640  223 NOC2L-001
1   903641  927394  224 C1orf170-001
1   903641  927394  224 C1orf170-201

Upvotes: 2

Views: 67

Answers (5)

Ivan
Ivan

Reputation: 7277

Another bash variant

while read line; do
    first="${line% *}"
    last="${line##* }"
    printf "$first ${last//,/\\n$first }\n"
done < baitmap.txt

Upvotes: 0

William Pursell
William Pursell

Reputation: 212248

Since this question was originally tagged C, it seems appropriate to do:

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

static void
process_line(const char *line)
{
        const char *b = line;
        int p = 1;
        for(int field = 0; *b; b++) {
                if( p && !isspace(*b) && ++field == 5 ) {
                        break;
                }
                p = isspace(*b);
        }
        int len = b - line;
        const char *s = b;
        for( ; *b; b++ ) {
                if( *b == ',' || *b == '\n' ) {
                        fwrite(line, 1, len, stdout);
                        fwrite(s, 1, b - s, stdout);
                        putchar('\n');
                        s = b + 1;
                }
        }
}

int
main(int argc, char *const argv[])
{
        (void)argc;
        (void)argv;
        char *line = NULL;
        size_t s;
        while( getline(&line, &s, stdin) != -1 ) {
                process_line(line);
        }
        free(line);
        return 0;
}

Upvotes: 0

glenn jackman
glenn jackman

Reputation: 246799

With plain bash:

while read -ra fields; do
    # split the last field into the "values" array
    IFS=, read -ra values <<< "${fields[-1]}"
    for val in "${values[@]}"; do
        fields[-1]=$val
        echo "${fields[*]}"
    done
done < baitmap.txt

This requires bash v4.3+ to assign to fields[-1]. To work with an older bash (such as /bin/bash on a Mac):

while read -ra fields; do
    len=${#fields[@]}
    IFS=, read -ra values <<< "${fields[len-1]}"
    for val in "${values[@]}"; do
        fields[len-1]=$val
        echo "${fields[*]}"
    done
done < baitmap.txt

Upvotes: 2

William Pursell
William Pursell

Reputation: 212248

awk '{split($5, a, ","); $5=""; for( k in a) print $0, a[k]}' baitmap.txt

Note that awk hashes the array so that will permute the order. If that matters, you can do:

awk '{n = split($5, a, ","); $5=""; for(k=1; k<=n; k++) print $0, a[k]}' baitmap.txt

Upvotes: 5

abelenky
abelenky

Reputation: 64682

I did this in a few lines of BAT (Dos / Windows only)

@echo off
for /F "tokens=1-5" %%a in (baitmap.txt) do (
    for %%i in (%%e) do (
        echo %%a %%b %%c %%d %%i
    )
)

Output

1 831895 848168 218 RP11-54O7.1-001
1 848169 850618 219 RP11-54O7.2-001
1 850619 874081 220 SAMD11-011
1 850619 874081 220 SAMD11-003
1 850619 874081 220 SAMD11-010
1 850619 874081 220 SAMD11-001
1 850619 874081 220 SAMD11-004
1 889424 903640 223 NOC2L-001
1 903641 927394 224 C1orf170-001
1 903641 927394 224 C1orf170-201
1 927395 936954 225 HES4-002
1 927395 936954 225 HES4-001
1 927395 936954 225 HES4-004
1 943677 957199 228 RP11-54O7.11-001
1 1005127 1034268 234 RNF223-201
1 1049052 1062659 239 C1orf159-002
1 1049052 1062659 239 C1orf159-001
1 1049052 1062659 239 C1orf159-004
1 1049052 1062659 239 C1orf159-009
1 1049052 1062659 239 C1orf159-011
1 1049052 1062659 239 C1orf159-017
1 1049052 1062659 239 C1orf159-016
1 1049052 1062659 239 C1orf159-203
1 1049052 1062659 239 C1orf159-201
1 1049052 1062659 239 C1orf159-204
1 1049052 1062659 239 C1orf159-202
1 1096739 1107115 246 MIR200B-201

Upvotes: 0

Related Questions