Reputation: 41

Splitting bracket content into new columns

I've been trying all day to extract and split this bracket content, but I just can't get it done. I've tried using sed and tr to replace the '[' and ']' with \t, going step by step, but had no luck at all.

tr '[' '\t'

Even a friend tried with vi, but it got too complicated and didn't work:

:%s/\([A-Za-z_]*\)\t\([0-9A-Z-]*\)\t\([0-9]*\)\t[A-Z]*\[\(.\).\(.\)\][A-Z]*\t+\([a-z0-9]*\)\t/\1\t\2\t\3\t\4\t\5\t\6\t\7/g

I also tried Python, but it fails with a "too many values to unpack" error.

It has to be an issue with the brackets or something like that. So, I have this table, but with hundreds of thousands of lines:

Species X-C982  282 AACTGTCCATTGACTCTGATAGTGTAAC[G/A]GAGGAAGATGTGCCTAAAAGGAAGAA scaffold7
Species X-A757  158 CCAAGACAGACAGTGGGGTAGAATTTAC[T/C]ACAACAGGCAGTCACAGTGACAAAGG scaffold7
Species X-G39   842 TGATGAACATCAGACTTTTAAACTTTGC[T/C]CATGCATAAATCTGTATATCACGCTA scaffold9

I need to extract the bracket content and split it on the '/', so the result looks like this (all tab-separated):

Species X-C982  282  G  A  scaffold7
Species X-A757  158  T  C  scaffold7
Species X-G39   842  T  C  scaffold9

Sorry for not posting any good code, but none of my attempts work.

I'm aware this could be done quite easily in Excel, but with files that sometimes exceed a million lines that's just not possible. Thanks in advance.

Upvotes: 0

Answers (3)

peak

Reputation: 116957

If there's any doubt as to how many occurrences of "[X/Y]" there may be in the nucleotide sequence, then it would probably be better to check.

Assuming the input is tab-separated with $3 being the long nucleotide sequence, the following illustrates what could be done:

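awk -F\\t '
  BEGIN { OFS = FS }
  # Sequence contains a bracket: a[2] and a[3] are the two alleles
  $3 ~ /\[/ { split($3, a, "[][/]"); print $1, $2, a[2], a[3], $4; next }
  # No bracket: print empty allele columns so the column count stays the same
  { print $1, $2, "", "", $4 }
' file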
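The check mentioned above could look something like the following. This is only a sketch, assuming tab-separated input with the sequence in $3; the function name check_brackets and the sample lines are made up for illustration. It relies on gsub() returning the number of replacements it made.

```shell
# Hypothetical helper: report every line whose third tab-separated field
# does not contain exactly one bracketed group such as [G/A].
check_brackets() {
  awk -F'\t' '
    {
      seq = $3                          # work on a copy so $3 is untouched
      n = gsub(/\[[^]]*\]/, "", seq)    # gsub returns how many groups it removed
      if (n != 1) printf "line %d: %d bracket groups\n", NR, n
    }'
}

# Made-up sample: line 2 has two groups, line 3 has none
printf 'Species X-C982\t282\tAAC[G/A]GAG\tscaffold7\nSpecies X-A757\t158\tAC[T/C]AC[G/A]T\tscaffold7\nSpecies X-G39\t842\tTGATG\tscaffold9\n' | check_brackets
```

If this prints nothing for the real file, every line has exactly one bracketed group and the simpler one-liners are safe to use.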

Upvotes: 0

Ed Morton

Reputation: 204456

$ awk -F'[][[:space:]/]+' -v OFS='\t' '{print $1, $2, $3, $5, $6, $8}' file
Species X-C982  282     G       A       scaffold7
Species X-A757  158     T       C       scaffold7
Species X-G39   842     T       C       scaffold9
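To see why the command prints $5, $6 and $8 and skips $4 and $7, it may help to dump every field next to its index. The sketch below uses the same field separator on one shortened, made-up line; the function name show_fields is hypothetical.

```shell
# Hypothetical helper: print each field with its index, using the same
# field separator (runs of whitespace, brackets and slashes).
show_fields() {
  awk -F'[][[:space:]/]+' '{ for (i = 1; i <= NF; i++) printf "$%d=%s\n", i, $i }'
}

# Shortened sample line; the real sequences are longer
printf 'Species X-C982\t282\tAACTG[G/A]GAGGA\tscaffold7\n' | show_fields
```

Here $4 and $7 hold the flanking sequence on either side of the brackets, which is why they are left out. Note that this separator also splits on the space inside "Species X-C982", so the name ends up as two tab-separated columns in the output.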

If you're going to be doing any more text manipulation tasks in future, read the book Effective Awk Programming, 4th Edition, by Arnold Robbins.

Upvotes: 3

Casimir et Hippolyte

Reputation: 89629

With awk, you can define the field separator like this:

awk -F'[] ][ACTG]*[[ ]|/' '$1=$1' file
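The command above matches spaces around the sequence and joins the output with awk's default single space. If the file is really tab-separated and tab-separated output is wanted, the same idea (let the field separator swallow the flanking sequence) could be adapted roughly as follows. This is a sketch, not the answer's original command: it assumes literal tabs between columns, only A, C, G, T in the flanks, and no '/' anywhere else on the line. The function name split_on_flanks is hypothetical.

```shell
# Hypothetical variant for tab-separated input. The separator is one of:
#   tab + flanking bases + "["   |   "/"   |   "]" + flanking bases + tab   |   a plain tab
split_on_flanks() {
  awk -F'\t[ACGT]*[[]|/|[]][ACGT]*\t|\t' -v OFS='\t' '{ $1 = $1; print }'
}

# Shortened, made-up sample line
printf 'Species X-C982\t282\tAACTG[G/A]GAGGA\tscaffold7\n' | split_on_flanks
```

Assigning $1 to itself forces awk to rebuild the record with OFS, so the remaining fields come out joined by tabs.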

Upvotes: 1
