Reputation: 41
I've been trying all day to extract and split this bracketed content, but I just can't get it done. I've tried using sed and tr to replace the '[' and ']' with \t and do it step by step, but with no luck at all.
tr '[' '\t'
Even a friend tried with vi, but it got too complicated and didn't work:
:%s/\([A-Za-z_]*\)\t\([0-9A-Z-]*\)\t\([0-9]*\)\t[A-Z]*\[\(.\).\(.\)\][A-Z]*\t+\([a-z0-9]*\)\t/\1\t\2\t\3\t\4\t\5\t\6\t\7/g
I also tried with Python, but it fails with a "too many values to unpack" error.
It has to be an issue with the brackets or something like that. So, I have this table, but containing hundreds of thousands of lines:
Species X-C982 282 AACTGTCCATTGACTCTGATAGTGTAAC[G/A]GAGGAAGATGTGCCTAAAAGGAAGAA scaffold7
Species X-A757 158 CCAAGACAGACAGTGGGGTAGAATTTAC[T/C]ACAACAGGCAGTCACAGTGACAAAGG scaffold7
Species X-G39 842 TGATGAACATCAGACTTTTAAACTTTGC[T/C]CATGCATAAATCTGTATATCACGCTA scaffold9
And I need to extract the bracketed content and split it at the '/' so it will look like this (all tab-separated):
Species X-C982 282 G A scaffold7
Species X-A757 158 T C scaffold7
Species X-G39 842 T C scaffold9
Sorry for not posting any good code but none of them are working.
I'm aware this could be done quite easily in Excel, but when working with sometimes more than a million lines it's just not possible. Thanks in advance.
Upvotes: 0
Views: 63
Reputation: 116957
If there's any doubt as to how many occurrences of "[X/Y]" there may be in the nucleotide sequence, then it would probably be better to check.
Assuming the input is tab-separated with $4 being the long nucleotide sequence (as in the sample data, where it follows the species, ID and number columns), the following illustrates what could be done:
awk -F'\t' '
BEGIN{OFS=FS}
$4 ~ /\[/ { split($4, a, "[][/]"); print $1,$2,$3,a[2],a[3],$5; next}
{print $1,$2,$3,"","",$5} ' file
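The check mentioned above could be sketched roughly as follows. This is only an illustration, assuming tab-separated input with the sequence in the fourth column (as in the question's sample data) and markers of the form [X/Y] with single A/C/G/T letters; `check_brackets` is a made-up helper name, not part of any tool.

```shell
# Hypothetical helper: list lines whose sequence column does not contain
# exactly one [X/Y] marker. Assumes tab-separated input, sequence in $4.
check_brackets() {
    awk -F'\t' '
    {
        seq = $4    # work on a copy so the record itself is left untouched
        # gsub returns the number of replacements, i.e. the number of markers
        n = gsub(/\[[ACGT]\/[ACGT]\]/, "", seq)
        if (n != 1) print NR ": " n
    }' "$@"
}
```

Running `check_brackets file` prints one "line number: marker count" entry for each suspicious line and nothing at all if every line has exactly one marker.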
Upvotes: 0
Reputation: 204456
$ awk -F'[][[:space:]/]+' -v OFS='\t' '{print $1, $2, $3, $5, $6, $8}' file
Species X-C982 282 G A scaffold7
Species X-A757 158 T C scaffold7
Species X-G39 842 T C scaffold9
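To see why fields 4 and 7 are skipped, it may help to print every field with its index for one sample line. The sketch below uses the same separator regex as the command above; `show_fields` is a hypothetical helper name, and the input line is copied from the question with tabs assumed between columns.

```shell
# Hypothetical helper: print each field with its index, splitting on runs of
# whitespace, '[', ']' and '/' (the same separator regex as the command above).
show_fields() {
    awk -F'[][[:space:]/]+' '{ for (i = 1; i <= NF; i++) print i, $i }'
}

printf 'Species\tX-C982\t282\tAACTGTCCATTGACTCTGATAGTGTAAC[G/A]GAGGAAGATGTGCCTAAAAGGAAGAA\tscaffold7\n' | show_fields
```

The two flanking sequences land in fields 4 and 7, the two alleles in fields 5 and 6, and the scaffold name in field 8, which is why only $1, $2, $3, $5, $6 and $8 are printed.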
If you're going to be doing any more text manipulation tasks in future, read the book Effective Awk Programming, 4th Edition, by Arnold Robbins.
Upvotes: 3
Reputation: 89629
With awk, you can define the field separator like this:
awk -F'[] ][ACTG]*[[ ]|/' '$1=$1' file
Upvotes: 1