a.k
a.k

Reputation: 7

repeating row by splitting particular columns value

folks, I have a file like this:

Sequence ID TFBS_ID Binding sequence    TF_family   TF ID
CaCLV3_1    TFmatrixID_0009 taaaTTATTt  AT-Hook AT4G35390
CaCLV3_1    TFmatrixID_0009 aAATAAatat  AT-Hook AT4G35390
CaCLV3_1    TFmatrixID_0022 atcGGTAAct  Trihelix    AT5G28300
CaCLV3_1    TFmatrixID_0025 tcAATCAatt  Homeodomain;bZIP;HD-ZIP AT3G61890

I want to repeat entire row by splitting TF_family column which has more than single family separated by ";" and I want output like this, any help:

Sequence ID TFBS_ID Binding sequence    TF_family   TF ID
CaCLV3_1    TFmatrixID_0009 taaaTTATTt  AT-Hook AT4G35390
CaCLV3_1    TFmatrixID_0009 aAATAAatat  AT-Hook AT4G35390
CaCLV3_1    TFmatrixID_0022 atcGGTAAct  Trihelix    AT5G28300
CaCLV3_1    TFmatrixID_0025 tcAATCAatt  Homeodomain AT3G61890
CaCLV3_1    TFmatrixID_0025 tcAATCAatt  bZIP    AT3G61890
CaCLV3_1    TFmatrixID_0025 tcAATCAatt  HD-ZIP  AT3G61890

Upvotes: 0

Views: 40

Answers (1)

RomanPerekhrest
RomanPerekhrest

Reputation: 92854

awk approach:

awk 'NR==1{print}NR>1{split($4,a,";"); for(i=1;i<=length(a);i++){$4=a[i]; print $0}}' file

The output:

Sequence ID TFBS_ID Binding sequence    TF_family   TF ID
CaCLV3_1 TFmatrixID_0009 taaaTTATTt AT-Hook AT4G35390
CaCLV3_1 TFmatrixID_0009 aAATAAatat AT-Hook AT4G35390
CaCLV3_1 TFmatrixID_0022 atcGGTAAct Trihelix AT5G28300
CaCLV3_1 TFmatrixID_0025 tcAATCAatt Homeodomain AT3G61890
CaCLV3_1 TFmatrixID_0025 tcAATCAatt bZIP AT3G61890
CaCLV3_1 TFmatrixID_0025 tcAATCAatt HD-ZIP AT3G61890

  • NR==1{print} - print the first line as is

  • split($4,a,";") - split the 4th field by ;

  • for(i=1;i<=length(a);i++){$4=a[i]; print $0} - repeat currect record for each subvalue

Upvotes: 1

Related Questions