Reputation: 4274
I have a file like this:
gene_id,transcript_id(s),length,effective_length,expected_count,TPM,FPKM,id
ENSG00000000003.14,ENST00000373020.8,ENST00000494424.1,ENST00000496771.5,ENST00000612152.4,ENST00000614008.4,2.23231E3,2.05961E3,2493,2.112E1,1.788E1,00065a62-5e18-4223-a884-12fca053a109
ENSG00000001084.10,ENST00000229416.10,ENST00000504353.1,ENST00000504525.1,ENST00000505197.1,ENST00000505294.5,ENST00000509541.5,ENST00000510837.5,ENST00000513939.5,ENST00000514004.5,ENST00000514373.2,ENST00000514933.1,ENST00000515580.1,ENST00000616923.4,3.09456E3,2.92186E3,3111,1.858E1,1.573E1,00065a62-5e18-4223-a884-12fca053a109
The problem is that the file is comma-delimited when it should have been tab-delimited, because the values starting with ENST (i.e. transcript_id(s)) are meant to be grouped in one column.
The number of ENST IDs is different in each line. Each ENST ID has the same pattern: it starts with ENST, followed by 11 digits, followed by a period and then 1-3 digits: ^ENST[0-9]{11}[.][0-9]{1,3}.
I want to convert all the commas between ENST IDs to a : or any other character so that I can read this as a CSV file. Any help would be much appreciated. Thanks!
Upvotes: 0
Views: 192
Reputation: 3049
I imagine something as simple as
sed 's|,ENST|:ENST|g;s|:|,|' < /path/to/your/file
should work. No reason to over-complicate.
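To spell out what the two expressions do, here is a small sketch on shortened sample rows. The function name fix_transcripts and the trimmed rows are made up for illustration. It also assumes no other colon appears earlier in a line, which holds for the sample in the question.

```shell
# Hypothetical helper wrapping the same two sed expressions as above.
fix_transcripts() {
  # 1) s|,ENST|:ENST|g  turns every comma that precedes an ENST ID into a colon
  # 2) s|:|,|           turns the first colon on each line back into a comma,
  #                     so gene_id stays in its own column
  sed 's|,ENST|:ENST|g;s|:|,|'
}

# Shortened, illustrative rows with the same layout as the file in the question.
printf '%s\n' \
  'gene_id,transcript_id(s),length,id' \
  'ENSG00000000003.14,ENST00000373020.8,ENST00000494424.1,2.23231E3,00065a62' \
  | fix_transcripts

# Prints:
# gene_id,transcript_id(s),length,id
# ENSG00000000003.14,ENST00000373020.8:ENST00000494424.1,2.23231E3,00065a62
```

The header passes through untouched because it contains neither ,ENST nor a colon. Every data row ends up with the same number of comma-separated fields as the header, so it can be read as a regular CSV.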
Upvotes: 4