Reputation: 575
Here is an example of what my file looks like:
Type Variant_class ACC_NUM dbsnp genomic_coordinates_hg18 genomic_coordinates_hg19 HGVS_cdna HGVS_protein gene disease sequence_context_hg18 sequence_context_hg19 codon_change codon_number intron_number site location location_reference_point author journal vol page year pmid entrezid sift_score sift_prediction mutpred_score
1 DM CM920001 rs1800433 null chr12:9232351:- NM_000014.4 NP_000005.2:p.C972Y A2M Chronicobstructivepulmonarydisease null CACAAAATCTTCTCCAGATGCCCTATGGCT[G/A]TGGAGAGCAGAATATGGTCCTCTTTGCTCC TGT-TAT 972 null null 2 null Poller HUMGENET 88 313 1992 1370808 2 0 DAMAGING 0.594315245478036
1 DM CM004784 rs74315453 null chr22:43089410:- NM_017436.4 NP_059132.1:p.M183K A4GALT Pksynthasedeficiency(pphenotype) null TGCTCTCCGACGCCTCCAGGATCGCACTCA[T/A]GTGGAAGTTCGGCGGCATCTACCTGGACAC ATG-AAG 183 null null 2 null Steffensen JBC 275 16723 2000 10747952 53947 0 DAMAGING 0.787878787878788
1 DM CM1210274 null null chr22:43089327:- NM_017436.4 NP_059132.1:p.Q211E A4GALT NORpolyagglutination null CTGCGGAACCTGACCAACGTGCTGGGCACC[C/G]AGTCCCGCTACGTCCTCAACGGCGCGTTCC CAG-GAG 211 null null null null Suchanowska JBC 287 38220 2012 22965229 53947 0.79 TOLERATED null
What I want to do is split the information in column 13 (codon_change) at the - character. In my example file above, this column contains values such as ATG-AAG and CAG-GAG. I would like to replace the - with a tab so that the two codons end up in separate columns.
I've tried my code below:
with open('disease_mut_split2.txt') as inf:
    with open('disease_mut_splitfinal.txt', 'w') as outf:
        for line in inf:
            outf.write('\t'.join(line.split('-')))
However, this also splits the - in the 6th column, which I do not want. Is there any way I can specify which column to split with the code I have?
Upvotes: 0
Views: 69
Reputation: 2634
Assuming what you're doing is in fact parsing/formatting a csv file, Wayne Werner's csv module approach is probably the most robust way to solve this.
As an alternative, you might consider using re.sub from the re module. The exact regex to use will depend on the data. If, for instance, that column is always 3 nucleotides, a -, and 3 nucleotides, something like this might work:
re.sub(r'(?<=[ACTG]{3})-(?=[ACTG]{3})', '\t', line)
The regex uses lookbehind and lookahead to replace a - that sits between two sets of 3 nucleotides, so as long as that sort of pattern doesn't appear elsewhere in your file, it should work well.
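To make the lookaround behaviour concrete, here is a minimal sketch run against a shortened, made-up line modeled on the question's data. The name CODON_DASH and the sample line are assumptions for illustration only, and the real file may need a stricter pattern if other fields can contain three nucleotide letters on either side of a dash.

```python
import re

# A dash with exactly three nucleotide letters immediately before and after it.
# The lookbehind and lookahead only check the context; they are not replaced.
CODON_DASH = re.compile(r'(?<=[ACGT]{3})-(?=[ACGT]{3})')

# Shortened, hypothetical tab-separated line: the first field ends in a dash
# that must be left alone, the third field is the codon change to split.
line = 'chr22:43089410:-\tNM_017436.4\tATG-AAG\t183\n'

result = CODON_DASH.sub('\t', line)
print(repr(result))
```

Only the dash inside ATG-AAG is replaced, because the dash at the end of the coordinate field is preceded by a colon rather than by nucleotide letters.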
EDIT: Changed to re.sub. For some reason the original code just had me in a split mindset!
Upvotes: 1
Reputation: 51837
If you know the - is always going to be at character position 13 (index 13 in the line), just use a slice:
'{}\t{}'.format(line[:13], line[14:])
Alternatively, if you know the dash you want is always the first one in the line, you can limit the number of splits:
>>> x = 'this has - a few - dashes - in it'
>>> x.split('-', maxsplit=1)
['this has ', ' a few - dashes - in it']
If by column you mean that your data is a csv file (tab-separated files work the same way), then Python's csv module will help:
import csv

# newline='' lets the csv module handle line endings itself (Python 3)
with open('infile.txt', newline='') as f, open('outfile.txt', 'w', newline='') as outfile:
    reader = csv.reader(f, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    writer.writerow(next(reader, None))  # Write out the header row unchanged
    for row in reader:
        # Note: Python lists begin with [0],
        # so the 13th column will have an index of 12.
        # Slice assignment swaps the one field for its two halves, so the
        # writer emits them as separate tab-delimited columns.
        row[12:13] = row[12].split('-')
        writer.writerow(row)
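If you would rather stay close to the loop in the question, the same idea can be sketched with plain string methods. This assumes the file really is tab-delimited and that no field contains an embedded tab or quoting; split_column and its parameters are hypothetical names, not part of any library.

```python
def split_column(line, index=12, sep='-', delimiter='\t'):
    """Split one field of a delimited line on sep and leave the others alone."""
    fields = line.rstrip('\n').split(delimiter)
    if index < len(fields):
        # Slice assignment replaces the single field with its pieces.
        fields[index:index + 1] = fields[index].split(sep)
    return delimiter.join(fields) + '\n'


# Hypothetical three-field line; the field to split sits at index 2 here.
sample = 'chr12:9232351:-\tA2M\tTGT-TAT\n'
print(repr(split_column(sample, index=2)))
```

In the question's loop this would become outf.write(split_column(line)), with the default index of 12 selecting the 13th column. A field without a dash, such as null, passes through unchanged.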
Upvotes: 3