Splitting information in a specific column python?

Question

Here is an example of what my file looks like:

Type    Variant_class   ACC_NUM dbsnp   genomic_coordinates_hg18    genomic_coordinates_hg19    HGVS_cdna   HGVS_protein    gene    disease sequence_context_hg18   sequence_context_hg19   codon_change    codon_number    intron_number   site    location    location_reference_point    author  journal vol page    year    pmid    entrezid    sift_score  sift_prediction mutpred_score
1   DM  CM920001    rs1800433   null    chr12:9232351:- NM_000014.4 NP_000005.2:p.C972Y A2M Chronicobstructivepulmonarydisease  null    CACAAAATCTTCTCCAGATGCCCTATGGCT[G/A]TGGAGAGCAGAATATGGTCCTCTTTGCTCC   TGT-TAT 972 null    null    2   null    Poller  HUMGENET    88  313 1992    1370808 2   0   DAMAGING    0.594315245478036
1   DM  CM004784    rs74315453  null    chr22:43089410:-    NM_017436.4 NP_059132.1:p.M183K A4GALT  Pksynthasedeficiency(pphenotype)    null    TGCTCTCCGACGCCTCCAGGATCGCACTCA[T/A]GTGGAAGTTCGGCGGCATCTACCTGGACAC   ATG-AAG 183 null    null    2   null    Steffensen  JBC 275 16723   2000    10747952    53947   0   DAMAGING    0.787878787878788
1   DM  CM1210274   null    null    chr22:43089327:-    NM_017436.4 NP_059132.1:p.Q211E A4GALT  NORpolyagglutination    null    CTGCGGAACCTGACCAACGTGCTGGGCACC[C/G]AGTCCCGCTACGTCCTCAACGGCGCGTTCC   CAG-GAG 211 null    null    null    null    Suchanowska JBC 287 38220   2012    22965229    53947   0.79    TOLERATED   null

What I want to do is split the values in column 13 at the - character. In my example file above, this column contains values such as TGT-TAT, ATG-AAG and CAG-GAG. I would like the two parts to be separated by a tab.

I've tried my code below:

with open('disease_mut_split2.txt') as inf:
    with open('disease_mut_splitfinal.txt', 'w') as outf:
        for line in inf:
            outf.write('\t'.join(line.split('-')))

However, this also splits on the - in the 6th column, which I do not want. Is there any way I can specify which column to split with the code I have?

Wayne Werner · Accepted Answer

If you know the dash is always going to be at character position 13 of the line, just use a slice:

'{}\t{}'.format(line[:13], line[14:])
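
To make the slice concrete, here is a minimal sketch. Note that the indices are character positions within the line, not tab-separated columns. The sample string is made up for illustration and is not taken from the question's file.

```python
# Hypothetical line whose dash sits at character index 13
line = 'ATGCCCGGGTTTA-AAG rest'

# Everything before index 13, a tab, then everything after the dash
result = '{}\t{}'.format(line[:13], line[14:])
print(repr(result))
```

This only helps when the dash is at a fixed character offset in every line, which is unlikely for a file whose earlier fields vary in length.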

Alternatively, if you know the dash you want is always the first one in the line, you can limit the number of splits:

>>> x = 'this has - a few - dashes - in it'
>>> x.split('-', maxsplit=1)
['this has ', ' a few - dashes - in it']

If by column you mean that your data is a csv file (tab separated files work the same way), then Python's csv module will aid you:

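import csv

# newline='' lets the csv module handle line endings itself
with open('infile.txt', newline='') as f, open('outfile.txt', 'w', newline='') as outfile:
    reader = csv.reader(f, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    writer.writerow(next(reader))  # Write out the header row
    for row in reader:
        # Note: Python lists begin with [0],
        #       so the 13th column will have an index of 12
        # Replace that one field with its two halves, so the writer
        # emits them as separate tab-delimited columns
        row[12:13] = row[12].split('-')
        writer.writerow(row)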
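
If you would rather stay close to the line-by-line loop from the question, the same idea can be sketched without the csv module. This is only an illustration under the assumption that fields are separated by single tabs and never contain quoted tabs. The helper name split_field and its parameters are hypothetical, not part of any library.

```python
def split_field(line, column=13, sep='-', delimiter='\t'):
    """Split one delimited field on `sep`, leaving all other fields alone.

    `column` is 1-based, matching how the question counts columns.
    """
    fields = line.rstrip('\n').split(delimiter)
    idx = column - 1  # convert to a 0-based list index
    if idx < len(fields):
        # Swap the single field for the pieces produced by the split
        fields[idx:idx + 1] = fields[idx].split(sep)
    return delimiter.join(fields)


# Shortened, made-up row: the dash in the first field must survive
sample = 'chr12:9232351:-\tTGT-TAT\t972'
print(repr(split_field(sample, column=2)))
```

In the question's loop this would be used as outf.write(split_field(line) + '\n'). A header cell without a dash is left unchanged, so the header row ends up one column shorter than the data rows unless you rename it yourself.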
