cosmictypist

Reputation: 575

Splitting information in a specific column in Python?

Here is an example of what my file looks like:

Type    Variant_class   ACC_NUM dbsnp   genomic_coordinates_hg18    genomic_coordinates_hg19    HGVS_cdna   HGVS_protein    gene    disease sequence_context_hg18   sequence_context_hg19   codon_change    codon_number    intron_number   site    location    location_reference_point    author  journal vol page    year    pmid    entrezid    sift_score  sift_prediction mutpred_score
1   DM  CM920001    rs1800433   null    chr12:9232351:- NM_000014.4 NP_000005.2:p.C972Y A2M Chronicobstructivepulmonarydisease  null    CACAAAATCTTCTCCAGATGCCCTATGGCT[G/A]TGGAGAGCAGAATATGGTCCTCTTTGCTCC   TGT-TAT 972 null    null    2   null    Poller  HUMGENET    88  313 1992    1370808 2   0   DAMAGING    0.594315245478036
1   DM  CM004784    rs74315453  null    chr22:43089410:-    NM_017436.4 NP_059132.1:p.M183K A4GALT  Pksynthasedeficiency(pphenotype)    null    TGCTCTCCGACGCCTCCAGGATCGCACTCA[T/A]GTGGAAGTTCGGCGGCATCTACCTGGACAC   ATG-AAG 183 null    null    2   null    Steffensen  JBC 275 16723   2000    10747952    53947   0   DAMAGING    0.787878787878788
1   DM  CM1210274   null    null    chr22:43089327:-    NM_017436.4 NP_059132.1:p.Q211E A4GALT  NORpolyagglutination    null    CTGCGGAACCTGACCAACGTGCTGGGCACC[C/G]AGTCCCGCTACGTCCTCAACGGCGCGTTCC   CAG-GAG 211 null    null    null    null    Suchanowska JBC 287 38220   2012    22965229    53947   0.79    TOLERATED   null

What I want to do is split the information in column 13 at the - character. In my example file above, this column contains values such as ATG-AAG and CAG-GAG. I would like the two parts to be separated by a tab instead.

I've tried my code below:

with open('disease_mut_split2.txt') as inf:
    with open('disease_mut_splitfinal.txt', 'w') as outf:
        for line in inf:
            outf.write('\t'.join(line.split('-')))

However, this also splits the - in the 6th column, which I do not want. Is there any way to specify which column to split with the code I have?

Upvotes: 0

Views: 69

Answers (2)

Julian

Reputation: 2634

Assuming what you're doing is in fact parsing/formatting a csv file, Wayne Werner's csv module approach is probably the most robust way to solve this.

As an alternative, you might consider using re.sub from the re module. The exact regex to use will depend on the data. If, for instance, that column is always 3 nucleotides, a -, and 3 more nucleotides, something like this might work:

re.sub(r'(?<=[ACTG]{3})-(?=[ACTG]{3})', '\t', line)

The regex uses lookbehind and lookahead to replace a - between two sets of 3 nucleotides, so it should work well as long as that sort of pattern doesn't appear elsewhere in your file.
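A minimal sketch of that substitution on a single tab-separated line, assuming the codon-change column is the only place where a dash sits directly between two nucleotide triplets. The sample line is shortened and made up for illustration, and the name split_codon_change is hypothetical:

```python
import re

# Matches a '-' that has three nucleotide letters directly before and after it.
CODON_DASH = re.compile(r'(?<=[ACGT]{3})-(?=[ACGT]{3})')

def split_codon_change(line):
    """Replace the dash between two nucleotide triplets with a tab."""
    return CODON_DASH.sub('\t', line)

# Hypothetical, shortened line: coordinates, codon change, codon number.
sample = 'chr22:43089410:-\tATG-AAG\t183'
print(split_codon_change(sample).split('\t'))
# → ['chr22:43089410:-', 'ATG', 'AAG', '183']
```

The trailing dash in the coordinates field is left alone because it is preceded by a colon rather than by nucleotides. Any other dash that happens to have three of A, C, G, T on both sides would also be replaced, so it is worth checking the data for that first.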

EDIT: Changed to re.sub. For some reason the original code just had me in a split mindset!

Upvotes: 1

Wayne Werner

Reputation: 51837

If you know the dash is always going to be the character at position 13 of the line (index 13), just use a slice:

'{}\t{}'.format(line[:13], line[14:])
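Note that this slice works on character positions within the line, not on tab-separated columns. A small illustration on a made-up string whose dash sits at index 13 (the helper name replace_at is hypothetical):

```python
def replace_at(line, index, replacement='\t'):
    """Return line with the single character at `index` swapped for `replacement`."""
    return '{}{}{}'.format(line[:index], replacement, line[index + 1:])

line = '0123456789ABC-DEF'  # made-up string; the dash is at index 13
assert line[13] == '-'
print(repr(replace_at(line, 13)))  # → '0123456789ABC\tDEF'
```

This only helps when every line has the dash at the same character offset, which is unlikely for the variable-width fields in the question's file.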

Alternatively, if you know the dash you want is always the first one in the line, you can limit the number of splits:

>>> x = 'this has - a few - dashes - in it'
>>> x.split('-', maxsplit=1)
['this has ', ' a few - dashes - in it']

If by column you mean that your data is a csv file (tab separated files work the same way), then Python's csv module will aid you:

import csv

# newline='' lets the csv module handle line endings itself
with open('infile.txt', newline='') as f, open('outfile.txt', 'w', newline='') as outfile:
    reader = csv.reader(f, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    writer.writerow(next(reader, None))  # Write out the header row
    for row in reader:
        # Note: Python lists begin with [0],
        #       so the 13th column will have an index of 12
        row[12] = row[12].replace('-', ' ')  # only this column is touched
        writer.writerow(row)
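The loop above puts a space where the dash was. If the two halves should instead become separate tab-delimited fields, one option is to split the field and splice both pieces back into the row. This is a sketch under assumptions: the in-memory data stands in for the real file, the column index is 1 rather than 12 to keep the sample short, and the header names codon_from and codon_to are invented for illustration.

```python
import csv
import io

def split_column(rows, index, sep='-'):
    """Yield each row with the field at `index` split in two at the first `sep`."""
    for row in rows:
        row = list(row)
        # partition always returns three parts, so a field without `sep`
        # (e.g. 'null') still yields two fields and the column count stays aligned.
        left, _, right = row[index].partition(sep)
        row[index:index + 1] = [left, right]
        yield row

# Hypothetical three-column data standing in for the real file.
infile = io.StringIO('coords\tcodon_change\tcodon_number\n'
                     'chr22:43089410:-\tATG-AAG\t183\n')
outfile = io.StringIO()

reader = csv.reader(infile, delimiter='\t')
writer = csv.writer(outfile, delimiter='\t', lineterminator='\n')

header = next(reader)
header[1:2] = ['codon_from', 'codon_to']  # assumed names for the two new columns
writer.writerow(header)
writer.writerows(split_column(reader, 1))

print(outfile.getvalue())
```

Splicing the row avoids writing a tab inside a single field, which csv.writer would otherwise wrap in quotes.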

Upvotes: 3
