Xiong89
Xiong89

Reputation: 777

Python: How to output the correct chromosome name in the output?

I have encounter this error when I try to modify the code where the chromosome name doesn't follow my input file chromosome name. Basically the code below is to read the input file and notify the location of shorter sequence and output the location sequence based on the information given in the file. For example in chr4: 154742507-154742714, the value 151 state the location of first bases and the 'CCCAGGCTGG' location is 173 - 182. Therefore using the code below should be able to return me the exact location by adding 173 to 154742507 and obtain the output below. Can anyone help me?

Below is the code with example input text file.

Input.txt

chr4:154742507-154742714


                           CCCAGGCTGG
151  AGTCTTGCTTTTTTTGTCGTTGCCCAGGCTGGAGTGCAGTGGCACCATCTCGGCTCAC


chr9:47303792-47303999

      CCAGCCTGGG
1    TCCAGCCTGGGTGACAGCGTGAGGCTCTTGTCTCAAATAGAAAAAAAACAAAGAACAAAAAACAAAAAACCACCA

Output

chr1    154742680   154742690
chr1    47303794    47303804

Expected output

chr4    154742680   154742690
chr9    47303794    47303804

Code

import re  # regular expressions, not needed (alternatives: the `split` method) but convenient

result = []
output_file=open('output.bed','w')
with open('Input.txt') as f:
    for line in f:
        if line.startswith('chr'):
            label = line.strip()
        elif line[0] == ' ':
            # short sequence
            length = len(line.strip())
            # find the index of the beginning of the short sequence
            for i, c in enumerate(line):
                if c.isalpha():
                    short_index = i
                    break
        elif line[0].isdigit():
            # long sequence
            n = line.split(' ')[0]
            # find the index of the beginning of the long sequence
            for i, c in enumerate(line):
                if c.isalpha():
                    long_index = i
                    break
            start = int(n) + short_index - long_index
            start -= 1
            end = start + length
            result.append('{} {} {}'.format(label, start, end))
            offset, n, start, length = 0, 0, 0, 0
output_line= "\n".join(result)
output_file.write(output_line)
output_file.close()

output_file=open('last_output.bed','w')
with open('output.bed') as fin:
    for line in fin:
        start, _, offset_start, offset_end = re.search(r'[^:]*:(\d+)\D+(\d+)\D+(\d+)\D+(\d+)', line).groups()
        output_line=('chr1\t{}\t{}\n'.format(int(start) + int(offset_start) + 1,int(start) + int(offset_end) + 1))
        output_file.write(output_line)
output_file.close()

Upvotes: 2

Views: 261

Answers (1)

Blckknght
Blckknght

Reputation: 104792

If I understand the question correctly, the issue you're having has only to do with the chromosome number (chr##) that is being outputted incorrectly.

This seems a bit obvious. At the end of your code, you're hard coding it:

output_line=('chr1\t{}\t{}\n'.format(stuff))

If you don't want the output to always show chr1, you'll need to change that.

The regex on the previous line seems to be matching the chromosome number from the file, you're just not capturing it in a group you can use later. Try:

chromosome, start, _, offset_start, offset_end = re.search(r'([^:]*):(\d+)\D+(\d+)\D+(\d+)\D+(\d+)', line).groups()
output_line=('{}\t{}\t{}\n'.format(chromosome, int(start) + int(offset_start) + 1,int(start) + int(offset_end) + 1))

This is still rather ugly, but should work. Note that it would be a whole lot easier if you made the proper output from your initial loop, rather than writing out an intermediate format and then needing to reparse it.

Upvotes: 4

Related Questions