Reputation: 777
I have encounter this error when I try to modify the code where the chromosome name doesn't follow my input file chromosome name. Basically the code below is to read the input file and notify the location of shorter sequence and output the location sequence based on the information given in the file. For example in chr4: 154742507-154742714, the value 151 state the location of first bases and the 'CCCAGGCTGG' location is 173 - 182. Therefore using the code below should be able to return me the exact location by adding 173 to 154742507 and obtain the output below. Can anyone help me?
Below is the code with example input text file.
Input.txt
chr4:154742507-154742714
CCCAGGCTGG
151 AGTCTTGCTTTTTTTGTCGTTGCCCAGGCTGGAGTGCAGTGGCACCATCTCGGCTCAC
chr9:47303792-47303999
CCAGCCTGGG
1 TCCAGCCTGGGTGACAGCGTGAGGCTCTTGTCTCAAATAGAAAAAAAACAAAGAACAAAAAACAAAAAACCACCA
Output
chr1 154742680 154742690
chr1 47303794 47303804
Expected output
chr4 154742680 154742690
chr9 47303794 47303804
Code
import re # regular expressions, not needed (alternatives: the `split` method) but convenient
result = []
output_file=open('output.bed','w')
with open('Input.txt') as f:
for line in f:
if line.startswith('chr'):
label = line.strip()
elif line[0] == ' ':
# short sequence
length = len(line.strip())
# find the index of the beginning of the short sequence
for i, c in enumerate(line):
if c.isalpha():
short_index = i
break
elif line[0].isdigit():
# long sequence
n = line.split(' ')[0]
# find the index of the beginning of the long sequence
for i, c in enumerate(line):
if c.isalpha():
long_index = i
break
start = int(n) + short_index - long_index
start -= 1
end = start + length
result.append('{} {} {}'.format(label, start, end))
offset, n, start, length = 0, 0, 0, 0
output_line= "\n".join(result)
output_file.write(output_line)
output_file.close()
output_file=open('last_output.bed','w')
with open('output.bed') as fin:
for line in fin:
start, _, offset_start, offset_end = re.search(r'[^:]*:(\d+)\D+(\d+)\D+(\d+)\D+(\d+)', line).groups()
output_line=('chr1\t{}\t{}\n'.format(int(start) + int(offset_start) + 1,int(start) + int(offset_end) + 1))
output_file.write(output_line)
output_file.close()
Upvotes: 2
Views: 261
Reputation: 104792
If I understand the question correctly, the issue you're having has only to do with the chromosome number (chr##
) that is being outputted incorrectly.
This seems a bit obvious. At the end of your code, you're hard coding it:
output_line=('chr1\t{}\t{}\n'.format(stuff))
If you don't want the output to always show chr1
, you'll need to change that.
The regex on the previous line seems to be matching the chromosome number from the file, you're just not capturing it in a group you can use later. Try:
chromosome, start, _, offset_start, offset_end = re.search(r'([^:]*):(\d+)\D+(\d+)\D+(\d+)\D+(\d+)', line).groups()
output_line=('{}\t{}\t{}\n'.format(chromosome, int(start) + int(offset_start) + 1,int(start) + int(offset_end) + 1))
This is still rather ugly, but should work. Note that it would be a whole lot easier if you made the proper output from your initial loop, rather than writing out an intermediate format and then needing to reparse it.
Upvotes: 4