Reputation: 53
I have a multiple sequence alignment (MSA) file derived from mafft in clustal format which I want to import into Python and save into a PDF file. I need to import the file and then highlight some specific words. I've tried to simply import the pdf of the MSA but after the highlight command doesn't work.
I need to print the file like this:
CLUSTAL format alignment by MAFFT FFT-NS-i (v7.453)
Consensus --------------------------------------acgttttcgatatttatgccat
AMP tttatattttctcctttttatgatggaacaagtctgcgacgttttcgatatttatgccat
**********************
Consensus atgtgcatgttgtaaggttgaaagcaaaaatgaggggaaaaaaaatgaggtttttaataa
AMP atgtgcatgttgtaaggttgaaagcaaaaatgaggggaaaaaaaatgaggtttttaataa
************************************************************
Consensus ctacacatttagaggtctaggaaataaaggagtattaccatggaaatgtatttccctaga
AMP ctacacatttagaggtctaggaaataaaggagtattaccatggaaatgtaattccctaga
************************************************** *********
Consensus tatgaaatattttcgtgcagttacaacatatgtgaatgaatcaaaatatgaaaaattgaa
AMP tatgaaatattttcgtgcagttacaacatatgtgaatgaatcaaaatatgaaaaattgaa
************************************************************
Consensus atataagagatgtaaatatttaaacaaagaaactgtggataatgtaaatgatatgcctaa
AMP atataagagatgtaaatatttaaacaaagaaactgtggataatgtaaatgatatgcctaa
************************************************************
Consensus ttctaaaaaattacaaaatgttgtagttatgggaagaacaaactgggaaagcattccaaa
AMP ttctaaaaaattacaaaatgttgtagttatgggaagaacaaactgggaaagcattccaaa
************************************************************
Consensus aaaatttaaacctttaagcaataggataaatgttatattgtctagaaccttaaaaaaaga
AMP aaaatttaaacctttaagcaataggataaatgttatattgtctagaaccttaaaaaaaga
************************************************************
Consensus agattttgatgaagatgtttatatcattaacaaagttgaagatctaatagttttacttgg
AMP agattttgatgaagatgtttatatcattaacaaagttgaagatctaatagttttacttgg
************************************************************
Consensus gaaattaaattactataaatgttttattataggaggttccgttgtttatcaagaattttt
AMP gaaattaaattactataaatgttttattataggaggttccgttgtttatcaagaattttt
************************************************************
Consensus agaaaagaaattaataaaaaaaatatattttactagaataaatagtacatatgaatgtga
AMP agaaaagaaattaataaaaaaaatatattttactagaataaatagtacatatgaatgtga
************************************************************
Consensus tgtattttttccagaaataaatgaaaatgagtatcaaattatttctgttagcgatgtata
AMP tgtattttttccagaaataaatgaaaatgagtatcaaattatttctgttagcgatgtata
************************************************************
Consensus tactagtaacaatacaacattgga----------------------------------
AMP tactagtaacaatacaacattggattttatcatttataagaaaacgaataataaaatg
************************
How can I import the alignment and print in the new PDF with the right alignment of the sequences.
Thanks
Upvotes: 1
Views: 759
Reputation: 3041
adding, a second answer because the OP request about
I need to import the file and then highlight some specific word
was making me uneasy. I kept using the Pyfpdf2
library and made use of the Biopython SeqRecord.letter_annotations
attribute to track Nucleotides index composing the words found by RegEx. Tried to use the page.search_for()
of the PyMuPDF
library directly on the output pdf of my previous answer but was not getting word accross line breaks [if anybody know better, please help].
Here my code, it's a two half problem, first find the word (in my example is 'atat'), then write everything to pdf, I stole it from from /master/Bio/AlignIO/ClustalIO.py/ def write_alignment(self, alignment): from class ClustalWriter(SequentialAlignmentWriter) with some modding [I believe but not absolutely sure that ...A cell by definition is uniform in font and color... ]:
import re
from Bio import AlignIO # Biopython 1.80
from fpdf import FPDF # pip install fpdf2
alignment = AlignIO.read("Multi.txt", "clustal")
word_to_find = 'atat'
for rec in alignment:
rec.letter_annotations['highlight'] = 'N'*len(rec.seq)
# for rec in alignment:
# print('rec.id : ', rec.id, 'highlight : ', rec.letter_annotations['highlight'])
for rec in alignment:
# print(str(rec.seq))
found = [m.start() for m in re.finditer( '(?='+re.escape(word_to_find)+')' , re.escape(str(rec.seq.replace('-', 'Z'))))]
# print(rec.id , ' found : ', found)
for i in found :
a = list(rec.letter_annotations['highlight'])
a[i:i+len(word_to_find)] = 'Y'*len(word_to_find)
rec.letter_annotations['highlight'] = ''.join(i for i in a)
# print(rec.letter_annotations['highlight'])
max_length = alignment.get_alignment_length()
# stolen from /master/Bio/AlignIO/ClustalIO.py/ def write_alignment(self, alignment): from class ClustalWriter(SequentialAlignmentWriter):
def write_alignment(alignment):
x = 2
y = 2
pdf = FPDF(orientation="P", unit="mm", format="A4")
# Add a page
pdf.add_page()
pdf.add_font('FreeMono', '', 'FreeMono.ttf')
pdf.set_font("FreeMono", size = 12)
pdf.set_fill_color(255, 255, 0)
"""Use this to write (another) single alignment to an open file."""
if len(alignment) == 0:
raise ValueError("Must have at least one sequence")
if alignment.get_alignment_length() == 0:
# This doubles as a check for an alignment object
raise ValueError("Non-empty sequences are required")
# Old versions of the parser in Bio.Clustalw used a ._version property
try:
version = str(alignment._version)
except AttributeError:
version = ""
if not version:
version = "1.81"
if version.startswith("2."):
# e.g. 2.0.x
output = f"CLUSTAL {version} multiple sequence alignment"
else:
# e.g. 1.81 or 1.83
output = f"CLUSTAL X ({version}) multiple sequence alignment"
pdf.cell(0, y, txt = output , border = 0, new_x='LMARGIN' , new_y="NEXT" , align = 'L', fill = False)
pdf.cell(0, 4*y, txt = '' , border = 0, new_x='LMARGIN' , new_y="NEXT" , align = 'L', fill = False)
cur_char = 0
# # max_length = len(alignment[0])
max_length = alignment.get_alignment_length()
if max_length <= 0:
raise ValueError("Non-empty sequences are required")
if "clustal_consensus" in alignment.column_annotations:
star_info = alignment.column_annotations["clustal_consensus"]
else:
try:
# This was originally stored by Bio.Clustalw as ._star_info
star_info = alignment._star_info
except AttributeError:
star_info = None
# keep displaying sequences until we reach the end
rec_cnt = 0
while cur_char != max_length:
# calculate the number of sequences to show, which will
# be less if we are at the end of the sequence
if (cur_char + 50) > max_length:
show_num = max_length - cur_char
else:
show_num = 50
# go through all of the records and print out the sequences
# when we output, we do a nice 80 column output, although this
# may result in truncation of the ids.
for record in alignment:
rec_cnt += 1
# Make sure we don't get any spaces in the record
# identifier when output in the file by replacing
# them with underscores:
line = record.id[0:30].replace(" ", "_").ljust(50)
pdf.cell(50, y, txt = line, border = 0, new_x='RIGHT' , align = 'L', fill = False)
line_seq = list(str(record.seq[cur_char : (cur_char + show_num)]))
line_seq_highlight = list(str(record.letter_annotations['highlight'][cur_char : (cur_char + show_num)]))
for i in range(len(line_seq)):
if line_seq_highlight[i] == 'N' :
filling = False
if line_seq_highlight[i] == 'Y' :
filling = True
else:
filling = False
pdf.cell(x, y, txt = line_seq[i], border = 0, new_x= 'RIGHT' , align = 'C', fill = filling)
pdf.cell(x, 2*y, txt = ' ' , border = 0, new_x="LMARGIN" , new_y="NEXT" , align = 'L', fill = False)
# now we need to print out the star info, if we've got it
if star_info and rec_cnt == len(alignment):
pdf.cell(x, 2*y, txt = ' ' , border = 0, new_x="LMARGIN" , new_y="NEXT" , align = 'L', fill = False)
rec_cnt = 0
# print(star_info)
line = (' ' * 50)
pdf.cell(50, y, txt = line, border = 0, new_x='RIGHT' , align = 'L', fill = False)
star_info_seq = list(str(star_info[cur_char : (cur_char + show_num)]))
for i in range(len(star_info_seq)):
pdf.cell(x, y, txt = star_info_seq[i], border = 0, new_x= 'RIGHT' , align = 'C', fill = False)
pdf.cell(x, 2*y, txt = ' ' , border = 0, new_x="LMARGIN" , new_y="NEXT" , align = 'L', fill = False)
cur_char += show_num
pdf.cell(x, 3*y, txt = ' ' , border = 0, new_x="LMARGIN" , new_y="NEXT" , align = 'L', fill = False)
pdf.output("out.pdf")
write_alignment(alignment)
I tested it, with the Multi.txt
Clustal alignment from above, output is
out.pdf
:
.
One last note:
needed to replace -
with Z
in the Consensus string to be able to get reliable results from the RegEX line : found = [m.start() for m in re.finditer( '(?='+re.escape(word_to_find)+')' , re.escape(str(rec.seq.replace('-', 'Z'))))]
, not sure what is going on here, got the snippet from a SO post.
Upvotes: 1
Reputation: 3041
Ok, figured out a way, not sure its the best one,
nedd to install fpdf2
(pip install fpdf2
)
from io import StringIO
from Bio import AlignIO # Biopython 1.80
from fpdf import FPDF # pip install fpdf2
alignment = AlignIO.read("Multi.txt", "clustal")
stri = StringIO()
AlignIO.write(alignment, stri, 'clustal' )
# print(stri.getvalue())
stri_lines = [ i for i in stri.getvalue().split('\n')]
# print(stri_lines)
pdf = FPDF(orientation="P", unit="mm", format="A4")
# Add a page
pdf.add_page()
pdf.add_font('FreeMono', '', 'FreeMono.ttf')
pdf.set_font("FreeMono", size = 8)
for x in stri_lines:
pdf.cell(0, 5, txt = x, border = 0, new_x="LMARGIN" , new_y="NEXT", align = 'L', fill = False)
# print(len(x))
pdf.output("out.pdf")
output pdf out.pdf
:
Not sure why the file Header is changed, think is something within Biopythion (!!! ???) you can check adding:
with open('file_output.txt', 'w') as filehandler:
AlignIO.write(alignment, filehandler, 'clustal')
I had to place a tff FreeMono font (Mono spaced font in my script directory) see: pdf.add_font('FreeMono', '', 'FreeMono.ttf')
otherwise the alignement won't be printed in the correct way
Which fonts have the same width for every character?.
Attached a png of my pdf. See that you can highlight it,
using:
pdf.set_fill_color(255, 255, 0)
filling = False
for x in stri_lines:
if 'Consensus' in x:
filling = True
else:
filling = False
pdf.cell(0, 5, txt = x, border = 0, new_x="LMARGIN" , new_y="NEXT", align = 'L', fill = filling)
or something similar you can highlight while printing:
Upvotes: 1