Reputation: 8264
I am having a Unicode issue with the contents of a variable when writing to a .pdf with Python.
It's outputting this error:
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013'
It is basically getting caught on a dash ('\u2013' is an en dash, U+2013).
I have tried taking that variable, whose contents include the dash, and redefining it with '.encode('utf-8')', for example:
Body = msg.Body
BodyC = Body.encode('utf-8')
And now I get the below error:
Traceback (most recent call last):
File "script.py", line 37, in <module>
pdf.cell(200, 10, txt="Bod: " + BodyC, ln=4, align="C")
TypeError: can only concatenate str (not "bytes") to str
Below is my full code. How could I simply fix the Unicode error in the contents of the 'Body' variable? Converting to utf-8 or western, anything outside of 'latin-1'. Any suggestions?
Full Code:
from fpdf import FPDF
import win32com.client
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
msg = outlook.OpenSharedItem(r"C:\User\language\python\Msg-To-PDF\test_msg.msg")
print (msg.SenderName)
print (msg.SenderEmailAddress)
print (msg.SentOn)
print (msg.To)
print (msg.CC)
print (msg.BCC)
print (msg.Subject)
print (msg.Body)
SenderName = msg.SenderName
SenderEmailAddress = msg.SenderEmailAddress
SentOn = msg.SentOn
To = msg.To
CC = msg.CC
BCC = msg.BCC
Subject = msg.Subject
Body = msg.Body
BodyC = Body.encode('utf-8')
pdf = FPDF()
pdf.add_page()
# pdf.add_font('DejaVu', '', 'DejaVuSansCondensed.ttf', uni=True)
pdf.set_font("Helvetica", style = '', size = 11)
pdf.cell(200, 10, txt="From: " + SenderName, ln=1, align="C")
# pdf.cell(200, 10, border=SentOn, ln=1, align="C")
pdf.cell(200, 10, txt="To: " + To, ln=1, align="C")
pdf.cell(200, 10, txt="CC: " + CC, ln=1, align="C")
pdf.cell(200, 10, txt="BCC: " + BCC, ln=1, align="C")
pdf.cell(200, 10, txt="Subject: " + Subject, ln=1, align="C")
pdf.cell(200, 10, txt="Bod: " + BodyC, ln=4, align="C")
pdf.output("Sample.pdf")
Upvotes: 10
Views: 27508
Reputation: 358
I tried Erik's solution with some changes, and it works great with a mix of English and Arabic text. Sample code to generate a PDF using pyFPDF is posted below.
from datetime import datetime
def getFileName():
    # Build a timestamped file name such as Test_07_14_30_59.pdf
    now = datetime.now()
    time = now.strftime('%d_%H_%M_%S')
    filename = "Test_" + time + ".pdf"
    return filename
from fpdf import FPDF
pdf = FPDF()
#Download NotoSansArabic-Regular.ttf from Google noto fonts
pdf.add_font("NotoSansArabic", style="", fname="./fonts/NotoSansArabic-Regular.ttf", uni=True)
pdf.add_page()
pdf.set_font('Arial', '', 12)
pdf.write(8, 'Hello World')
pdf.ln(8)
# مرحبا Marhaba in arabic
pdf.set_font('NotoSansArabic', '', 12)
text = 'مرحبا'
pdf.write(8, text)
pdf.ln(8)
pdf.output(getFileName(), 'F')
Upvotes: 0
Reputation: 39
You can also change the encoding through the .set_doc_option() method (documentation here). I tried Erik's method, which worked for me, but after adding some more complexity (such as a second PDF, and the write_html() method, which required creating a new class), I got the same error again. Changing the encoding for the whole document should solve the overall problem, as you said.
The readthedocs page says you can only use latin-1 or windows-1252, but pdf.set_doc_option('core_fonts_encoding', 'utf-8') worked for me according to the debugger. Just be aware that some characters will need fixing, like the apostrophe (') showing as â€ÂTM in the PDF.
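The garbled apostrophe is consistent with UTF-8 bytes being displayed one byte at a time by a single-byte encoding. The sketch below only illustrates that effect with plain Python strings. It does not call FPDF, and it assumes (without having checked FPDF's internals) that the core fonts still interpret each byte as one character.

```python
# Illustration only: how a UTF-8 encoded apostrophe turns into mojibake
# when its bytes are later read back with a single-byte encoding.
curly = "\u2019"  # RIGHT SINGLE QUOTATION MARK, common in e-mail bodies

utf8_bytes = curly.encode("utf-8")       # one character becomes three bytes
as_cp1252 = utf8_bytes.decode("cp1252")  # each byte is shown as its own character

print(utf8_bytes.hex())  # the raw bytes in hexadecimal
print(as_cp1252)         # three unrelated characters instead of one apostrophe
```

If this is what happens in your PDF, every character outside ASCII would need the same kind of fixing, not just the apostrophe.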
Hope this is the global fix for this issue you were looking for, even if several months late!
Upvotes: 1
Reputation: 32697
The reason for this error is that you are trying to render a character in your PDF that is outside the code range of the latin-1 encoding. FPDF uses latin-1 as the default encoding for all its built-in fonts.
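To see which characters are responsible before deciding on a fix, a small check like the following may help. It is a minimal sketch using only the standard library; the function name non_latin1_chars and the sample string are made up for illustration.

```python
import unicodedata


def non_latin1_chars(text):
    """Return (index, character, Unicode name) for each character latin-1 cannot encode."""
    found = []
    for index, char in enumerate(text):
        try:
            char.encode("latin-1")
        except UnicodeEncodeError:
            # unicodedata.name() falls back to "UNKNOWN" for unnamed code points
            found.append((index, char, unicodedata.name(char, "UNKNOWN")))
    return found


# Hypothetical sample text containing the en dash from the error message
sample = "Budget 2019\u20132020"
print(non_latin1_chars(sample))
```

Running this on msg.Body should list every character that would trigger the UnicodeEncodeError with the built-in fonts.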
So as a workaround you can just remove all characters from your text that do not fit into the latin-1 encoding (see my other answer for this workaround).
To fix this error and be able to render those characters in your PDF, you need to use fonts that support a wider range of characters. To address this, the FPDF library supports Unicode fonts.
For example, you could get the free Google Noto fonts, which support a wide range of Unicode code points. For most western languages I would recommend the NotoSans font set. But you can also get fonts for many other languages and scripts, including Chinese, Hebrew or Arabic.
Here is how to enable the Unicode fonts in your code for FPDF:
First you need to tell the FPDF library where it can find the font files. In this example I am setting it to the sub-folder fonts of the current folder.
import os
import fpdf
fpdf.set_global("SYSTEM_TTFONTS", os.path.join(os.path.dirname(__file__),'fonts'))
Then you need to add the fonts to your PDF document. In this example I am adding the NotoSans fonts for the styles normal, bold, italic and bold-italic:
pdf = fpdf.FPDF()
pdf.add_font("NotoSans", style="", fname="NotoSans-Regular.ttf", uni=True)
pdf.add_font("NotoSans", style="B", fname="NotoSans-Bold.ttf", uni=True)
pdf.add_font("NotoSans", style="I", fname="NotoSans-Italic.ttf", uni=True)
pdf.add_font("NotoSans", style="BI", fname="NotoSans-BoldItalic.ttf", uni=True)
Now you can use the new fonts normally in your PDF document with set_font(). Here is an example for normal text:
pdf.set_font("NotoSans", size=12)
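If you want to keep a built-in font for plain text and switch to the Unicode font only when needed, one possible approach is to test each string first. This is a sketch, not part of FPDF: pick_font is a hypothetical helper, and the font names are just the ones used in this thread.

```python
def pick_font(text, core_font="Helvetica", unicode_font="NotoSans"):
    """Return the core font name if text fits latin-1, else the Unicode font name."""
    try:
        text.encode("latin-1")
    except UnicodeEncodeError:
        return unicode_font
    return core_font


print(pick_font("Hello World"))    # fits latin-1
print(pick_font("2019\u20132020")) # contains an en dash
```

With the fonts registered as shown above, this could be used as pdf.set_font(pick_font(Body), size=12). Simply using the Unicode font for everything is the less fiddly option.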
Upvotes: 7
Reputation: 32697
A workaround is to convert all text to latin-1 encoding before passing it on to the library. You can do that with the following command:
text2 = text.encode('latin-1', 'replace').decode('latin-1')
text2 will be free of any non-latin-1 characters. However, some characters may be replaced with ?
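To lose less information, common typographic characters can be swapped for ASCII look-alikes before the 'replace' step. The following is a minimal sketch of that idea; the mapping and the name to_latin1_safe are my own assumptions, so extend the table with whatever shows up in your data.

```python
# Hypothetical mapping of typographic characters to plain ASCII equivalents
ASCII_FALLBACKS = {
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2026": "...",  # horizontal ellipsis
}


def to_latin1_safe(text):
    """Substitute known characters, then replace anything else latin-1 cannot encode with '?'."""
    for char, replacement in ASCII_FALLBACKS.items():
        text = text.replace(char, replacement)
    return text.encode("latin-1", "replace").decode("latin-1")


print(to_latin1_safe("2019\u20132020"))
```

In the question's code this would be used as BodyC = to_latin1_safe(Body), which keeps BodyC a str, so the concatenation with "Bod: " no longer fails.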
Upvotes: 22