Reputation: 383
I am using Python's codecs module to write some UTF-8 text to a file:
# -*- coding: utf-8 -*-
import codecs

filename = 'afile'
with codecs.open(filename, encoding='utf-8', mode='w') as fw:
    fw.write('<DOC>\n<DOCNO>')
    fw.write(filename)
    fw.write('</DOCNO>\n<TEXT>\n')
    fw.write('কাজ'.decode('utf-8'))
    fw.write('\n</TEXT>\n</DOC>')
Now if I run Lemur (http://www.lemurproject.org/) on the directory with this file, Lemur tells me the document is 'malformed'.
0:00: Opened /home/userA/Documents/test_corpus/afile
0:00: Error in /home/userA/Documents/test_corpus/afile : ../src/TaggedDocumentIterator.cpp(213): Malformed document: /home/userA/Documents/test_corpus/afile
But if I open the file in gedit, add a random character and delete it again (so that the file content remains the same), and then save the file, Lemur runs perfectly:
0:00: Opened /home/userA/Documents/test_corpus/afile
0:00: Documents parsed: 1 Documents indexed: 1
0:00: Closed /home/userA/Documents/test_corpus/afile
So is there a difference between how Python and gedit save a text file that would make Lemur respond differently in the two scenarios?
Upvotes: 0
Views: 93
Reputation: 1123490
You are writing an output file without a newline on the last line:
fw.write('\n</TEXT>\n</DOC>')
GEdit probably adds that extra newline when saving. Add an extra \n:
fw.write('\n</TEXT>\n</DOC>\n')
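To confirm that the trailing newline is the only difference, you can inspect the last byte of the file before and after the change. The following is a minimal sketch, not a definitive fix: the helper names `ends_with_newline` and `write_doc` are made up for illustration, and it uses `io.open` so that it behaves the same on Python 2 and Python 3. Whether Lemur actually requires the final newline is an inference from the gedit behaviour described in the question, not something this code checks.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def ends_with_newline(path):
    """Return True if the last byte of the file is a line feed."""
    with io.open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        if f.tell() == 0:
            return False  # an empty file has no final newline
        f.seek(-1, os.SEEK_END)
        return f.read(1) == b'\n'


def write_doc(path, docno, text, final_newline=True):
    """Write one <DOC> record (hypothetical helper mirroring the question's code)."""
    # newline='\n' keeps line endings untranslated on every platform.
    with io.open(path, 'w', encoding='utf-8', newline='\n') as fw:
        fw.write(u'<DOC>\n<DOCNO>')
        fw.write(docno)
        fw.write(u'</DOCNO>\n<TEXT>\n')
        fw.write(text)
        fw.write(u'\n</TEXT>\n</DOC>')
        if final_newline:
            fw.write(u'\n')  # the newline the original script left out


if __name__ == '__main__':
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, 'afile')
    write_doc(path, u'afile', u'কাজ', final_newline=False)
    print(ends_with_newline(path))
    write_doc(path, u'afile', u'কাজ', final_newline=True)
    print(ends_with_newline(path))
```

Running `ends_with_newline` on the file before and after re-saving it in gedit should show whether gedit appended the newline.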
Upvotes: 2