Reputation: 3173
I have a huge Word document with more than 10,000 lines. It contains random empty lines and odd characters. I want to save it as a .txt or .fasta file, read each line as a string, and run it through my program to pull out only the FASTA headers and their sequences.
I have searched online, but all of the posts about encoding issues just make it more confusing for me.
So far I have tried:
1) Saving the Word document as a .txt file with the Unicode (UTF-8) option and running the code below. About 1,000 lines were written before it hit an error.
    with open('TemplateMaster2.txt', encoding='utf-8') as fin, open('OnlyFastaseq.fasta', 'w') as fout:
        for line in fin:
            if line.startswith('>'):
                fout.write(line)
                fout.write(next(fin))
Error message:
UnicodeEncodeError: 'charmap' codec can't encode character '\uf044' in position 11: character maps to <undefined>
2) Saving the Word document as a .txt file with the Unicode (UTF-8) option again, but opening the input without an explicit encoding. Roughly 1,000 lines were written before it hit a different error.
    with open('TemplateMaster2.txt') as fin, open('OnlyFastaseq.fasta', 'w') as fout:
        for line in fin:
            if line.startswith('>'):
                fout.write(line)
                fout.write(next(fin))
Error message:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 5664: character maps to <undefined>
I could try different options for saving the Word document as a .txt file, but there are too many of them and I am not sure what the problem really is. Should I save the Word document as .txt with the option 'Unicode', 'Unicode (Big-Endian)', 'Unicode (UTF-7)', 'Unicode (UTF-8)', 'US-ASCII', or something else?
Upvotes: 0
Views: 1229
Reputation: 31679
The only thing that seems to be missing is encoding='utf-8' in the open() call for fout:
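    # Read and write with the same explicit encoding so neither side
    # falls back to the platform default.
    with open('TemplateMaster2.txt', 'r', encoding='utf-8') as fin, open('OnlyFastaseq.fasta', 'w', encoding='utf-8') as fout:
        for line in fin:
            if line.startswith('>'):
                fout.write(line)
                # The default '' avoids StopIteration if a header is the last line.
                seq = next(fin, '')
                fout.write(seq)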
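To see why the output file needs the encoding as well, here is a small check using the character from the first traceback. It assumes the 'charmap' codec in the error messages is cp1252, the usual default on Western-locale Windows; that is a guess, since the question does not name the platform encoding.

```python
# '\uf044' is a private-use code point, the kind Word symbol fonts tend to produce.
ch = '\uf044'

# Assumption: the platform default behind 'charmap' is cp1252.
try:
    ch.encode('cp1252')
    cp1252_can_encode = True
except UnicodeEncodeError:
    cp1252_can_encode = False

# UTF-8 can represent every code point, so this always succeeds.
utf8_bytes = ch.encode('utf-8')

# The reverse direction: byte 0x81 has no mapping in cp1252.
try:
    b'\x81'.decode('cp1252')
    cp1252_can_decode_0x81 = True
except UnicodeDecodeError:
    cp1252_can_decode_0x81 = False

print(cp1252_can_encode)       # False
print(utf8_bytes)              # b'\xef\x81\x84'
print(cp1252_can_decode_0x81)  # False
```

Note that the UTF-8 form of this character contains the byte 0x81, so both errors in the question may well come from the same character: once when decoding the input with the default codec, once when encoding the output with it.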
Did you double-check that every sequence really sits on a single line?
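If some sequences do wrap over several lines, a variant along these lines could collect them. This is only a sketch: the function name extract_fasta and the SEQ_LINE pattern are made up for illustration, and it assumes a sequence is a run of consecutive lines made up only of letters, '*' or '-', ending at the first line that does not fit.

```python
import re

# Hypothetical pattern: a sequence line contains only letters, '*' or '-'.
SEQ_LINE = re.compile(r'^[A-Za-z*-]+$')

def extract_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith('>'):
            # A new header closes any record still being collected.
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line, []
        elif header is not None:
            if SEQ_LINE.match(line):
                chunks.append(line)
            else:
                # Blank or non-sequence line: the current record ends here.
                yield header, ''.join(chunks)
                header, chunks = None, []
    if header is not None:
        yield header, ''.join(chunks)

# Small in-memory example standing in for the real file.
sample = ['junk\n', '>seq1 desc\n', 'ACGT\n', 'TTGA\n', '\n',
          'weird \uf044 text\n', '>seq2\n', 'MKV\n']
records = list(extract_fasta(sample))
print(records)
```

With the real files, you would pass fin (opened with encoding='utf-8') to extract_fasta and write each pair to fout as header + '\n' + sequence + '\n'. If blank lines can occur in the middle of a sequence, the rule for ending a record would need to be relaxed.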
Upvotes: 1