user1894963

Reputation: 665

Python read from file and remove non-ascii characters

I have the following program that reads a file word by word and writes each word to another file, but without the non-ASCII characters from the first file.

import unicodedata
import codecs
infile = codecs.open('d.txt','r',encoding='utf-8',errors='ignore')
outfile = codecs.open('d_parsed.txt','w',encoding='utf-8',errors='ignore')


for line in infile.readlines():
    for word in line.split():
        outfile.write(word+" ")
    outfile.write("\n")

infile.close()
outfile.close()

The only problem I am facing is that this code does not write a newline to the second file (d_parsed). Any clues?

Upvotes: 4

Views: 18667

Answers (3)

Hamza Tayyab

Reputation: 79

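Use codecs to open the CSV file with the ascii encoding and errors='ignore'; the non-ASCII characters are then dropped as the file is read.

import codecs

# Decode as ASCII and silently drop any bytes that are not valid ASCII
reader = codecs.open("example.csv", 'r', encoding='ascii', errors='ignore')
for reading in reader:
    print(reading)
reader.close()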

Upvotes: -1

jfs

Reputation: 414139

codecs.open() doesn't support universal newlines; e.g., it doesn't translate \r\n to \n while reading on Windows.

Use io.open() instead:

#!/usr/bin/env python
from __future__ import print_function
import io

with io.open('d.txt','r',encoding='utf-8',errors='ignore') as infile, \
     io.open('d_parsed.txt','w',encoding='ascii',errors='ignore') as outfile:
    for line in infile:
        print(*line.split(), file=outfile)

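By the way, if you want to remove non-ASCII characters, you should open the output file with the ascii encoding instead of utf-8; with errors='ignore', any character that cannot be encoded as ASCII is dropped on write.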
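To see the effect in isolation, here is a minimal in-memory sketch. The sample string is made up for illustration; the point is only that encoding to ascii with errors='ignore' drops every character that has no ASCII representation:

```python
# Hypothetical sample text: "cafe" and "naive" with accents, plus a snowman.
text = u"caf\u00e9 na\u00efve \u2603 ok"

# Encoding to ASCII with errors='ignore' silently drops every character
# that cannot be represented in ASCII.
ascii_bytes = text.encode('ascii', 'ignore')
cleaned = ascii_bytes.decode('ascii')

print(repr(cleaned))
```

Note that the surrounding spaces survive, so removing a whole non-ASCII word leaves two adjacent spaces behind.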

If the input encoding is compatible with ascii (such as utf-8) then you could open the file in binary mode and use bytes.translate() to remove non-ascii characters:

#!/usr/bin/env python
nonascii = bytearray(range(0x80, 0x100))
with open('d.txt','rb') as infile, open('d_parsed.txt','wb') as outfile:
    for line in infile: # b'\n'-separated lines (Linux, OSX, Windows)
        outfile.write(line.translate(None, nonascii))

Unlike the first code example, it doesn't normalize whitespace.
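A small in-memory sketch of the same translate() call may help show what is and isn't touched. The sample bytes are an assumption for illustration (UTF-8 for "café" followed by a Windows line ending and some indentation), and the snippet assumes Python 3:

```python
# Delete table: every byte value from 0x80 to 0xFF, i.e. all non-ASCII bytes.
nonascii = bytearray(range(0x80, 0x100))

# Hypothetical sample: "caf" + UTF-8 bytes for e-acute + CRLF + "  ok".
line = b"caf\xc3\xa9\r\n  ok"

# Passing None as the table leaves the remaining bytes unmapped;
# only the bytes listed in nonascii are deleted.
cleaned = line.translate(None, nonascii)

print(repr(cleaned))
```

Both bytes of the multi-byte UTF-8 sequence are removed, while the \r\n and the run of spaces are passed through unchanged.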

Upvotes: 11

Mark Ransom

Reputation: 308111

From the docs for codecs.open:

Note: Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing.

I presume you're using Windows, where the newline sequence is actually '\r\n'. A file opened in text mode will do the conversion from \n to \r\n automatically, but that doesn't happen with codecs.open.

Simply write "\r\n" instead of "\n" and it should work fine, at least on Windows.
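The difference can be checked by looking at the raw bytes each approach writes. This is only a sketch: the file names and temporary directory are made up, and io.open() is given newline='\r\n' explicitly to mimic Windows text-mode behavior so the result is the same on any OS:

```python
import codecs
import io
import os
import tempfile

# Hypothetical scratch files for the comparison.
tmpdir = tempfile.mkdtemp()
codecs_path = os.path.join(tmpdir, 'codecs_out.txt')
io_path = os.path.join(tmpdir, 'io_out.txt')

# codecs.open() opens the underlying file in binary mode,
# so "\n" is written through untouched.
with codecs.open(codecs_path, 'w', encoding='utf-8') as f:
    f.write(u"word\n")

# io.open() in text mode translates "\n" on write; newline='\r\n'
# forces the Windows convention regardless of the current platform.
with io.open(io_path, 'w', encoding='utf-8', newline='\r\n') as f:
    f.write(u"word\n")

# Read both files back as raw bytes to see what was actually stored.
with open(codecs_path, 'rb') as f:
    codecs_bytes = f.read()
with open(io_path, 'rb') as f:
    io_bytes = f.read()

print(repr(codecs_bytes))
print(repr(io_bytes))
```

The codecs.open() file ends in a bare \n, while the text-mode file ends in \r\n, which is the sequence Windows tools such as Notepad have traditionally expected.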

Upvotes: 2
