Reputation: 2425
I have a very large text file, where most of the lines are composed of ASCII characters, but a small fraction of lines have non-ASCII characters. What is the fastest way to create a new text file containing only the ASCII lines? Right now I am checking each character in each line to see if it's ASCII, and writing each line to the new file if all the characters are ASCII, but this method is rather slow. Also, I am using Python, but would be open to using other languages in the future.
Edit: updated with code
#!/usr/bin/python
import string
def isAscii(s):
for c in s:
if ord(c) > 127 or ord(c) < 0:
return False
return True
f = open('data.tsv')
g = open('data-ASCII-only.tsv', 'w')
linenumber = 1
for line in f:
if isAscii(line):
g.write(line)
linenumber += 1
f.close()
g.close()
Upvotes: 0
Views: 143
Reputation: 13576
The other approaches given will work if, and only if, the file is encoded in such a way that "non-ASCII" is equivalent to "high bit set", such as Latin-1 or UTF-8. Here's a program in Python 3 that will work with any encoding.
#!/usr/bin/env python3
import codecs
in_fname = "utf16file"
in_encoding = "utf-16"
out_fname = "ascii_lines"
out_encoding = "ascii"
def is_ascii(s):
try:
s.encode("ascii")
except UnicodeEncodeError:
return False
return True
f_in = codecs.open(in_fname, "r", in_encoding)
f_out = codecs.open(out_fname, "w", out_encoding)
for s in f_in:
if is_ascii(s):
f_out.write(s)
f_in.close()
f_out.close()
Upvotes: 0
Reputation: 1485
The following suggestion uses a command-line filter (ie, you would use it on the shell command line), this example works in a shell on linux or unix systems, maybe OSX too (I've heard OSX is BSDish):
$ cat big_file | tr -dc '\000-\177' > big_file_ascii_only
It uses the "tr" (translate) filter. In this case, we are telling tr to "delete" all characters which are outside the range octal-000 to octal-177. You may wish to tweak the charcter set - check the man page for tr to get some ideas on other ways to specify the characters you want to keep (or delete)
Upvotes: 0
Reputation: 1478
You can use grep: "-v" keeps the opposite, -P uses perl regex syntax, and [\x80-\xFF] is the character range for non-ascii.
grep -vP "[\x80-\xFF]" data.tsv > data-ASCII-only.tsv
See this question How do I grep for all non-ASCII characters in UNIX for more about search for ascii characters with grep.
Upvotes: 1