Reputation: 147
I want to make a Python script that uses a regular expression to filter the lines containing certain Greek words out of a source text I provide, and then writes those lines to 3 different files depending on the words encountered.
Here is my code so far:
import regex
source=open('source.txt', 'r')
oti=open('results_oti.txt', 'w')
tis=open('results_tis.txt', 'w')
ton=open('results_ton.txt', 'w')
regex_oti='^.*\b(ότι|ό,τι)\b.*$'
regex_tis='^.*\b(της|τις)\b.*$'
regex_ton='^.*\b(τον|των)\b.*$'
for line in source.readlines():
    if regex.match(regex_oti, line):
        oti.write(line)
    if regex.match(regex_tis, line):
        tis.write(line)
    if regex.match(regex_ton, line):
        ton.write(line)
source.close()
oti.close()
tis.close()
ton.close()
quit()
The words that I check for are ότι | ό,τι | της | τις | τον | των.
The problem is that those 3 regular expressions (regex_oti, regex_tis, regex_ton) do not match anything, so the 3 text files I created do not contain anything.
Maybe it's an encoding problem (Unicode)?
Upvotes: 3
Views: 1796
Reputation: 1124170
You are trying to match encoded values, as bytes, with a regular expression that most likely won't match unless your Python source encoding exactly matches that of the input files, and then only if you are not using a multi-byte encoding such as UTF-8.
You need to decode the input files to Unicode values, and use a Unicode regular expression. This means you need to know the codecs used for the input files. It's easiest to use io.open() to handle decoding and encoding:
import io
import re
regex_oti = re.compile(ur'^.*\b(ότι|ό,τι)\b.*$')
regex_tis = re.compile(ur'^.*\b(της|τις)\b.*$')
regex_ton = re.compile(ur'^.*\b(τον|των)\b.*$')
with io.open('source.txt', 'r', encoding='utf8') as source, \
        io.open('results_oti.txt', 'w', encoding='utf8') as oti, \
        io.open('results_tis.txt', 'w', encoding='utf8') as tis, \
        io.open('results_ton.txt', 'w', encoding='utf8') as ton:
    for line in source:
        if regex_oti.match(line):
            oti.write(line)
        if regex_tis.match(line):
            tis.write(line)
        if regex_ton.match(line):
            ton.write(line)
Note the ur'...' raw unicode strings used to define the regular expression patterns; these are now Unicode patterns and match codepoints, not bytes.
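The raw-string prefix matters on its own, independent of the Unicode issue: in a plain string literal, \b is the backspace character (U+0008), not the regex word-boundary token. A minimal illustration (not from the original answer):

```python
# In a plain string literal, \b is interpreted as the backspace
# control character; a raw string keeps the backslash so the regex
# engine sees the word-boundary token \b.
plain = '\b'
raw = r'\b'

print(len(plain), repr(plain))  # → 1 '\x08'
print(len(raw), repr(raw))      # → 2 '\\b'
```

This is why the question's non-raw patterns like '^.*\b(ότι|ό,τι)\b.*$' could never match: the engine was looking for a literal backspace character in the line.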
The io.open() call makes sure you read unicode, and when you write unicode values to the output files the data is automatically encoded to UTF-8. I picked UTF-8 for the input file as well, but you need to check what the correct codec is for that file and stick to that.
I've used a with statement here to have the files closed automatically, used source as an iterable (no need to read all lines into memory in one go), and pre-compiled the regular expressions.
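For reference, under Python 3 the same approach gets simpler: str is already Unicode, the built-in open() takes an encoding argument, and re matches codepoints by default, so io.open() and the ur'...' prefix are no longer needed. A sketch of the filtering logic as a function (the name split_lines and the use of search() instead of the ^.*...$ anchors are my own simplifications):

```python
import re

# Python 3: plain raw strings are already Unicode patterns, and \w/\b
# treat Greek letters as word characters by default.
regex_oti = re.compile(r'\b(ότι|ό,τι)\b')
regex_tis = re.compile(r'\b(της|τις)\b')
regex_ton = re.compile(r'\b(τον|των)\b')

def split_lines(lines):
    """Return (oti, tis, ton) lists of the lines matching each pattern.

    search() anywhere in the line replaces the ^.*...$ wrapper that the
    question used with match().
    """
    oti, tis, ton = [], [], []
    for line in lines:
        if regex_oti.search(line):
            oti.append(line)
        if regex_tis.search(line):
            tis.append(line)
        if regex_ton.search(line):
            ton.append(line)
    return oti, tis, ton
```

Reading and writing would then use open('source.txt', encoding='utf8') directly, with the same with-statement structure as above.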
Upvotes: 1