Reputation: 527
I have a script that reads a large number of text files written in Swedish (frequently containing the letters å, ä and ö). Everything prints just fine from the dictionary if I loop over d and dictionary[d]. However, the regular expression (built from the raw input with u'.*' appended) fails to return UTF-8 properly.
# -*- coding: utf8 -*-
from os import listdir
import re
import codecs
import sys
print "Välkommen till SOU-sök!"
search_word = raw_input("Ange sökord: ")
dictionary = {}
for filename in listdir("20tal"):
    with open("20tal/" + filename) as currentfile:
        text = currentfile.read()
        dictionary[filename] = text
for d in dictionary:
    result = re.findall(search_word + u'.*', dictionary[d], re.UNICODE)
    if len(result) > 0:
        print "Filnament är:\n %s \noch sökresultatet är:\n %s" % (d, result)
Edit: The output is as follows:
If I input:
katt
I get the following output:
Filnament är: Betänkande och förslag angående vissa ekonomiska spörsmål berörande enskilda järnvägar - SOU 1929:2.txt
och sökresultatet är:
['katter, r\xc3\xa4ntor m. m.', 'katter m- m., men exklusive r \xc3\xa4 nor m.', 'kattemedel subventionerar', av totalkostnaderna, ofta \xe2\x80\x94 med eller utan', 'kattas den nuvarande bilparkens kapitalv\xc3\xa4rde till 500 milj.
Here, the filename d is printed correctly, but the result of re.findall is not.
Upvotes: 2
Views: 1230
Reputation: 47169
In Python 2.x, printing a list shows the repr() of each item, so non-ASCII characters normally come out escaped unless you loop through the items or join them. Maybe try something like this:
result = ', '.join(result)
if len(result) > 0:
    print ( u"Filnament är:\n %s \noch sökresultatet är:\n %s" % (d, result.decode('utf-8')))
Input:
katt
Result:
katter, räntor m. m. katter m- m., men exklusive r ä nor m. kattemedel subventionerar av totalkostnaderna, ofta — med eller utan kattas den nuvarande bilparkens kapitalvärde till 500 milj
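To make the difference concrete, here is a minimal sketch. It is written for Python 3, where the split between bytes and text is explicit (in Python 2 the same byte strings are plain str objects). The helper name readable_matches and the sample data are made up for illustration, and the sketch assumes the files are UTF-8 encoded.

```python
import re


def readable_matches(raw_text, search_word, encoding="utf-8"):
    """Decode the raw bytes first, then search, so the matches are text."""
    text = raw_text.decode(encoding)
    return re.findall(search_word + r".*", text, re.UNICODE)


# Hypothetical sample standing in for the contents of one file.
raw = (b"katter, r\xc3\xa4ntor m. m.\n"
       b"hundar\n"
       b"kattas den nuvarande bilparkens kapitalv\xc3\xa4rde")

# Searching the undecoded bytes gives byte strings back.
byte_matches = re.findall(b"katt.*", raw)
text_matches = readable_matches(raw, "katt")

# Printing a list shows the repr() of each item, so the bytes stay escaped.
print(repr(byte_matches))

# Joining the decoded items gives one readable string.
joined = ", ".join(text_matches)
print(joined)
```

Decoding once, right after reading the file, tends to be less error-prone than decoding each result just before printing, because every later step then works on text.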
Upvotes: 1
Reputation: 34288
The way file names are normalized depends on the file system and the OS. Your particular regex may not match the normalized form correctly. Hence, consider this solution by remram:
import fnmatch
import os
import sys
import unicodedata

def myglob(pattern, directory=u'.'):
    # Normalize the pattern so it compares equal to normalized file names
    pattern = unicodedata.normalize('NFC', pattern)
    results = []
    enc = sys.getfilesystemencoding()
    for name in os.listdir(directory):
        if isinstance(name, bytes):
            try:
                name = name.decode(enc)
            except UnicodeDecodeError:
                # Filenames that are not proper unicode won't match any pattern
                continue
        if fnmatch.filter([unicodedata.normalize('NFC', name)], pattern):
            results.append(name)
    return results
I faced a similar problem here: Filesystem independent way of using glob.glob and regular expressions with unicode filenames in Python
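As a small illustration of why normalization matters, the sketch below compares the two common ways of encoding ä. The helper name same_text is hypothetical. Some file systems (HFS+ on macOS, for example) are known to store names in a decomposed form, so a composed pattern typed by the user may not compare equal until both sides are normalized.

```python
import unicodedata

composed = u"\u00e4"      # 'ä' as a single code point (NFC form)
decomposed = u"a\u0308"   # 'a' followed by a combining diaeresis (NFD form)


def same_text(a, b, form="NFC"):
    """Compare two strings after bringing both to the same normal form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The two strings render identically but are different code point sequences.
print(composed == decomposed)           # False
print(same_text(composed, decomposed))  # True
```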
Upvotes: 0