Python re.findall fails at UTF-8 while rest of script succeeds

Question

I have this script that reads a large number of text files written in Swedish (frequently with the letters å, ä and ö). It prints everything just fine from the dictionary if I loop over d and dictionary[d]. However, the result of the regular expression (built from the raw input with u'.*' appended) is not displayed as proper UTF-8.

# -*- coding: utf8 -*-
from os import listdir 
import re
import codecs
import sys

print "Välkommen till SOU-sök!"
search_word = raw_input("Ange sökord: ")

dictionary = {}
for filename in listdir("20tal"):
    with open("20tal/" + filename) as currentfile:
        text = currentfile.read()
        dictionary[filename] = text

for d in dictionary:
    result = re.findall(search_word + u'.*', dictionary[d], re.UNICODE)
    if len(result) > 0:
        print "Filnament är:\n %s \noch sökresultatet är:\n %s" % (d, result)

Edit: The output is as follows:

If I input:

katt

I get the following output:

Filnament är: Betänkande och förslag angående vissa ekonomiska spörsmål   berörande enskilda järnvägar - SOU 1929:2.txt 

och sökresultatet är: 

['katter, r\xc3\xa4ntor m. m.', 'katter m- m., men exklusive r \xc3\xa4 nor m.', 'kattemedel subventionerar', av totalkostnaderna, ofta \xe2\x80\x94 med eller utan', 'kattas den nuvarande bilparkens kapitalv\xc3\xa4rde till 500 milj.

Here, the filename d is printed correctly, but the result of re.findall is not.

l'L'l · Accepted Answer

In Python 2.x, printing a list shows the repr() of each item, so non-ASCII characters appear as escape sequences unless you loop over the items or join them into a single string first. Try something like this:

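# Join the matches into one byte string so it can be decoded in one go.
result = ', '.join(result)

if len(result) > 0:
    # d is a byte string as well; decoding it here assumes UTF-8 filenames.
    print (u"Filnament är:\n %s \noch sökresultatet är:\n %s" % (d.decode('utf-8'), result.decode('utf-8')))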
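As a side note on why the list looks escaped in the first place, here is a minimal sketch. The two sample strings are illustrative (the first is copied from the question's output), and the names as_list and as_text are made up for this example. Bytes literals stand in for what file.read() returns in Python 2.

```python
# -*- coding: utf-8 -*-
# Illustrative byte strings, like the ones re.findall returns for undecoded file contents.
matches = [b"katter, r\xc3\xa4ntor m. m.", b"kattas till 500 milj."]

# Formatting a list calls repr() on each element, so the non-ASCII bytes
# are shown as \x escapes instead of being written out as characters.
as_list = repr(matches)

# Joining the items first and decoding the result once yields readable text.
as_text = b", ".join(matches).decode("utf-8")

print(as_list)
print(as_text)
```

The reported run of the join-based snippet from the answer follows.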

Input:

katt

Result:

katter, räntor m. m. katter m- m., men exklusive r ä nor m. kattemedel subventionerar av totalkostnaderna, ofta — med eller utan kattas den nuvarande bilparkens kapitalvärde till 500 milj
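
An alternative that avoids decoding at print time is to decode each file as it is read, so the pattern, the text and the matches are all unicode. The following is only a sketch under assumptions: the files are taken to be UTF-8, and search_files, demo_dir and sample.txt are hypothetical names introduced for the demo. In Python 2 the value from raw_input would also have to be decoded before being passed in.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals
import io
import os
import re
import tempfile


def search_files(folder, search_word):
    """Return {filename: [matches]} for every file containing search_word."""
    # re.escape keeps characters such as '.' in the search word literal.
    pattern = re.compile(re.escape(search_word) + ".*", re.UNICODE)
    hits = {}
    for filename in sorted(os.listdir(folder)):
        # io.open decodes while reading (assumption: the files are UTF-8).
        with io.open(os.path.join(folder, filename), encoding="utf-8") as f:
            matches = pattern.findall(f.read())
        if matches:
            hits[filename] = matches
    return hits


# Small self-contained demo with a temporary folder and one sample file.
demo_dir = tempfile.mkdtemp()
with io.open(os.path.join(demo_dir, "sample.txt"), "w", encoding="utf-8") as f:
    f.write("katter, räntor m. m.\nhundar\n")

for name, matches in search_files(demo_dir, "katt").items():
    # Join before printing so the items are not shown through repr().
    print("%s: %s" % (name, ", ".join(matches)))
```

Calling search_files("20tal", search_word) would correspond to the loop in the question, provided search_word is already a unicode string.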
