hyogapag

Reputation: 125

Convert a file of messed-up encoding type to something usable

I'm trying to clean up the contents of the page at the following link, which is the result of a SPARQL query:

http://www.rechercheisidore.fr/sparql/query?query=PREFIX+dcterms%3A+%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E+PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E+SELECT+%3Furicollection+%3Ftitrecollection+%3Fdescription+%3Fadresseweb+WHERE+{+%3Furicollection+%3Fpredicat+%3Chttp%3A%2F%2Fwww.rechercheisidore.fr%2Fclass%2FCollection%3E.+%3Furicollection+dcterms%3Atitle+%3Ftitrecollection.+%3Furicollection+dcterms%3Adescription+%3Fdescription.+%3Furicollection+foaf%3Ahomepage+%3Fadresseweb.+}+ORDER+BY+ASC%28%3Ftitrecollection%29+LIMIT+300&format=application%2Frdf%2Bxml

The page is in French. Accented letters are not displayed correctly, and when I try to replace the garbled characters with the correct ones in Python, I get errors. I tried converting the file to UTF-8, but that didn't solve anything (it is actually already in UTF-8), hence my suspicion that the encoding is messed up (an engineer from the website confirmed it is a bug in their triple store). An example: instead of Ã© you should see é.

I would like to end up with a file on which I could at least use the Python 2.7 str.replace() function to get back the correct characters. Or is there a better way to achieve this?

Sample from the RDF XML file demonstrating the problem:

<rdf:RDF xmlns:res="http://www.w3.org/2005/sparql-results#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:nodeID="rset">
<rdf:type rdf:resource="http://www.w3.org/2005/sparql-results#ResultSet" />
    <res:resultVariable>uricollection</res:resultVariable>
    <res:resultVariable>titrecollection</res:resultVariable>
    <res:resultVariable>description</res:resultVariable>
    <res:resultVariable>adresseweb</res:resultVariable>
    <res:solution rdf:nodeID="r0">
      <res:binding rdf:nodeID="r0c0"><res:variable>uricollection</res:variable><res:value rdf:resource="http://www.rechercheisidore.fr/resource/10670/3.ewe76u"/></res:binding>
      <res:binding rdf:nodeID="r0c1"><res:variable>titrecollection</res:variable><res:value>Actualités de l&#39;Ecole des Hautes Etudes en Sciences Sociales</res:value></res:binding>
      <res:binding rdf:nodeID="r0c2"><res:variable>description</res:variable><res:value>L&#39;Ãcole des hautes études en sciences sociales (EHESS), est issue de la transformation, en 1975, de la sixième section de l&#39;Ãcole pratique des hautes études, section de sciences économiques et sociales, fondée en 1947 par Lucien Febvre, Charles Morazé et Fernand Braudel. L&#39;EHESS occupe une place singulière dans le paysage français de la recherche. Elle forme des docteurs dans toutes les disciplines des sciences humaines et sociales, mais elle n&#39;est pas une université.</res:value></res:binding>
      <res:binding rdf:nodeID="r0c3"><res:variable>adresseweb</res:variable><res:value rdf:resource="http://www.ehess.fr"/></res:binding>
    </res:solution>

Upvotes: 1

Views: 239

Answers (2)

unutbu

Reputation: 879271

Corroboration of jwodder's solution:

import lxml.etree as ET
import urllib2

url = "http://www.rechercheisidore.fr/sparql/query?query=PREFIX+dcterms:+<http://purl.org/dc/terms/>+PREFIX+foaf:+<http://xmlns.com/foaf/0.1/>+SELECT+?uricollection+?titrecollection+?description+?adresseweb+WHERE+{+?uricollection+?predicat+<http://www.rechercheisidore.fr/class/Collection>.+?uricollection+dcterms:title+?titrecollection.+?uricollection+dcterms:description+?description.+?uricollection+foaf:homepage+?adresseweb.+}+ORDER+BY+ASC(?titrecollection)+LIMIT+300&format=application/rdf+xml"
doc = ET.parse(urllib2.urlopen(url))

namespaces = { 'ns':'http://www.w3.org/2005/sparql-results#', }

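# Select the literal value of every "description" binding.
for elt in doc.xpath('//ns:binding[@name="description"]/ns:literal',
                     namespaces=namespaces):
    text = elt.text
    if text is not None:
        # Undo the double encoding: recover the original UTF-8 bytes
        # via Latin-1, then decode those bytes as UTF-8.
        text = text.encode('latin-1').decode('utf_8')
        print(text)
    break  # only show the first result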

yields

L'École des hautes études en sciences sociales (EHESS), est issue de la transformation, en 1975, de la sixième section de l'École pratique des hautes études, section de sciences économiques et sociales, fondée en 1947 par Lucien Febvre, Charles Morazé et Fernand Braudel. L'EHESS occupe une place singulière dans le paysage français de la recherche. Elle forme des docteurs dans toutes les disciplines des sciences humaines et sociales, mais elle n'est pas une université.
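
The XPath above matches the SPARQL results XML layout (binding elements with a name attribute and a literal child). If the server instead returns the RDF/XML layout shown in the question (res:binding with res:variable and res:value children), the same round trip can be applied per value. This is only a rough sketch, written for Python 3 with the standard-library ElementTree; the helper name fixed_values and the trimmed-down SAMPLE string are made up for illustration and are not part of the original data.

```python
import xml.etree.ElementTree as ET

RES = "http://www.w3.org/2005/sparql-results#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical, trimmed-down document in the shape shown in the question.
# "\u00c3\u00a9" is what a double-encoded "é" looks like once parsed.
SAMPLE = (
    '<rdf:RDF xmlns:res="%s" xmlns:rdf="%s">'
    '<res:solution rdf:nodeID="r0">'
    '<res:binding rdf:nodeID="r0c2">'
    '<res:variable>description</res:variable>'
    '<res:value>des hautes \u00c3\u00a9tudes</res:value>'
    '</res:binding>'
    '</res:solution>'
    '</rdf:RDF>' % (RES, RDF)
)


def fixed_values(root, variable):
    """Return the repaired text of every res:value bound to `variable`."""
    repaired = []
    for binding in root.iter("{%s}binding" % RES):
        name = binding.findtext("{%s}variable" % RES)
        value = binding.findtext("{%s}value" % RES)
        # Skip other variables and values that carry no text
        # (for example those that only have an rdf:resource attribute).
        if name == variable and value:
            repaired.append(value.encode("latin-1").decode("utf-8"))
    return repaired


root = ET.fromstring(SAMPLE)
descriptions = fixed_values(root, "description")
assert descriptions == ["des hautes \u00e9tudes"]
```

Repairing each value separately means one value that was never double-encoded does not affect the others, although in this sketch such a value would still raise an exception that the caller has to handle.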

Upvotes: 3

jwodder

Reputation: 57470

The problem with the page appears to be that the server encoded the text as UTF-8, then misread those UTF-8 bytes as Latin-1 and encoded the result as UTF-8 again. To reverse this, read the file in as UTF-8, encode it as a Latin-1 string of bytes, and then decode the bytes as UTF-8.
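
As a minimal sketch of those three steps (written for Python 3; in Python 2.7 the same encode/decode calls work on unicode objects), something like the following might do. The names fix_double_encoding and fix_file are invented here for illustration, and the fallback of leaving text untouched when the round trip fails is my own assumption about what is desirable, not something the site's data guarantees.

```python
import io


def fix_double_encoding(text):
    """Undo a UTF-8 -> misread as Latin-1 -> UTF-8 double encoding.

    If the text does not survive the round trip (for example because it
    was never double-encoded), it is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def fix_file(src_path, dst_path):
    """Repair a UTF-8 file line by line and write the result as UTF-8."""
    with io.open(src_path, encoding="utf-8") as src, \
            io.open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(fix_double_encoding(line))


# "\u00c3\u00a9" is a double-encoded "é" as it appears after decoding.
garbled = "\u00c3\u00a9tudes"
assert fix_double_encoding(garbled) == "\u00e9tudes"
# Text that is already correct is left alone.
assert fix_double_encoding("\u00e9tudes") == "\u00e9tudes"
```

Working line by line is a design choice: a single line that is already correct then falls back to being copied as-is instead of making the whole file fail. A line that mixes correct and double-encoded characters would still be left unrepaired by this sketch.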

Upvotes: 4
