Tiger1

Reputation: 1377

How to effectively slice a UTF-8 encoded file

I'm having trouble slicing a UTF-8 encoded file. After opening it with codecs, slicing a portion is difficult because the byte order mark (BOM) at the beginning of the text shifts the slice.

See details of my attempts below.

import codecs

def readfiles(filepaf):
    # Decode the whole file as UTF-8 and collapse runs of whitespace.
    with codecs.open(filepaf, 'r', 'utf-8') as f:
        g = f.read()
    q = ' '.join(g.split())
    return q

q = readfiles('c:xxx')

q=Katharine opened her lips and drew in her breath, as if to reply with equal vigor, when the shutting of a door...

>>> q[0:100]
u'\ufeffKatharine opened her lips and drew in her breath, as if to reply with equal vigor, when the shuttin'


>>> q[0:100].encode('utf-8')
'\xef\xbb\xbfKatharine opened her lips and drew in her breath, as if to reply with equal vigor, when the shuttin'

The output only looks right when I print a sliced portion directly, but my program uses the sliced portions rather than printing them, and most of the slices are off because of the shift at the beginning.

Ideal output

Katharine opened her lips and drew in her breath, as if to reply with equal vigor, when the shuttin

Any suggestions on how to slice without getting the BOM at the beginning?

Upvotes: 2

Views: 341

Answers (1)

Joni

Reputation: 111309

Discard bytes whose two highest bits are 10 (UTF-8 continuation bytes) from the beginning of the slice until you find a byte that isn't one. That byte starts a new character. You'll have to skip at most 3 bytes, because a UTF-8 character is at most 4 bytes long.
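
To illustrate skipping continuation bytes, here is a minimal sketch. The helper name skip_continuation_bytes and the sample string are made up for this example, and the code assumes the slice was cut from valid UTF-8 data.

```python
def skip_continuation_bytes(chunk):
    """Drop leading UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    data = bytearray(chunk)  # gives integer bytes on both Python 2 and 3
    start = 0
    # In valid UTF-8 at most 3 continuation bytes can precede the next
    # character start, so the loop never needs to look further than that.
    while start < len(data) and start < 3 and (data[start] & 0xC0) == 0x80:
        start += 1
    return bytes(data[start:])


# Hypothetical sample: 'h', e-acute (2 bytes), 'llo ', euro sign (3 bytes).
raw = u'h\u00e9llo \u20ac'.encode('utf-8')

# Slicing at byte offset 2 lands in the middle of the e-acute.
broken = raw[2:]
fixed = skip_continuation_bytes(broken)
text = fixed.decode('utf-8')  # decodes cleanly once the stray byte is gone
```

The same idea applies to the end of a byte slice, where a trailing partial character would also need trimming; that part is not shown here.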

Alternatively, you can slice the Unicode string; that will not give you broken characters.
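
A small sketch of the difference, using a made-up sample string: slicing the decoded text works on whole characters, while slicing the encoded bytes can cut a multi-byte character in half.

```python
# Hypothetical sample containing two 2-byte characters.
raw = u'na\u00efve caf\u00e9'.encode('utf-8')

# Decode first, then slice: indices count characters.
text = raw.decode('utf-8')
text_slice = text[0:3]

# Slice the bytes directly: indices count bytes, so the third byte
# is only the first half of the i-diaeresis.
byte_slice = raw[0:3]

try:
    byte_slice.decode('utf-8')
    byte_slice_ok = True
except UnicodeDecodeError:
    byte_slice_ok = False
```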

Note that \ufeff is a valid character: it's the zero width no-break space, which some broken text editors insert at the beginning of UTF-8 files to identify them. If you want to skip it, use the utf-8-sig encoding.
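
As a rough sketch of the utf-8-sig suggestion: the function below mirrors the question's readfiles but decodes with utf-8-sig, which drops a leading BOM if one is present. The name read_without_bom and the temporary file are assumptions for the demonstration; io.open is used here, and it accepts the same encoding name as codecs.open.

```python
import io
import os
import tempfile


def read_without_bom(path):
    # 'utf-8-sig' strips a leading BOM if present and otherwise
    # behaves like plain 'utf-8'.
    with io.open(path, 'r', encoding='utf-8-sig') as f:
        return ' '.join(f.read().split())


# Build a throwaway file that starts with the UTF-8 BOM bytes.
fd, tmp_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'\xef\xbb\xbfKatharine opened her lips')

try:
    q = read_without_bom(tmp_path)
    first_word = q[0:9]
finally:
    os.remove(tmp_path)

# Decoding bytes directly shows the same difference.
with_bom = b'\xef\xbb\xbfKatharine'.decode('utf-8')
without_bom = b'\xef\xbb\xbfKatharine'.decode('utf-8-sig')
```

With the BOM gone at read time, slices such as q[0:100] start at the first real character of the text.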

Upvotes: 1
