How to effectively slice a UTF-8 encoded file

Question

I'm having trouble slicing a UTF-8 encoded file. After I open it with codecs, the byte order mark (BOM) at the start of the file ends up in the decoded string and shifts my slices.

See details of my attempts below.

import codecs

def readfiles(filepath):
    # Decode the file as UTF-8 and collapse every run of whitespace to one space.
    with codecs.open(filepath, 'r', 'utf-8') as f:
        g = f.read()
        q = ' '.join(g.split())
        return q

q = readfiles('c:xxx')

q=Katharine opened her lips and drew in her breath, as if to reply with equal vigor, when the shutting of a door...

>>> q[0:100]
u'\ufeffKatharine opened her lips and drew in her breath, as if to reply with equal vigor, when the shuttin'


>>> q[0:100].encode('utf-8')
'\xef\xbb\xbfKatharine opened her lips and drew in her breath, as if to reply with equal vigor, when the shuttin'

The output only looks right when I print a slice directly. My program uses the slices themselves rather than printing them, and those slices are usually off because of the shift at the beginning.

Ideal output

Katharine opened her lips and drew in her breath, as if to reply with equal vigor, when the shuttin

Any suggestions on how to slice without having BOM characters at the beginning?
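
For reference, one possible approach is to decode with Python's 'utf-8-sig' codec, which reads UTF-8 and skips a leading BOM if one is present. The sketch below is only an illustration under assumptions: the helper name read_collapsed and the temporary demo file are hypothetical, and the input is assumed to be UTF-8 with an optional BOM at the very start.

```python
import io
import os
import tempfile


def read_collapsed(path):
    """Read a UTF-8 file, skip a leading BOM if present, collapse whitespace."""
    # 'utf-8-sig' decodes UTF-8 and drops a leading U+FEFF if there is one.
    with io.open(path, 'r', encoding='utf-8-sig') as f:
        text = f.read()
    return u' '.join(text.split())


# Hypothetical demo file that starts with the UTF-8 BOM bytes EF BB BF.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(b'\xef\xbb\xbfKatharine opened her lips\nand drew in her breath')

q = read_collapsed(demo_path)
os.remove(demo_path)

# The slice now starts at the first real character rather than at the BOM.
first_nine = q[0:9]
```

If the text has already been decoded with plain 'utf-8', stripping the character afterwards with q.lstrip(u'\ufeff') should give the same result for a BOM at the start of the string.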
