Python Split not working properly

Question

I have the following code, which reads the lines of a file and splits each one on a specified delimiter. After the split, I need to write some specific fields to another file.

Sample Data:

Week49_A_60002000;Mar;FY14;Actual;Working;E_1000;PC_000000;4287.63

Code:

import os
import codecs
sfilename = "WEEK_RPT_1108" + os.extsep + "dat"
sfilepath = "Club" + "/" + sfilename
sbackupname = "Club" + "/" + sfilename + os.extsep + "bak"
try:
    os.unlink(sbackupname)
except OSError:
    pass

os.rename(sfilepath, sbackupname)

try:
    inputfile = codecs.open(sbackupname, "r", "utf-16-le")
    outputfile = codecs.open(sfilepath, "w", "utf-16-le")
    sdelimdatfile = ";"
    for line in inputfile:
        record = line.split(';')
        outputfile.write(record[1])
except IOError, err:
    pass

I can see that record[0] contains the whole line instead of the first field:

record[0] = Week49_A_60002000;Mar;FY14;Actual;Working;E_1000;PC_000000;4287.63

while printing record[1] fails with a list index out of range error. I am new to Python and would appreciate some help.

Thanks!

Serge Ballesta · Accepted Answer

After your comment saying that print line outputs u'\u6557\u6b65\u3934\u415f\u365f\u3030\u3230\u3030\u3b30\u614d\u3b72\u5946\u3431\u413b\u7463\u6175\u3b6c\u6f57\u6b72\u6e69\u3b67\u5f45\u3031\u3030\u503b\u5f43\u3030\u3030\u3030\u343b\u3832\u2e37\u3336', I can explain what happens and how to fix it.

What happens:

You have an ordinary 8-bit character file, and the line you show is even plain ASCII, but you try to decode it as if it were UTF-16 little endian. As a result, every two bytes are wrongly combined into a single 16-bit Unicode character. If your system had been able to display them correctly and you had printed line directly instead of repr(line), you would have seen 敗步㤴䅟㙟〰㈰〰㬰慍㭲奆㐱䄻瑣慵㭬潗歲湩㭧彅〱〰倻彃〰〰〰㐻㠲⸷㌶. Of course, none of those Unicode characters is the semicolon (;, \x3b or \u003b), so the line cannot be split on it.

But because the text is encoded back to the same bytes when you write record[0], you find the whole line intact in the new file, which misleads you into believing that the problem lies in the split function.
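The effect can be reproduced without any file, as a minimal sketch that assumes the sample line from the question is stored as plain ASCII bytes. The variable names (raw, wrong, right) are illustrative only:

```python
# The sample line as raw ASCII bytes (what is actually stored in the file).
raw = b"Week49_A_60002000;Mar;FY14;Actual;Working;E_1000;PC_000000;4287.63"

# Wrong: every pair of bytes is read as one UTF-16-LE code unit.
wrong = raw.decode("utf-16-le")
wrong_fields = wrong.split(u";")

# Right: decode with an encoding that matches the data.
right = raw.decode("ascii")
right_fields = right.split(u";")

print(len(wrong), len(wrong_fields))   # 33 characters, 1 field: no ';' was found
print(len(right), len(right_fields))   # 66 characters, 8 fields
print(right_fields[1])                 # Mar

# Encoding the garbled text back gives the original bytes again,
# which is why the output file looked like an unsplit copy of the input.
print(wrong.encode("utf-16-le") == raw)
```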

How to fix:

Just open the file normally, or use the correct encoding if it contains non-ASCII characters. Since you are using Python 2, I would simply do:

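try:
    # Plain open() reads the file as 8-bit strings, so no decoding step can garble it.
    inputfile = open(sbackupname, "r")
    outputfile = open(sfilepath, "w")
    sdelimdatfile = ";"
    for line in inputfile:
        record = line.split(sdelimdatfile)
        outputfile.write(record[1])  # second field, "Mar" for the sample line
except IOError, err:
    pass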

If you really need to use the codecs module, for example because the file contains UTF-8 or Latin-1 characters, you can replace the two open calls with:

encoding = "utf8"  # or "latin1" or whatever the actual encoding is...
inputfile = codecs.open(sbackupname, "r", encoding)
outputfile = codecs.open(sfilepath, "w", encoding)
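For completeness, here is a hedged sketch of the whole read-split-write loop with an explicit encoding. It uses io.open instead of codecs.open (my substitution, chosen because it behaves the same on Python 2.6+ and Python 3), closes both files through with blocks, and appends a newline after each field, which the original code does not do. The function name extract_field, the demo file names and the default encoding are all assumptions made for illustration:

```python
import io
import os
import tempfile


def extract_field(src_path, dst_path, index=1, delimiter=u";", encoding="utf-8"):
    """Write one delimited field per input line to dst_path (hypothetical helper)."""
    with io.open(src_path, "r", encoding=encoding) as src, \
         io.open(dst_path, "w", encoding=encoding) as dst:
        for line in src:
            record = line.rstrip(u"\r\n").split(delimiter)
            if len(record) > index:          # skip lines with too few fields
                dst.write(record[index] + u"\n")


# Demo on a throwaway file containing the sample line from the question.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "in.dat")
dst = os.path.join(workdir, "out.dat")
with io.open(src, "w", encoding="utf-8") as f:
    f.write(u"Week49_A_60002000;Mar;FY14;Actual;Working;E_1000;PC_000000;4287.63\n")

extract_field(src, dst)

with io.open(dst, "r", encoding="utf-8") as f:
    result = f.read()
print(repr(result))
```

The length check avoids the index error from the question when a line has fewer fields than expected. Whether such lines should be skipped or reported depends on the data.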
