frazman

Reputation: 33273

python : ValueError: invalid literal for int() with base 10: ' '

I have a text file which contains entries like

70154::308933::3
UserId::ProductId::Score

I wrote this program to read it:

def generateSyntheticData(fileName):
    dataDict = {}
    # rowDict = []
    innerDict = {}

    try:
        # for key in range(5):
        # count = 0
        myFile = open(fileName)
        c = 0
        #del innerDict[0:len(innerDict)]

        for line in myFile:
            c += 1
            #line = str(line)
            n = len(line)
            #print 'n: ',n
            if n is not 1:
                # if c%100 ==0: print "%d: "%c, " entries read so far"
                # words = line.replace(' ','_')
                words = line.replace('::',' ')

                words = words.strip().split()

                #print 'userid: ', words[0]
                userId = int( words[0]) # i get error here
                movieId = int (words[1])
                rating =float( words[2])
                print "userId: ", userId, " productId: ", movieId," :rating: ", rating
                #print words
                #words = words.replace('_', ' ')
                innerDict = dataDict.setdefault(userId,{})
                innerDict[movieId] = rating
                dataDict[userId] = (innerDict)
                innerDict = {}
    except IOError as (errno,strerror):
        print "I/O error({0}) :{1} ".format(errno,strerror)

    finally:
        myFile.close()
    print "total ratings read from file",fileName," :%d " %c
    return dataDict

But I get the error:

ValueError: invalid literal for int() with base 10: ''

The funny thing is that it works just fine when reading data in the same format from another file. Actually, while posting this question, I noticed something weird: in the entry 70154::308933::3, each character has a space in between, like 7 space 0 space 1 space 5 space 4 space :: space 3... But the text file looks fine; it only shows up this way when I copy and paste it. Any clue what's going on? Thanks.

Upvotes: 1

Views: 10608

Answers (2)

John Machin

Reputation: 82992

The "spaces" that you are seeing appear to be NULs ("\x00"). There is a 99.9% chance that your file is encoded in UTF-16, UTF-16LE, or UTF-16BE. If this is a one-off file, just open it with Notepad and save it as "ANSI", not "Unicode" and not "Unicode big endian". If, however, you need to process it as is, you'll need to know or detect what the encoding is. To find out which, do this:

print repr(open("yourfile.txt", "rb").read(20))

and compare the start of the output with the following:

>>> ucode = u"70154:"
>>> for sfx in ["", "LE", "BE"]:
...     enc = "UTF-16" + sfx
...     print enc, repr(ucode.encode(enc))
...
UTF-16 '\xff\xfe7\x000\x001\x005\x004\x00:\x00'
UTF-16LE '7\x000\x001\x005\x004\x00:\x00'
UTF-16BE '\x007\x000\x001\x005\x004\x00:'
>>>

You can make a detector that's good enough for your purposes by inspecting the first 2 bytes:

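def detect_encoding(f2b):
    # f2b is the first 2 bytes of the file
    if f2b in (b"\xff\xfe", b"\xfe\xff"):  # byte order mark (BOM) in either order
        return "UTF-16"
    elif f2b[1:2] == b"\x00":  # ASCII character first, then NUL
        return "UTF-16LE"
    elif f2b[0:1] == b"\x00":  # NUL first, then ASCII character
        return "UTF-16BE"
    else:
        return "cp1252"  # or UTF-8 or whatever else is prevalent in your neck of the woods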

You could avoid hard-coding the fallback encoding:

>>> import locale
>>> locale.getpreferredencoding()
'cp1252'

Your line-reading code will look like this:

rawbytes = open(fileName, "rb").read()
enc = detect_encoding(rawbytes[:2])
for line in rawbytes.decode(enc).splitlines():
    # whatever

Oh, and the lines will be unicode objects ... if that gives you a problem, ask another question.
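For anyone on Python 3 (the snippets in this answer are Python 2), here is a rough sketch of the same idea: sniff the first two bytes, pick a codec, decode, and split into lines. The names detect_encoding and read_lines and the "utf-8" fallback are my own choices for illustration, not anything the question requires, and the two-byte check is only a heuristic that assumes the file starts with a BOM or an ASCII character.

```python
def detect_encoding(first_two_bytes, fallback="utf-8"):
    """Guess the text encoding from the first two bytes of a file.

    Heuristic only: assumes the file starts with a BOM or an ASCII character.
    The fallback value is an assumption; pick whatever suits your data.
    """
    if first_two_bytes in (b"\xff\xfe", b"\xfe\xff"):
        return "utf-16"       # BOM present; the codec consumes it when decoding
    if first_two_bytes[1:2] == b"\x00":
        return "utf-16-le"    # ASCII byte first, NUL second
    if first_two_bytes[0:1] == b"\x00":
        return "utf-16-be"    # NUL first, ASCII byte second
    return fallback


def read_lines(path):
    """Read the whole file as bytes, decode it, and return a list of text lines."""
    with open(path, "rb") as f:
        raw = f.read()
    return raw.decode(detect_encoding(raw[:2])).splitlines()
```

In Python 3 the decoded lines are ordinary str objects, so int() and float() work on the split fields without any NULs getting in the way.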

Upvotes: 3

paxdiablo

Reputation: 882028

Debugging 101: simply change the line:

words = words.strip().split()

to:

words = words.strip().split()
print words

and see what comes out.

I will mention a couple of things. If you have the literal UserId::... in the file and you try to process it, it won't take kindly to trying to convert that to an integer.
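If a header line like that really is in the file, one way to cope is to parse each line in a small helper that returns None for anything that isn't three numeric fields. This is only a sketch in Python 3 syntax; parse_rating is a name I made up, and it assumes the UserId::ProductId::Score layout shown in the question.

```python
def parse_rating(line):
    """Return (user_id, product_id, score), or None if the line is not a data row."""
    parts = line.strip().split("::")
    if len(parts) != 3:
        return None  # blank line or unexpected layout
    try:
        return int(parts[0]), int(parts[1]), float(parts[2])
    except ValueError:
        # Non-numeric fields, e.g. the header row "UserId::ProductId::Score"
        return None
```

Be a bit careful with this: silently skipping bad lines can also hide a real problem in the data, so while debugging it's worth printing the lines that come back as None.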

And the ... unusual line:

if n is not 1:

I would probably write as:

if n != 1:
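The reason is that "is" tests whether two names refer to the same object, while != and == compare values. For small integers, CPython happens to reuse the same objects, so "n is not 1" appears to work, but the language doesn't promise that (and Python 3.8 and later warn about "is" with a literal). A small sketch in Python 3 syntax, with a made-up helper name:

```python
def same_value_and_identity(text):
    """Build two ints from the same text at runtime and compare them both ways."""
    a = int(text)
    b = int(text)
    return a == b, a is b


equal, identical = same_value_and_identity("70154")
print(equal)      # True: the values match
print(identical)  # implementation-dependent; do not rely on it
```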

If, as you indicate in your comment, you end up seeing:

['\x007\x000\x001\x005\x004\x00', '\x003\x000\x008\x009\x003\x003\x00', '3']

then I'd be checking your input file for binary (non-textual) data. You should never end up with that binary information if you're just reading text and trimming/splitting.

And because you state that the digits seem to have spaces between them, you should do a hex dump of the file to find out what's really in there. It may be a UTF-16 Unicode string, for example.
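If you don't have a hex dump tool handy, a few lines of Python will do. This is a rough sketch in Python 3 syntax; hex_dump is just a name I picked, and the layout (offset, hex bytes, printable characters) is one common convention rather than anything standard.

```python
def hex_dump(data, width=16):
    """Format bytes as lines of: offset, hex values, printable characters."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join("{:02x}".format(b) for b in chunk)
        # Show printable ASCII as-is and everything else (including NUL) as "."
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append("{:08x}  {:<{w}}  {}".format(
            offset, hex_part, text_part, w=width * 3 - 1))
    return "\n".join(lines)


# Usage, with a placeholder filename:
# with open("yourfile.txt", "rb") as f:
#     print(hex_dump(f.read(64)))
```

If the file is UTF-16, you'll see a 00 byte next to every digit, and possibly ff fe or fe ff at the very start.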

Upvotes: 2
