Reputation: 612
I am using the HIGGS dataset for my data mining project. While parsing the data in Python, I received the following error:
ValueError: invalid literal for float(): -8.854051232337951660e-
I am getting this error for many values of the same kind. I am using Apache Spark as the distributed environment.
This is the row from my dataset:
1.000000000000000000e+00,8.004817962646484375e-01,-3.643184900283813477e-01,-4.785313606262207031e-01,2.399173498153686523e+00,**-8.854051232337951660e-01**,1.204909682273864746e+00,-8.518521487712860107e-02,1.364478588104248047e+00,0.000000000000000000e+00,4.605550169944763184e-01,1.564514338970184326e-01,1.068501710891723633e+00,0.000000000000000000e+00,1.793796300888061523e+00,1.236290574073791504e+00,5.773849487304687500e-01,2.548224449157714844e+00,1.083405137062072754e+00,1.178002059459686279e-01,-1.116195082664489746e+00,0.000000000000000000e+00,8.484367132186889648e-01,1.113812208175659180e+00,9.878969192504882812e-01,5.820630192756652832e-01,4.325648546218872070e-01,1.004681587219238281e+00,8.518054485321044922e-01
I have checked, and there are no discrepancies in the data.
Can someone help me with this error message?
Upvotes: 3
Views: 2009
Reputation: 91149
According to
ValueError: invalid literal for float(): -8.854051232337951660e-
the parser splits up that value too early.
Thus, you should look at what the items look like once the line has been split.
So try
for x in line.split(','):
    # Show the raw token first, so it is visible even if float() raises on it
    print(repr(x))
    print(repr(float(x)))
and you'll see what happens for each item.
Personally, I have no idea why this might happen, other than a corrupted data file that has a line break or comma where it shouldn't.
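To avoid reading every token by eye, the same check can be wrapped so that it reports only the tokens that fail. This is a minimal sketch, assuming the line is already available as a string; `find_bad_fields` and the sample line are hypothetical and not part of Spark or the dataset tooling.

```python
def find_bad_fields(line, sep=","):
    """Return (index, token) pairs for fields that float() rejects."""
    bad = []
    for i, token in enumerate(line.split(sep)):
        try:
            float(token)
        except ValueError:
            bad.append((i, token))
    return bad


# Hypothetical line in which one field was cut off after the exponent sign.
sample = "1.0e+00,-8.854051232337951660e-,2.5e-01"
for index, token in find_bad_fields(sample):
    print(index, repr(token))
```

Printing the `repr` of each failing token makes stray whitespace or newline characters visible, which is usually the quickest way to see where the split went wrong.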
Upvotes: 0
Reputation: 5381
As the exception suggests,
-8.854051232337951660e- is not a valid float in Python.
In particular, scientific notation is fine, but it needs digits after that e-; as parsed, your data is malformed. The following would be acceptable:
-8.854051232337951660e-01
Or from the docs if you prefer
Some examples of floating point literals:
3.14 10. .001 1e100 3.14e-10 0e0
A value that ends in e- with no digits after it does not mean anything. Without the e, Python can treat the literal as complete; with one or more digits after it, Python can expand the scientific notation.
If the data looks fine to you but Python can't seem to figure out what's (supposed to be) going on, check for subtle mis-formatting, such as blank space between the e and the next digit.
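The cases above can be checked directly against `float()`. A small sketch; `is_float` is a hypothetical helper, and the shortened values are made up for illustration.

```python
def is_float(text):
    """Return True if float() accepts text, False otherwise."""
    try:
        float(text)
        return True
    except ValueError:
        return False


print(is_float("-8.854051232337951660e-01"))  # complete exponent
print(is_float("-8.854051232337951660"))      # no exponent at all
print(is_float("-8.854051232337951660e-"))    # exponent sign with no digits
print(is_float("-8.85e- 01"))                 # blank space inside the exponent
print(is_float("  -8.85e-01\n"))              # whitespace only around the value
```

Note that `float()` ignores leading and trailing whitespace, so only whitespace inside the number itself causes the error.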
In response to edit
That last point is key. The data looks good to you but Python complains; that's because how you're "parsing" in Python doesn't align with how you're parsing with your eyes and brain. What are you using to parse the data? Do you split by comma? Do you split where digits start (that would cause problems)? The exception is raised as described above; for you, the problem is tracking down why the digits after the e- are being cropped off in your parse. (By the way, that sounds like a new question to me, not a continuation of this one.)
For example, in your newly posted code, there appears to be a newline after the "e-" and before the "01". If that's just my browser, then... oh well. If not, then that is your problem.
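One way to test the stray-newline theory is to count fields per physical line: a record broken in the middle shows up as two short lines instead of one full one. A sketch, assuming every record should have 29 comma-separated fields (the row in the question has 29); `short_lines` and the three-field sample are hypothetical.

```python
def short_lines(text, expected_fields=29, sep=","):
    """Return (line_number, field_count) for lines with an unexpected field count."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        count = len(line.split(sep))
        if count != expected_fields:
            problems.append((number, count))
    return problems


# Hypothetical three-field records; the second record is broken after "e-".
sample = "1.0,2.0e-01,3.0\n4.0,5.0e-\n01,6.0\n"
print(short_lines(sample, expected_fields=3))
```

If this reports pairs of adjacent short lines, the file itself contains the break; if every line has the expected count, the truncation is happening somewhere in the parsing code instead.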
To skip the erroneous entries, wrap the conversion in try/except (tl;dr: it's better to ask forgiveness than permission).
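A minimal sketch of that try/except approach, assuming each record is one comma-separated string; `parse_row` and the sample lines are hypothetical, not the asker's actual code.

```python
def parse_row(line, sep=","):
    """Parse one delimited line into floats, or return None if any field is invalid."""
    try:
        return [float(token) for token in line.split(sep)]
    except ValueError:
        return None


# Hypothetical input: the middle line has a truncated exponent.
lines = ["1.0e+00,8.0e-01", "1.0e+00,-8.85e-", "0.0e+00,2.5e-01"]
rows = [row for row in map(parse_row, lines) if row is not None]
print(rows)
```

In Spark, the same function could probably be applied with the RDD `map` method followed by a `filter` that drops the `None` results. Skipping rows silently hides the underlying parsing problem, so it is worth counting how many are dropped.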
Upvotes: 2