user2129623

Reputation: 2257

ValueError: could not convert string to float: '1,141'

I am reading data from a CSV file to perform feature elimination. Here is what the data looks like:

shift_id    user_id status  organization_id location_id department_id   open_positions  city    zip role_id specialty_id    latitude    longitude   years_of_experience
0   2   9   S   1   1   19  1   brooklyn    48001.0 2.0 9.0 42.643  -82.583 NaN
1   6   60  S   12  19  20  1   test    68410.0 3.0 7.0 40.608  -95.856 NaN
2   9   61  S   12  19  20  1   new york    48001.0 1.0 7.0 42.643  -82.583 NaN
3   10  60  S   12  19  20  1   test    68410.0 3.0 7.0 40.608  -95.856 NaN
4   21  3   S   1   1   19  1   pune    48001.0 1.0 2.0 46.753  -89.584 0.0

Here is my code:

import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

dataset = pd.read_csv("data.csv",header = 0)
data = pd.read_csv("data.csv",header = 1)
target = dataset.location_id
#dataset.head()
svm = LinearSVC()
rfe = RFE(svm, 3)
rfe = rfe.fit(data, target)
print(rfe.support_)
print(rfe.ranking_)

But I am getting this error:

ValueError: could not convert string to float: '1,141'

There is no string like this in my data.

There are some empty cells, so I tried to use:

result.fillna(0, inplace=True)

This gave the following error:

ValueError: Expected 2D array, got scalar array instead:
array=None.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Any suggestions on how to preprocess this data correctly?

Here is a link to sample data: https://gist.github.com/karimkhanp/6db4f9f9741a16e46fc294b8e2703dc7

Upvotes: 2

Views: 2988

Answers (3)

DirtyBit

Reputation: 16772

'1,141' is not a valid float.

To convert it to a float, first turn it into a valid representation by replacing the , with a ., and then cast it to float:

bad_float = '1,141'

print(float(bad_float.replace(",",".")))

OUTPUT:

1.141

EDIT:

As noted by @ShadowRanger, this assumes the comma was meant as a decimal point. If the comma is actually supposed to be a separator (for example for digit grouping, to make the value more human readable), you could split on it instead:

comm_sep = '1,141'

res = comm_sep.split(",")

print(float(res[0]), float(res[1]))

OUTPUT:

1.0 141.0

EDIT 2:

The OP resolved the issue by explicitly changing the column type to a number in the CSV file editor.

Upvotes: 0

Valdi_Bo

Reputation: 30971

Your question contains result.fillna(0, inplace=True).

But since result appears nowhere before, it is not clear what its value is (probably a scalar).
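If the intention was to fill the empty cells of the DataFrame read from the CSV, a minimal sketch (assuming the single data frame is called data, as in the version suggested below) would be:

import pandas as pd

data = pd.read_csv("data.csv", header=0)
data.fillna(0, inplace=True)   # fills NaN cells in place and returns None
# or, equivalently, without inplace:
# data = data.fillna(0)

Note that fillna(0, inplace=True) returns None, which would explain the array=None in your error message if that return value was later passed on to scikit-learn.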

There is another weird detail in your code. Look at:

dataset = pd.read_csv("prod_data_for_ML.csv",header = 0)
data = pd.read_csv("prod_data_for_ML.csv",header = 1)

Note that you read twice, from the same file, but:

  • the first time you read with header = 0, so, as the documentation states, column names are inferred from the first line,
  • the second time you read with header = 1.

Is this your intention? Or should header be the same in both calls?

One more remark: reading the same file twice is (in my opinion) unnecessary. Maybe your code should look like this:

data = pd.read_csv("prod_data_for_ML.csv",header = 0)
target = data.location_id

Edit

As I understood from your comments, you want:

  • the first table - dataset - with the first column (shift_id),
  • the second table - data - without this column.

Then your code should contain:

dataset = pd.read_csv("data.csv",header = 0)  # Read the whole source file, reading column names from the starting row
data = dataset.drop(columns='shift_id')       # Copy dropping "shift_id" column
...

Note that header=1 does not "skip" any column; it only states from which source row the column names are read (a quick check is shown after the list below). In this case:

  • Row No 0 (the starting row, containing the actual column names) is skipped.
  • Column names are read from the next row (due to header=1), which actually contains the first row of data.
  • Only the remaining rows are read into the target table.
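A quick way to see the difference, assuming the file name data.csv used in the question:

import pandas as pd

# header=0: the real column names are used
print(pd.read_csv("data.csv", header=0).columns.tolist())
# header=1: the first data row is (mis)used as column names
print(pd.read_csv("data.csv", header=1).columns.tolist())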

If you want to "skip" some source columns, call read_csv with usecols parameter, but it specifies which columns to read (not to skip).

So, assuming that your source file has 14 columns (numbered from 0 to 13) and you want to omit only the first one (number 0), you could write usecols=[*range(1, 14)] (note that the upper limit, 14, is not included in the range).
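A minimal sketch of that approach, assuming the file name data.csv and the 14 columns shown in the question:

import pandas as pd

# read all columns except column 0 (shift_id)
data = pd.read_csv("data.csv", header=0, usecols=[*range(1, 14)])
print(data.columns.tolist())   # shift_id should no longer appear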

And one more remark concerning your data sample: the first column is the index, without any name. shift_id is the next column, so to avoid confusion, you should indent the header row.

Note that the city column is at position 8 in your header, but in the data rows the city values (brooklyn, test) are at position 9. So the "title" row (with the column names) should be indented.

Edit 2

Look at your comment on the question, written 2019-02-14 12:40:19Z. It contains a row like this:

"1,141","1,139",A,14,24,77,1,OWINGS MILLS,"21117"

It shows that the first two columns (shift_id and user_id) contain string representations of a float, but with a comma instead of a dot.

You can cope with this problem using your own converter function, e.g.:

def cnvToFloat(x):
    return float(x.replace(',', '.'))

and call read_csv, passing this function in the converters parameter for such ill-formatted columns, e.g.:

dataset = pd.read_csv("data.csv", header = 0, 
    converters={'shift_id': cnvToFloat, 'user_id': cnvToFloat})

Upvotes: 1

Sergey Bushmanov

Reputation: 25189

The solution to your ValueError: could not convert string to float: '1,141' is to use the thousands parameter in pd.read_csv():

dataset = pd.read_csv("data.csv",header = 0, thousands= r",")
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 14 columns):
shift_id                3 non-null int64
user_id                 3 non-null int64
status                  3 non-null object
organization_id         3 non-null int64
location_id             3 non-null int64
department_id           3 non-null int64
open_positions          3 non-null int64
city                    3 non-null object
zip                     3 non-null int64
role_id                 3 non-null int64
specialty_id            2 non-null float64
latitude                3 non-null float64
longitude               3 non-null float64
years_of_experience     3 non-null object
dtypes: float64(3), int64(8), object(3)
memory usage: 416.0+ bytes

Upvotes: 3
