Reputation: 2257
I am reading data from a CSV file to perform feature elimination. Here is what the data looks like:
shift_id user_id status organization_id location_id department_id open_positions city zip role_id specialty_id latitude longitude years_of_experience
0 2 9 S 1 1 19 1 brooklyn 48001.0 2.0 9.0 42.643 -82.583 NaN
1 6 60 S 12 19 20 1 test 68410.0 3.0 7.0 40.608 -95.856 NaN
2 9 61 S 12 19 20 1 new york 48001.0 1.0 7.0 42.643 -82.583 NaN
3 10 60 S 12 19 20 1 test 68410.0 3.0 7.0 40.608 -95.856 NaN
4 21 3 S 1 1 19 1 pune 48001.0 1.0 2.0 46.753 -89.584 0.0
Here is my code -
import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

dataset = pd.read_csv("data.csv",header = 0)
data = pd.read_csv("data.csv",header = 1)
target = dataset.location_id
#dataset.head()
svm = LinearSVC()
rfe = RFE(svm, 3)
rfe = rfe.fit(data, target)
print(rfe.support_)
print(rfe.ranking_)
But I am getting this error
ValueError: could not convert string to float: '1,141'
There is no string like this in my data. There are some empty cells, so I tried:
result.fillna(0, inplace=True)
This gave the following error:
ValueError: Expected 2D array, got scalar array instead:
array=None.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Any suggestions on how to preprocess this data correctly?
Here is a link to sample data: https://gist.github.com/karimkhanp/6db4f9f9741a16e46fc294b8e2703dc7
Upvotes: 2
Views: 2988
Reputation: 16772
'1,141' is not a valid float. To convert it to a float, first make it a valid representation by replacing the comma (,) with a dot (.), and then cast it to float:
bad_float = '1,141'
print(float(bad_float.replace(",",".")))
OUTPUT:
1.141
EDIT:
As noted by @ShadowRanger, this assumes the comma is a decimal separator. If it is actually there to separate digit groupings (to make the number more human readable) or to delimit two values, replacing it with a dot gives the wrong number; in the latter case you can split on the comma instead:
comm_sep = '1,141'
res = comm_sep.split(",")
print(float(res[0]), float(res[1]))
OUTPUT:
1.0 141.0
EDIT 2:
The issue was resolved by the OP, who explicitly changed the column type to a number in the CSV file editor.
Upvotes: 0
Reputation: 30971
Your question contains result.fillna(0, inplace=True), but since result appears nowhere before, it is not clear what its value is (probably a scalar).
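If the intention was to fill the empty cells of the frame you actually read, call fillna on that DataFrame; a minimal sketch, assuming dataset is the result of your read_csv call:
dataset = pd.read_csv("data.csv", header=0)
dataset.fillna(0, inplace=True)  # replaces NaN (empty cells) with 0 in the whole DataFrame, in place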
Another weird detail in your code. Look at:
dataset = pd.read_csv("prod_data_for_ML.csv",header = 0)
data = pd.read_csv("prod_data_for_ML.csv",header = 1)
Note that you read twice from the same file, but:
- in the first call you pass header = 0, so, as the documentation states, column names are inferred from the first line,
- in the second call you pass header = 1, so column names are taken from the second line instead.
Is this your intention? Or maybe in both calls header should be the same?
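To see the difference, you can compare the column names each call produces; a minimal sketch (file name as in your snippet):
print(pd.read_csv("prod_data_for_ML.csv", header=0).columns)  # names taken from the first source row
print(pd.read_csv("prod_data_for_ML.csv", header=1).columns)  # names taken from the second source row; the first data row is consumed as the header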
And one more remark: reading the same file twice is (in my opinion) unnecessary. Maybe your code should be like this:
data = pd.read_csv("prod_data_for_ML.csv",header = 0)
target = data.location_id
As I understood from your comments, you want:
- dataset - with the first column (shift_id),
- data - without this column.
Then your code should contain:
dataset = pd.read_csv("data.csv",header = 0) # Read the whole source file, reading column names from the starting row
data = dataset.drop(columns='shift_id') # Copy dropping "shift_id" column
...
Note that header=1 does not "skip" any column; it only states from which source row to read column names. In this case the column names would be taken from the second source row (header=1), which actually contains the first row of data.
If you want to "skip" some source columns, call read_csv with the usecols parameter, but note that it specifies which columns to read (not which to skip).
So, assuming that your source file has 14 columns (numbered from 0 to 13),
and you want to omit only the first (number 0), you could write
usecols=[*range(1, 14)]
(note that the upper limit (14) is not
included in the range).
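For example, a minimal sketch assuming the 14-column layout above (using the data.csv name from the question):
data = pd.read_csv("data.csv", header=0, usecols=[*range(1, 14)])  # read columns 1..13, i.e. skip shift_id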
And one more remark concerning your data sample: the first column is the index, without any name, and shift_id is the next column. Note that the city column is at position 8 in your header, but in the data rows the city values (brooklyn, test) are at position 9. So to avoid confusion the "title" row (column names) should be indented.
Look at your comment to the question, written 2019-02-14 12:40:19Z. It contains a row like this:
"1,141","1,139",A,14,24,77,1,OWINGS MILLS,"21117"
It shows that the first 2 columns (shift_id and user_id) contain string representations of a float, but with a comma instead of a dot.
You can cope with this problem using your own converter function, e.g.:
def cnvToFloat(x):
    return float(x.replace(',', '.'))
and call read_csv passing this function in the converters parameter for such "required" (ill-formatted) columns, e.g.:
dataset = pd.read_csv("data.csv", header = 0,
converters={'shift_id': cnvToFloat, 'user_id': cnvToFloat})
Upvotes: 1
Reputation: 25189
The solution to your ValueError: could not convert string to float: '1,141' is to use the thousands parameter in your pd.read_csv():
dataset = pd.read_csv("data.csv",header = 0, thousands= r",")
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 14 columns):
shift_id 3 non-null int64
user_id 3 non-null int64
status 3 non-null object
organization_id 3 non-null int64
location_id 3 non-null int64
department_id 3 non-null int64
open_positions 3 non-null int64
city 3 non-null object
zip 3 non-null int64
role_id 3 non-null int64
specialty_id 2 non-null float64
latitude 3 non-null float64
longitude 3 non-null float64
years_of_experience 3 non-null object
dtypes: float64(3), int64(8), object(3)
memory usage: 416.0+ bytes
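With the numbers parsed correctly, the rest of your snippet should run once the non-numeric parts are handled. A minimal sketch, assuming you fill empty cells with 0 (as you attempted) and drop the target plus the non-numeric status and city columns before fitting; the exact columns to drop are an assumption, not something stated in your question:
import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

dataset = pd.read_csv("data.csv", header=0, thousands=",")
dataset = dataset.fillna(0)                                          # fill empty cells, as attempted in the question
target = dataset.location_id
features = dataset.drop(columns=["location_id", "status", "city"])   # assumption: keep numeric feature columns only
# any remaining non-numeric column (e.g. years_of_experience in the sample) may also need cleaning
rfe = RFE(LinearSVC(), n_features_to_select=3)                       # select 3 features, as in the question
rfe = rfe.fit(features, target)
print(rfe.support_)
print(rfe.ranking_)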
Upvotes: 3