Reputation: 2257
I am trying to learn numerical analysis. I am following this article: http://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html
My data looks like this:
date hr_of_day vals
2014-05-01 0 72
2014-05-01 1 127
2014-05-01 2 277
2014-05-01 3 411
2014-05-01 4 666
2014-05-01 5 912
2014-05-01 6 1164
2014-05-01 7 1119
2014-05-01 8 951
2014-05-01 9 929
2014-05-01 10 942
2014-05-01 11 968
2014-05-01 12 856
2014-05-01 13 835
2014-05-01 14 885
2014-05-01 15 945
2014-05-01 16 924
2014-05-01 17 914
2014-05-01 18 744
2014-05-01 19 377
2014-05-01 20 219
2014-05-01 21 106
2014-05-01 22 56
2014-05-01 23 43
2014-05-02 0 61
For a given date and hour, I want to predict vals and identify the pattern.
I have written this code:
import pandas as pd
from sklearn import datasets
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
# read the data in
Train = pd.read_csv("data_scientist_assignment.tsv")
#print df.head()
x1=["date", "hr_of_day", "vals"]
#print x1
#print df[x1]
test=pd.read_csv("test.tsv")
model = LogisticRegression()
model.fit(Train[x1], Train["vals"])
print(model)
print model.score(Train[x1], Train["vals"])
print model.predict_proba(test[x1])
I am getting this error:
KeyError: "['date' 'hr_of_day' 'vals'] not in index"
What is the issue? Is there a better way to do this?
test file format:
date hr_of_day
2014-05-01 0
2014-05-01 1
2014-05-01 2
2014-05-01 3
2014-05-01 4
2014-05-01 5
2014-05-01 6
2014-05-01 7
Full error stack trace:
Traceback (most recent call last):
File "socratis.py", line 16, in <module>
model.fit(Train[x1], Train["vals"])
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 1986, in __getitem__
return self._getitem_array(key)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 2030, in _getitem_array
indexer = self.ix._convert_to_indexer(key, axis=1)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.py", line 1210, in _convert_to_indexer
raise KeyError('%s not in index' % objarr[mask])
KeyError: "['date' 'hr_of_day' 'vals'] not in index"
Upvotes: 1
Views: 154
Reputation: 740
I suggest passing the sep='\t' parameter when reading the TSV file:
Train = pd.read_csv("data_scientist_assignment.tsv", sep='\t') # use TAB as column separator
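To see why the KeyError appears, here is a minimal sketch that parses a small in-memory sample with and without sep='\t'. The `sample` string is a stand-in built from a few rows of the question, and it assumes the real file is tab-separated, as the .tsv extension suggests.

```python
import io

import pandas as pd

# A few rows from the question, joined with tabs (stand-in for the real file).
sample = "date\thr_of_day\tvals\n2014-05-01\t0\t72\n2014-05-01\t1\t127\n"

# The default separator is a comma, so every line is read as one single column.
wrong = pd.read_csv(io.StringIO(sample))
print(list(wrong.columns))

# With sep="\t" the three columns are parsed separately.
right = pd.read_csv(io.StringIO(sample), sep="\t")
print(list(right.columns))
```

In the first case the frame has one column whose name is the whole header line, so selecting "date", "hr_of_day" or "vals" by name fails with "not in index".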
When you fix this, there is another problem in the queue: ValueError: could not convert string to float: '2014-09-13'
This is because scikit-learn estimators such as LogisticRegression expect numeric features, and the date column holds strings.
You can introduce a new timestamp column by converting the date to a timestamp (seconds since the epoch) and use it as a feature:
Train['timestamp'] = pd.to_datetime(Train['date']).apply(lambda a: a.timestamp())
x1=["timestamp", "hr_of_day", "vals"]
From an ML perspective, you shouldn't use your target value vals as an input feature. You should also consider representing the date as individual features: day, month, year, or day of week, depending on what you want to model.
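A rough sketch of that idea, assuming a tab-separated file with the question's three columns. The `sample` string and the names `train`, `feature_cols`, `X` and `y` are illustrative, not from the original code. It derives calendar features from the date and keeps vals out of the feature matrix.

```python
import io

import pandas as pd

# Two rows from the question, tab-separated (stand-in for the real file).
sample = "date\thr_of_day\tvals\n2014-05-01\t0\t72\n2014-05-02\t0\t61\n"
train = pd.read_csv(io.StringIO(sample), sep="\t", parse_dates=["date"])

# Derive numeric calendar features from the parsed date column.
train["day"] = train["date"].dt.day
train["month"] = train["date"].dt.month
train["year"] = train["date"].dt.year
train["day_of_week"] = train["date"].dt.dayofweek  # Monday=0 ... Sunday=6

# Features exclude the target; vals is used only as y.
feature_cols = ["hr_of_day", "day_of_week", "day", "month", "year"]
X = train[feature_cols]
y = train["vals"]
print(X)
```

X and y can then be passed to model.fit(X, y). The same feature columns have to be built for the test file, which has no vals column.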
Upvotes: 1