Reputation: 75
I can load a data set from scikit-learn using:
from sklearn import datasets
data = datasets.load_boston()
print(data)
What I'd like to do is write this data set to a flat file (.csv). Using the open() function,
f = open('boston.txt', 'w')
f.write(str(data))
works, but includes the description of the data set.
I'm wondering if there is some way that I can generate a simple .csv
with headers from this Bunch object so I can move it around and use it elsewhere.
Upvotes: 2
Views: 9141
Reputation: 63
scikit-learn ships several toy datasets, such as the Iris and Boston datasets. Let's load the Boston dataset:
from sklearn import datasets
boston = datasets.load_boston()
What type of object is this? If we examine its type, we see that this is a scikit-learn Bunch object.
print(type(boston))
Output:
<class 'sklearn.utils.Bunch'>
A scikit-learn Bunch object is a kind of dictionary. So, we should treat it as such. We can use dictionary methods. Let's look at the keys:
print(boston.keys())
Output:
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
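To make the "kind of dictionary" point concrete, here is a minimal sketch of the idea. MiniBunch is a hypothetical stand-in written for illustration, not scikit-learn's actual class, and the values in it are made up. It only shows that the same entry can be reached by key or by attribute:

```python
# Hypothetical stand-in for sklearn.utils.Bunch: a dict whose keys
# can also be read as attributes. Not the library's actual code.
class MiniBunch(dict):
    def __getattr__(self, key):
        # Only called when normal attribute lookup fails,
        # so dict methods such as keys() keep working.
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)

# Made-up values, only to show the access patterns.
toy = MiniBunch(
    data=[[0.1, 2.0], [0.3, 4.0]],
    target=[10.0, 20.0],
    feature_names=["A", "B"],
)

print(list(toy.keys()))      # dictionary-style inspection
print(toy["feature_names"])  # key access
print(toy.target)            # attribute access to the same entry
```

Because the object behaves like a dictionary, boston['data'] and boston.data should refer to the same array.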
Here we are interested in the data, feature_names and target keys. We will import the pandas module and use these keys to create a pandas DataFrame.
import pandas as pd
df = pd.DataFrame(data=boston['data'], columns=boston['feature_names'])
We should also add the target variable, the quantity we are trying to predict, to the DataFrame. Its name is given in the "DESCR" entry. We can
print(boston["DESCR"])
and read the full description of the dataset.
In the description we see that the name of the target variable is MEDV. Now, we can add the target variable to the DataFrame:
df['MEDV'] = boston['target']
Only one step is left: exporting the DataFrame to a CSV file without the index column:
df.to_csv("scikit_learn_boston_dataset.csv", index=False)
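If pandas is not available, the same steps (header from feature_names, one extra column for the target) can be sketched with the standard csv module. This is an illustration under assumptions: bunch_to_csv is a hypothetical helper, and the toy dictionary stands in for the Bunch, since load_boston was removed in scikit-learn 1.2 and may not exist in your installation:

```python
import csv
import io

def bunch_to_csv(bunch, target_name, fileobj):
    """Write the feature columns plus one target column as CSV with a header row."""
    writer = csv.writer(fileobj, lineterminator="\n")
    # Header: feature names followed by the name chosen for the target.
    writer.writerow(list(bunch["feature_names"]) + [target_name])
    # One CSV row per sample: its features, then its target value.
    for features, target in zip(bunch["data"], bunch["target"]):
        writer.writerow(list(features) + [target])

# Made-up stand-in with the same three keys used above.
toy = {
    "data": [[0.1, 2.0], [0.3, 4.0]],
    "target": [10.0, 20.0],
    "feature_names": ["A", "B"],
}

buffer = io.StringIO()
bunch_to_csv(toy, "Y", buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of a string buffer, pass a file opened with open(path, "w", newline="") as fileobj.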
BONUS: The Iris loader has additional parameters that we can use (see the load_iris documentation). The following code creates the DataFrame automatically, with the target variable included:
iris = datasets.load_iris(as_frame=True)
df = iris["frame"]
Note: If we print(iris.keys()), we can see the 'frame' key:
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
BONUS2: If we print(boston["filename"]) or print(iris["filename"]), we can see the physical locations of the csv files of these datasets. For instance:
C:\Users\user\anaconda3\lib\site-packages\sklearn\datasets\data\boston_house_prices.csv
Upvotes: 2
Reputation: 1
Just wanted to modify the reply by adding that you should probably include the target variable--"MV"--as well. Added an additional line below:
from sklearn import datasets
import pandas as pd
data = datasets.load_boston()
print(data)
df = pd.DataFrame(data=data['data'], columns = data['feature_names'])
df['MV'] = data['target']  # added line: include the target column
df.to_csv('boston.txt', sep = ',', index = False)
Upvotes: 0
Reputation: 39930
data = datasets.load_boston() will generate a dictionary-like object. In order to write the data to a .csv file you need the actual data, data['data'], and the columns, data['feature_names']. You can use these to generate a pandas DataFrame and then use to_csv() to write the data to a file:
from sklearn import datasets
import pandas as pd
data = datasets.load_boston()
print(data)
df = pd.DataFrame(data=data['data'], columns = data['feature_names'])
df.to_csv('boston.txt', sep = ',', index = False)
and the output boston.txt should be:
CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98
0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14
0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
...
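A quick way to confirm that the header row survived is to read the file back. The sketch below uses the standard csv module on the first rows of the output shown above, inlined as a string so it runs on its own. For the real file you would open boston.txt instead:

```python
import csv
import io

# First rows of the output shown above, inlined so the sketch is self-contained.
sample = (
    "CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT\n"
    "0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98\n"
    "0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14\n"
)

# DictReader takes the first line as the column names.
reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)

print(reader.fieldnames)                   # column names taken from the header
print(rows[0]["CRIM"], rows[0]["LSTAT"])   # values come back as strings
```

Note that the csv module returns every value as a string, so numeric columns need converting (for example with float()) before use.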
Upvotes: 8