cerebrou

Reputation: 5540

How to load sklearn datasets manually?

I would like to load a larger dataset from the sklearn datasets (California housing prices). The default command does not work for me due to proxy issues (the downloaded dataset is corrupted).

d = datasets.fetch_california_housing()

After downloading the dataset manually, I placed the files in the directory returned by datasets.get_data_home() (/home/username/scikit_learn_data/), and also in the cal_housing and CaliforniaHousing subfolders of that directory, to cover all the options. I also tried specifying a custom location with the data_home parameter.

d = datasets.fetch_california_housing(data_home='/home/username/scikit_learn_data/')

Nothing works.

How can I load the dataset manually?

Note: To test if the manual loading works, please set download_if_missing=False

Upvotes: 1

Views: 6504

Answers (4)

Jeremy Feng

Reputation: 365

Oceansunfish's answer does work for me.

However, you may need to change this line

from sklearn.datasets.base import _pkl_filepath, get_data_home

to

from sklearn.datasets._base import _pkl_filepath, get_data_home

because the module was renamed in newer scikit-learn versions (sklearn.datasets.base became sklearn.datasets._base around version 0.22).

I would rather have added this as a comment below the original answer, but I don't have enough reputation to do so.

Upvotes: 0

Ahmad Javan

Reputation: 41

import pandas as pd
from sklearn.datasets import california_housing

data = california_housing.fetch_california_housing()
calf_hous_df = pd.DataFrame(data=data.data, columns=data.feature_names)
calf_hous_df.head(5)

Out[105]: 
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  
0    -122.23  
1    -122.22  
2    -122.24  
3    -122.25  
4    -122.25  

Upvotes: 0

Taurus

Reputation: 1

Another option is:

  1. Download the cal_housing.tgz file manually as described above and copy it e.g. to C:\Temp.

  2. Open the file [YOUR_PYTHON_PATH]\Lib\site-packages\sklearn\datasets\base.py

  3. In the function _fetch_remote(), comment out the line urlretrieve(remote.url, file_path). Then Python won't try to download the file cal_housing.tgz again.

  4. Run d = datasets.fetch_california_housing(data_home='C://tmp//') and the file cal_housing_py3.pkz will be created.

  5. Afterwards, undo the change from step 3.

I know this is a little bit ugly because you have to modify an installed package file, but it works.

Upvotes: 0

Oceansunfish

Reputation: 52

I had the same problem: my development environment cannot access the web, so the default download_if_missing behavior of course could not work.

If you follow the URL, you will be able to save a cal_housing.tgz file from the web. But the fetch_california_housing method actually does some more conversion in the default download_if_missing branch and expects a .pkz file (as Vivek already pointed out above). I found a workaround to create that .pkz file here: (https://github.com/ageron/handson-ml/issues/221). I repost Aurélien's answer (kudos!) here:

Quote: " I can give you a workaround:

Manually download the data using your web browser: https://ndownloader.figshare.com/files/5976036 and make sure the downloaded file is named cal_housing.tgz. Then execute the following Python code:

import numpy as np
import os
import tarfile
from sklearn.externals import joblib  # removed in scikit-learn 0.23; use "import joblib" there
from sklearn.datasets.base import _pkl_filepath, get_data_home  # sklearn.datasets._base in newer versions

archive_path = "cal_housing.tgz" # change the path if it's not in the current directory
data_home = get_data_home(data_home=None) # change data_home if you are not using ~/scikit_learn_data
if not os.path.exists(data_home):
    os.makedirs(data_home)
filepath = _pkl_filepath(data_home, 'cal_housing.pkz')

with tarfile.open(mode="r:gz", name=archive_path) as f:
    cal_housing = np.loadtxt(
        f.extractfile('CaliforniaHousing/cal_housing.data'),
        delimiter=',')
    # Columns are not in the same order compared to the previous
    # URL resource on lib.stat.cmu.edu
    columns_index = [8, 7, 2, 3, 4, 5, 6, 1, 0]
    cal_housing = cal_housing[:, columns_index]

    joblib.dump(cal_housing, filepath, compress=6)

After that you should be able to use fetch_california_housing() without any problem. This works because the code above is what fetch_california_housing() does the first time you call it, after it downloaded the data: it prepares the data and saves it in cache in Scikit-Learn's data directory (which is by default $HOME/scikit_learn_data). The next times you call fetch_california_housing(), it just loads the data from this directory, so it will not need to download it.

Hope this helps, Aurélien "

End of Aurélien's quote.
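To confirm that the manual cache works without network access, as the question's note asks, one can load with download_if_missing=False. A sketch, assuming the cache file is named cal_housing_py3.pkz under the data home (the _py3 suffix is what _pkl_filepath adds on Python 3, matching the filename mentioned in the answer above):

```python
import os
from sklearn.datasets import fetch_california_housing, get_data_home

# Once the cache file exists, this load must not touch the network:
# download_if_missing=False turns a missing cache into an error rather
# than a download attempt.
cache = os.path.join(get_data_home(), 'cal_housing_py3.pkz')
if os.path.exists(cache):
    d = fetch_california_housing(download_if_missing=False)
    print(d.data.shape)  # (20640, 8) for the full dataset
```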

Upvotes: 4
