Reputation: 21947
Does anyone have a Python API for fetching various ML datasets, along the lines of
X, Y, info = mldata.load( name, db=, verbose= )
X: N x dim data, a NumPy array
Y: N, ints for class numbers or None
info: a dict with ...
I'd prefer plain Python with NumPy, but if an RPy function could just fetch the data, that might be OK (sorry, I don't speak much R).
For a "db", a flat file would be fine, like
#! http://archive.ics.uci.edu/ml/machine-learning-databases
# ncol nrow nclass year name etc.
3 2858 2 2008 "Character+Trajectories" Time-Series Classification, Clus
4 150 2 1988 "Iris" Multivariate Classification Real
8 768 2 1990 "Pima+Indians+Diabetes" Multivariate Classification Inte
...
Why just flat files instead of "real" dbs? Because I can download them once, then browse, sort, and awk them with near-zero effort; others may prefer a fancy search engine.
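To make the flat-file idea concrete, here is a minimal sketch of parsing such an index with the standard library only. The `DatasetEntry` and `parse_index` names are hypothetical, and the column order (ncol nrow nclass year name, then free-form tags) is assumed from the header comment in the example above.

```python
import shlex
from typing import NamedTuple


class DatasetEntry(NamedTuple):
    # Hypothetical record type; fields follow the "# ncol nrow nclass year name etc." header.
    ncol: int
    nrow: int
    nclass: int
    year: int
    name: str
    tags: list


def parse_index(text):
    """Parse a flat-file dataset index into (base_url, entries)."""
    base_url = None
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "...":
            continue  # skip blank lines and elision markers
        if line.startswith("#!"):
            base_url = line[2:].strip()  # assumed: "#!" line carries the base URL
            continue
        if line.startswith("#"):
            continue  # ordinary comment, e.g. the column header
        fields = shlex.split(line)  # honors the double quotes around the name
        ncol, nrow, nclass, year = (int(f) for f in fields[:4])
        entries.append(DatasetEntry(ncol, nrow, nclass, year, fields[4], fields[5:]))
    return base_url, entries


sample = '''#! http://archive.ics.uci.edu/ml/machine-learning-databases
# ncol nrow nclass year name etc.
4 150 2 1988 "Iris" Multivariate Classification Real
'''
base_url, entries = parse_index(sample)
print(base_url, entries[0].name, entries[0].nrow)
```

`shlex.split` is used only because the names are quoted; a name containing an unbalanced quote character would need different handling.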
Whether data is stored locally or loaded over the web doesn't matter to me. (Do both, with an env variable MLDATAPATH = ( local dir ... url ... )?)
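One way the MLDATAPATH idea might work, as a rough sketch: treat the variable as a whitespace-separated search list and classify each item as a local directory or a URL. The separator, the `search_locations` and `candidates` helper names, and the "location + / + name" layout are all assumptions for illustration, not an existing convention.

```python
import os
import urllib.parse


def search_locations(env=None, var="MLDATAPATH"):
    """Split a whitespace-separated search list into (kind, location) pairs."""
    env = os.environ if env is None else env
    locations = []
    for item in env.get(var, "").split():
        scheme = urllib.parse.urlparse(item).scheme
        # Anything with a web scheme counts as a URL; everything else as a directory.
        kind = "url" if scheme in ("http", "https", "ftp") else "dir"
        locations.append((kind, item))
    return locations


def candidates(name, env=None):
    """Yield (kind, path_or_url) places to look for dataset `name`, in search order."""
    for kind, loc in search_locations(env):
        if kind == "url":
            yield kind, loc.rstrip("/") + "/" + name
        else:
            yield kind, os.path.join(os.path.expanduser(loc), name)


# Hypothetical setting: local cache first, then a remote mirror.
env = {"MLDATAPATH": "~/mldata http://archive.ics.uci.edu/ml/machine-learning-databases"}
for kind, where in candidates("Iris", env):
    print(kind, where)
```

A loader could then try each candidate in order and stop at the first one that exists. Whitespace separation is chosen here because `os.pathsep` is `:` on Unix, which would split URLs; the cost is that directories with spaces in their names are not supported.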
(A basic API ought to be trivial for sites with uniform names and uniform data, but uniformizing e.g. uci/ml looks like quite a lot of dull work.)
Upvotes: 2
Views: 2058
Reputation: 738
You can check this package/code base for searching and importing any UCI ML repo dataset. It does not load the dataset into a Python object; it just searches the portal and downloads the datasets you choose. You can even select all datasets of a certain size and ML task category.
https://github.com/tirthajyoti/UCI-ML-API
Upvotes: 0
Reputation: 780
The folks from Scikits.learn (now scikit-learn) solved that problem in the Scikits.learn examples.
Datasets come in all shapes and sizes, though, so they do have custom code for dealing with each dataset. (It would be different if you only had, say, CSV or ARFF format datasets and not also grayscale images and whatnot.)
Upvotes: 1