Frits Verstraten

Reputation: 2179

Strange error when trying to randomize a dataset

I am trying to shuffle data with the following code:

import pandas as pd
import numpy as np

from sklearn.naive_bayes import MultinomialNB
data = pd.read_csv('dataset.txt')
np.random.shuffle(data)

Running this, however, gives me the following error. I don't understand where this error is coming from.

Traceback (most recent call last):
File "sample2.py", line 12, in <module>
 np.random.shuffle(data)
File "mtrand.pyx", line 4668, in mtrand.RandomState.shuffle (numpy/random/mtrand/mtrand.c:30498)
 File "mtrand.pyx", line 4671, in mtrand.RandomState.shuffle (numpy/random/mtrand/mtrand.c:30438)
 File "/Users/marcvanderpeet/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/frame.py", line 1992, in __getitem__
 return self._getitem_column(key)
 File "/Users/marcvanderpeet/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/frame.py", line 2004, in _getitem_column
 result = result[key]
 File "/Users/marcvanderpeet/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/frame.py", line 1992, in __getitem__
 return self._getitem_column(key)
 File "/Users/marcvanderpeet/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/frame.py", line 1999, in _getitem_column
 return self._get_item_cache(key)
 File "/Users/marcvanderpeet/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/generic.py", line 1345, in _get_item_cache
 values = self._data.get(item)
 File "/Users/marcvanderpeet/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/internals.py", line 3225, in get
 loc = self.items.get_loc(item)
 File "/Users/marcvanderpeet/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/indexes/base.py", line 1878, in get_loc
 return self._engine.get_loc(self._maybe_cast_indexer(key))
 File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc  (pandas/index.c:4027)
  File "pandas/index.pyx", line 157, in pandas.index.IndexEngine.get_loc (pandas/index.c:3891)
  File "pandas/hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12408)
  File "pandas/hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12359)

Any thoughts on what goes wrong here?

Upvotes: 1

Views: 1624

Answers (2)

simon

Reputation: 2821

You are applying a numpy function to a pandas dataframe.

You can convert the dataframe to a numpy array and shuffle that:

 np.random.shuffle(data.values)

Or you can use a pandas function:

data = data.sample(len(data))
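
A minimal sketch of shuffling rows without passing the dataframe to np.random.shuffle. The small frame below is a hypothetical stand-in for dataset.txt, and the column names and seeds are assumptions made for illustration. One caveat on the first option above: data.values may be a copy of the frame's contents (for example when the columns have different dtypes), so shuffling it in place might leave data itself unchanged unless you keep the shuffled array.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the contents of dataset.txt.
data = pd.DataFrame({"text": ["a", "b", "c", "d"], "label": [0, 1, 0, 1]})

# Option 1: pandas only. frac=1 draws every row exactly once, without replacement.
shuffled = data.sample(frac=1, random_state=0).reset_index(drop=True)

# Option 2: draw a random permutation of row positions with numpy,
# then select the rows by position with iloc.
rng = np.random.RandomState(0)
shuffled_np = data.iloc[rng.permutation(len(data))].reset_index(drop=True)
```

Both options return a new dataframe and keep each row intact, so the values in one row stay together. reset_index(drop=True) is optional; it only replaces the shuffled index with 0..n-1.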

Upvotes: 6

jadsq

Reputation: 3382

I don't really understand the whole traceback, but to me the error simply comes from the fact that a dataframe is not a numpy array. To fix it, just use the underlying array of the dataframe via data.values.

My guess for what happens in the traceback is that np.random.shuffle does not check whether the input is a valid array and tries to operate on and grab data from the dataframe the same way it would for a regular array, hence all the errors regarding __getitem__ and so on.
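
This guess is consistent with the __getitem__ and get_loc frames in the traceback: indexing a dataframe with an integer looks for a column with that label rather than a row at that position. A minimal sketch of that difference, using a hypothetical frame with string column names (an assumption, since the contents of dataset.txt are not shown):

```python
import pandas as pd

# Hypothetical frame; the real dataset.txt is not shown in the question.
data = pd.DataFrame({"text": ["a", "b", "c"], "label": [0, 1, 0]})

# data[0] is a column lookup by label. There is no column named 0,
# so pandas raises a KeyError instead of returning the first row.
try:
    data[0]
    lookup_error = None
except KeyError as exc:
    lookup_error = exc

# Row access by position needs iloc, or indexing the underlying array.
first_row = data.iloc[0]
first_row_via_array = data.values[0]
```

An in-place shuffle that swaps x[i] and x[j] by integer position would hit this same column lookup, which would explain why the traceback ends inside the index lookup code.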

Upvotes: 0
