Reputation: 2179
I am trying to shuffle data with the following code:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import MultinomialNB
data = pd.read_csv('dataset.txt')
np.random.shuffle(data)
Running this, however, gives me the following error. I don't understand where it is coming from.
Traceback (most recent call last):
File "sample2.py", line 12, in <module>
np.random.shuffle(data)
File "mtrand.pyx", line 4668, in mtrand.RandomState.shuffle (numpy/random/mtrand/mtrand.c:30498)
File "mtrand.pyx", line 4671, in mtrand.RandomState.shuffle (numpy/random/mtrand/mtrand.c:30438)
File "/Users/marcvanderpeet/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/frame.py", line 1992, in __getitem__
return self._getitem_column(key)
File "/Users/marcvanderpeet/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/frame.py", line 2004, in _getitem_column
result = result[key]
File "/Users/marcvanderpeet/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/frame.py", line 1992, in __getitem__
return self._getitem_column(key)
File "/Users/marcvanderpeet/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/frame.py", line 1999, in _getitem_column
return self._get_item_cache(key)
File "/Users/marcvanderpeet/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/generic.py", line 1345, in _get_item_cache
values = self._data.get(item)
File "/Users/marcvanderpeet/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/internals.py", line 3225, in get
loc = self.items.get_loc(item)
File "/Users/marcvanderpeet/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/indexes/base.py", line 1878, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/index.pyx", line 137, in pandas.index.IndexEngine.get_loc (pandas/index.c:4027)
File "pandas/index.pyx", line 157, in pandas.index.IndexEngine.get_loc (pandas/index.c:3891)
File "pandas/hashtable.pyx", line 675, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12408)
File "pandas/hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12359)
Any thoughts on what goes wrong here?
Upvotes: 1
Views: 1624
Reputation: 2821
You are applying a numpy function to a pandas dataframe.
You can convert the dataframe to a numpy array and shuffle that:
np.random.shuffle(data.values)
Or you can use a pandas function:
data = data.sample(len(data))
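A minimal sketch of the pandas route, assuming a small made-up frame in place of dataset.txt (the column names and values here are hypothetical). One caveat about the first option above: data.values may return a copy rather than a view (for example when the columns have mixed types), in which case shuffling it would not be reflected in data. Staying within pandas, or indexing with a NumPy permutation, avoids that question:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's dataset.txt
data = pd.DataFrame({"text": ["a", "b", "c", "d", "e"],
                     "label": [0, 1, 0, 1, 1]})

# Option 1: sample all rows without replacement.
# random_state is optional and only makes the order repeatable.
shuffled = data.sample(frac=1, random_state=0).reset_index(drop=True)

# Option 2: draw a random row order with NumPy and select rows by position.
order = np.random.permutation(len(data))
shuffled_np = data.iloc[order].reset_index(drop=True)

print(shuffled)
print(shuffled_np)
```

Both variants return a new dataframe with whole rows reordered, so each text stays paired with its label. reset_index(drop=True) is only needed if you want a fresh 0..n-1 index afterwards.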
Upvotes: 6
Reputation: 3382
I don't fully understand the whole traceback, but to me the error simply comes from the fact that a dataframe is not a numpy array. To fix it, use the actual underlying array of the dataframe via data.values.
My guess for what happens in the traceback is that np.random.shuffle does not check whether the input is a valid array and tries to grab data from the dataframe the same way it would from a regular array, hence all the __getitem__ frames in the traceback.
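The guess above can be illustrated with a small sketch. The frame below is made up, and since the final line of the traceback is not shown in the question, the KeyError here is an assumption that merely fits the get_loc frames. The point is that integer indexing on a dataframe looks up a column label, not a row position:

```python
import pandas as pd

# Hypothetical frame with string column labels, like a typical CSV
data = pd.DataFrame({"text": ["a", "b", "c"], "label": [0, 1, 0]})

# data[0] asks for a column named 0, which does not exist here
try:
    data[0]
    outcome = "no error"
except KeyError as exc:
    outcome = "KeyError"
    print("KeyError for key:", exc)

# Positional row access goes through .iloc instead
first_row = data.iloc[0]
print(first_row["text"])
```

A function that swaps elements via x[i] and x[j], as an in-place shuffle would on an array, therefore ends up in the column lookup path on a dataframe.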
Upvotes: 0