Reputation: 824
I have a pandas DataFrame like this:
   year  week            city  avg_rank
0  2016    52           Paris         1
1  2016    52  Gif-sur-Yvette         2
2  2016    52           Paris         1
3  2017     1           Paris         4
4  2016    52           Paris         3
5  2016    52           Paris         5
6  2016    52           Paris         2
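For reproducibility, here is a minimal sketch that rebuilds this sample (plain int/str columns assumed):
import pandas as pd

df = pd.DataFrame({
    'year': [2016, 2016, 2016, 2017, 2016, 2016, 2016],
    'week': [52, 52, 52, 1, 52, 52, 52],
    'city': ['Paris', 'Gif-sur-Yvette', 'Paris', 'Paris', 'Paris', 'Paris', 'Paris'],
    'avg_rank': [1, 2, 1, 4, 3, 5, 2],
})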
But this line of code:
df['real_index'] = df.groupby(by=['year', 'week', 'city']).avg_rank.rank(method='first')
raises the following stack trace:
/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.pyc in rank(self, axis, method, numeric_only, na_option, ascending, pct)
/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.pyc in wrapper(*args, **kwargs)
590 *args, **kwargs)
591 except(AttributeError):
592 raise ValueError
593
594 return wrapper
ValueError:
I have no NaN values in those columns of my DataFrame. I am using Python 2.7 with pandas 0.18.1 and numpy 1.11.0.
My DataFrame has about 9,000,000 rows and 15 columns.
What is more intriguing is that when I run this same line on subsets of my DataFrame (chunks of 1,000,000 rows each), no ValueError is raised.
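For reference, a sketch of such a per-chunk check (the exact chunk boundaries and iloc slicing are assumptions):
for start in range(0, df.shape[0], 1000000):
    subset = df.iloc[start:start + 1000000]
    # Each of these calls completes without raising:
    subset.groupby(by=['year', 'week', 'city']).avg_rank.rank(method='first')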
Is it known behavior that pandas does not handle fairly large DataFrames well, or did I miss something? Any help is welcome!
Upvotes: 3
Views: 2189
Reputation: 824
Since my DataFrame was built from several files, I noticed that some index values were duplicated.
With
df.index = np.arange(df.shape[0])
just after loading the data, it now works.
Indeed, my hypothesis is that some of the groupby groups contained rows sharing the same index value. When I tried subsets of my DataFrame, this collision fortunately/unfortunately never occurred. However, the error message is not very informative.
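For illustration, here is a minimal sketch of the failure mode and of the fix, with two small frames standing in for the several input files (the exact exception raised varies across pandas versions):
import numpy as np
import pandas as pd

# Two chunks, each carrying its own 0-based index, as if loaded from two files.
chunk1 = pd.DataFrame({'year': [2016, 2016], 'week': [52, 52],
                       'city': ['Paris', 'Paris'], 'avg_rank': [1, 3]})
chunk2 = pd.DataFrame({'year': [2016, 2017], 'week': [52, 1],
                       'city': ['Paris', 'Paris'], 'avg_rank': [5, 4]})

df = pd.concat([chunk1, chunk2])   # index is now 0, 1, 0, 1
print(df.index.is_unique)          # False: rows in the same group share labels

# Rebuilding a unique index fixes it (df.reset_index(drop=True) is equivalent):
df.index = np.arange(df.shape[0])
df['real_index'] = df.groupby(by=['year', 'week', 'city']).avg_rank.rank(method='first')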
Upvotes: 9
Reputation: 4787
The data is probably too large to fit in memory, so breaking it up into multiple smaller files makes sense. How big is your dataset? Where does the data come from, a CSV file or a database? Maybe you should check out blaze: https://github.com/blaze/blaze
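If the source is a CSV, one common pandas pattern (separate from blaze) is to stream it in chunks; a sketch with a hypothetical filename:
import pandas as pd

# 'data.csv' is a hypothetical filename; chunksize is the number of rows per chunk.
pieces = []
for chunk in pd.read_csv('data.csv', chunksize=1000000):
    pieces.append(chunk)
df = pd.concat(pieces, ignore_index=True)  # ignore_index avoids duplicated labels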
Upvotes: 0