Thibaut Loiseleur

Reputation: 824

ValueError in rank method in pandas without more explanation

I have a pandas DataFrame like this:

     year   week           city  avg_rank
0    2016     52          Paris         1
1    2016     52 Gif-sur-Yvette         2
2    2016     52          Paris         1
3    2017      1          Paris         4
4    2016     52          Paris         3
5    2016     52          Paris         5
6    2016     52          Paris         2

But this line of code:

df['real_index']=df.groupby(by=['year', 'week', 'city']).avg_rank.rank(method='first')

generates this stack trace:

/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.pyc in rank(self, axis, method, numeric_only, na_option, ascending, pct)

/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.pyc in wrapper(*args, **kwargs)
590                                                                 *args, **kwargs)
591                         except(AttributeError):
592                             raise ValueError
593
594             return wrapper

ValueError:

I have no NaN value in those columns of my DataFrame.

I am using python2.7 along with pandas 0.18.1 and numpy 1.11.0.

My DataFrame has about 9,000,000 rows and 15 columns.

What is more intriguing is that when I execute this line on subsets of my DataFrame (chunks of 1,000,000 rows each), no ValueError is raised.
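For reference, here is a self-contained version of the sample above; on a recent pandas this small case ranks without error, which suggests the problem is data-dependent rather than in the call itself:

```python
import pandas as pd

# Rebuild the sample DataFrame from the question.
df = pd.DataFrame({
    'year': [2016, 2016, 2016, 2017, 2016, 2016, 2016],
    'week': [52, 52, 52, 1, 52, 52, 52],
    'city': ['Paris', 'Gif-sur-Yvette', 'Paris', 'Paris', 'Paris', 'Paris', 'Paris'],
    'avg_rank': [1, 2, 1, 4, 3, 5, 2],
})

# method='first' ranks within each (year, week, city) group,
# breaking ties in the order the rows appear.
df['real_index'] = df.groupby(['year', 'week', 'city']).avg_rank.rank(method='first')
print(df['real_index'].tolist())  # [1.0, 1.0, 2.0, 1.0, 4.0, 5.0, 3.0]
```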

Is this a known issue where pandas does not handle fairly large DataFrames well, or did I miss something?

Any help is welcome!

Upvotes: 3

Views: 2189

Answers (2)

Thibaut Loiseleur

Reputation: 824

My DataFrame was built from several files, and I noticed that some index labels were duplicated.

With

df.index = np.arange(df.shape[0])

just after loading the data, it now works.
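A minimal sketch of the situation and the fix, using two hypothetical frames (`part1`, `part2`) as stand-ins for the real files; each starts its default integer index at 0, so concatenating them repeats index labels:

```python
import numpy as np
import pandas as pd

# Two hypothetical file chunks: each has its own default 0-based index,
# so after pd.concat the labels 0 and 1 each appear twice.
part1 = pd.DataFrame({'year': [2016, 2016], 'week': [52, 52],
                      'city': ['Paris', 'Paris'], 'avg_rank': [1, 1]})
part2 = pd.DataFrame({'year': [2016, 2017], 'week': [52, 1],
                      'city': ['Paris', 'Paris'], 'avg_rank': [3, 4]})
df = pd.concat([part1, part2])
assert df.index.has_duplicates  # the condition suspected of triggering the error

# Rebuild a unique 0..n-1 index right after loading, then rank.
df.index = np.arange(df.shape[0])
df['real_index'] = (df.groupby(['year', 'week', 'city'])
                      .avg_rank.rank(method='first'))
print(df['real_index'].tolist())  # [1.0, 2.0, 3.0, 1.0]
```

`pd.concat([...], ignore_index=True)` or `df.reset_index(drop=True)` would achieve the same unique index in one step.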

Indeed, my hypothesis is that some of the groups produced by the groupby contained rows with the same index label.

When I tried subsets of my DataFrame, this situation fortunately/unfortunately never occurred.

However, the error message is not very informative.

Upvotes: 9

sunwarr10r

Reputation: 4787

Perhaps the data is too large to fit in memory, in which case breaking it up into multiple smaller files makes sense. How large is your dataset? Where does the data come from, a CSV file or a database? Maybe you should check out blaze: https://github.com/blaze/blaze

Upvotes: 0
