Reputation: 7879
I am processing a large text file (500k lines), formatted as below:
S1_A16
0.141,0.009340221649748676
0.141,4.192618196894668E-5
0.11,0.014122135626540204
S1_A17
0.188,2.3292323316081486E-6
0.469,0.007928706856794138
0.172,3.726771730573038E-5
I'm using the code below to return the correlation coefficient of each series, e.g. S1_A16:
import numpy as np
import pandas as pd
import csv
pd.options.display.max_rows = None
fileName = 'wordUnigramPauseTEST.data'
df = pd.read_csv(fileName, names=['pause', 'probability'])
mask = df['pause'].str.match(r'^S\d+_A\d+')
df['S/A'] = (df['pause']
             .where(mask, np.nan)
             .ffill())
df = df.loc[~mask]
result = df.groupby(['S/A']).apply(lambda grp: grp['pause'].corr(grp['probability']))
print(result)
However, on some large files, this returns the error:
Traceback (most recent call last):
  File "/Users/adamg/PycharmProjects/Subj_AnswerCorrCoef/GetCorrCoef.py", line 15, in <module>
    print(result)
  File "/Users/adamg/anaconda/lib/python2.7/site-packages/pandas/core/base.py", line 35, in __str__
    return self.__bytes__()
  File "/Users/adamg/anaconda/lib/python2.7/site-packages/pandas/core/base.py", line 47, in __bytes__
    return self.__unicode__().encode(encoding, 'replace')
  File "/Users/adamg/anaconda/lib/python2.7/site-packages/pandas/core/series.py", line 857, in __unicode__
    result = self._tidy_repr(min(30, max_rows - 4))
TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'
I understand that this is related to the print statement, but how do I fix it?
EDIT: This is related to the maximum number of rows. Does anyone know how to accommodate a greater number of rows?
Reputation: 879749
The error message:
TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'
is saying that None minus an int is a TypeError. If you look at the next-to-last line in the traceback, you see that the only subtraction going on there is max_rows - 4, so max_rows must be None. If you dive into /Users/adamg/anaconda/lib/python2.7/site-packages/pandas/core/series.py near line 857 and ask yourself how max_rows could end up being equal to None, you'll see that get_option("display.max_rows") must be returning None, which is the value your script assigns with pd.options.display.max_rows = None.
This part of the code is calling _tidy_repr, which is used to summarize the Series. None is the correct value to set when you want pandas to display all lines of the Series, so this part of the code should not have been reached when max_rows is None.
I've made a pull request to correct this.
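Until a fixed version is installed, one possible workaround is to avoid relying on max_rows = None when printing. The sketch below shows two options, assuming a small made-up Series in place of the real groupby result. I have not tested it against the exact pandas version in the traceback, so treat it as a starting point rather than a guaranteed fix.

```python
import pandas as pd

# Stand-in for the real groupby result; the values are illustrative only.
result = pd.Series([0.12, -0.48, 0.90],
                   index=['S1_A16', 'S1_A17', 'S1_A18'])

# Option 1: build the full text explicitly instead of relying on the
# display options used by print(result).
full_text = result.to_string()
print(full_text)

# Option 2: temporarily set display.max_rows to an integer that is
# larger than the Series, so no summarizing is needed.
with pd.option_context('display.max_rows', len(result) + 1):
    limited_text = repr(result)
print(limited_text)
```

In the original script this would mean dropping the pd.options.display.max_rows = None line and printing result.to_string() instead of result.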