xpt
xpt

Reputation: 22994

KeyError from pandas DataFrame groupby

This is a very strange error, I got KeyError when doing pandas DataFrame groupby for no obvious reason.

df = pd.read_csv('test.csv')
df.tail(5)

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165 entries, 0 to 164
Data columns (total 3 columns):
Id     165 non-null object
Time    165 non-null object
Val     165 non-null float64
dtypes: float64(1), object(2)
memory usage: 3.9+ KB

df.columns
Index([u'Id', u'Time', u'Val'], dtype='object')

df.groupby(['Id'])
KeyErrorTraceback (most recent call last)
<ipython-input-24-bba5c2dc5f75> in <module>()
----> 1 df.groupby(['Id'])

/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, **kwargs)
   3776         return groupby(self, by=by, axis=axis, level=level, as_index=as_index,
   3777                        sort=sort, group_keys=group_keys, squeeze=squeeze,
-> 3778                        **kwargs)
...
/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in get(self, item, fastpath)
   3288 
   3289             if not isnull(item):
-> 3290                 loc = self.items.get_loc(item)
   3291             else:
   3292                 indexer = np.arange(len(self.items))[isnull(self.items)]

/usr/local/lib/python2.7/dist-packages/pandas/indexes/base.pyc in get_loc(self, key, method, tolerance)
   1945                 return self._engine.get_loc(key)
   1946             except KeyError:
-> 1947                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   1948 
   1949         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4154)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4018)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12368)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12322)()

KeyError: 'Id'

Note that, using df.columns = df.columns.map(str.strip) as suggested doesn't make any different -- I'm still getting the exact same output from df.columns and error as above:

df.columns = df.columns.map(str.strip)
df.columns
Out[38]:
Index([u'Id', u'Time', u'Val'], dtype='object')

If there is anywhere that I can post this "test.csv", I can do it, because I am almost certain that the problem is the format of the file -- the "test.csv" is Windows based, and is output from SQL Server SSMS. This is very important, as I opened, copied & saved the exactly content using Notepad++, and there won't be such problem with the newly saved file.

Using file test.csv under Linux shows:

test.csv: UTF-8 Unicode (with BOM) text, with CRLF line terminators

Here are the top several bytes from the file:

0000000 ef bb bf 49 64 2c 54 69 - 6d 65 2c 56 61 6c 0d 0a  Id,Time,Val..
0000020 54 35 31 31 35 2c 30 30 - 3a 30 30 3a 30 30 2c 32  T5115,00:00:00,2
0000040 30 2e 38 31 39 0d 0a 54 - 35 31 31 35 2c 30 30 3a  0.819..T5115,00:
0000060 30 30 3a 30 33 2c 31 36 - 2e 39 32 36 0d 0a 54 35  00:03,16.926..T5
0000100 31 31 35 2c 30 30 3a 30 - 30 3a 30 38 2c 31 31 2e  115,00:00:08,11.
0000120 33 34 33 0d 0a 54 35 31 - 31 35 2c 30 30 3a 30 30  343..T5115,00:00
0000140 3a 31 37 2c 36 2e 39 37 - 35 0d 0a 54 35 31 31 35  :17,6.975..T5115
0000160 2c 30 30 3a 30 30 3a 32 - 39 2c 31 33 2e 35 35 33  ,00:00:29,13.553
0000200 0d 0a 54 35 31 31 35 2c - 30 30 3a 30 30 3a 33 35  ..T5115,00:00:35

Any idea how to solve it? Thx.

Upvotes: 0

Views: 5292

Answers (2)

xpt
xpt

Reputation: 22994

Got to the bottom of it -- it is in fact that the Windows based csv files IS the root cause.

Proofs:

  1. I opened, copied & saved the exactly content using Notepad++, and there won't be such problem with the newly saved file.
  2. If I convert it using dos2unix under Linux, then try the above same code, it would work. The groupby will not threw exception any more.

Filed a bug at https://github.com/pandas-dev/pandas/issues/16690

And the solution to it, if the pandas is earlier than 0.19, use encoding='utf-8-sig' in pd.read_csv:

df = pd.read_csv('test.csv', encoding='utf-8-sig')

Upvotes: 1

Ben G.
Ben G.

Reputation: 306

Line terminators in Windows differ from other operating systems--in ASCII encoding, the newline character in Unix-like operating systems is LF, in Windows it's CRLF. In order to maintain compatibility between systems, Git allows the option to save files with the CRLF line ending in Windows and LF in other operating systems. This is causing your problem--when pandas.read_csv runs on your file saved with SQL Server, it has the CRLF line endings and pandas interprets it as having an extra CR character at the end of every line.

Luckily for you, the read_csv function includes a parameter lineterminator which you can set to "\r", the CR character, to have your lines read properly.

Upvotes: 1

Related Questions