Reputation: 641
I'm trying to read a CSV file into Pandas from an SFTP server using Paramiko:
with sftp.open(path + file.filename) as fp:
    fp_aux = pd.read_csv(fp, sep='|')
But when attempting it, it throws this error:
'utf-8' codec can't decode byte 0xa3 in position 73: invalid start byte
I've tried passing different values to the encoding argument of pd.read_csv (unicode_escape, latin-1, latin1, latin, utf-8...). I have also tried engine='python', but no luck so far. Is there anything else I can try? If not, how can I ignore the error and continue to the next line or the next df?
This happens only when I read from the SFTP server; it works fine when I read the file from my local disk.
Complete callstack of the error:
UnicodeDecodeError Traceback (most recent call last)
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._string_convert()
pandas\_libs\parsers.pyx in pandas._libs.parsers._string_box_utf8()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 83: invalid start byte
During handling of the above exception, another exception occurred:
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-41-53537b824736> in <module>
1 with sftp.open(r'/Debtopdcarich/Mandatory File/MandatoryFile_190721.csv') as fp:
----> 2 fp_aux = (pd.read_csv(fp, encoding='iso-8859-1', sep='|'))
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
603 kwds.update(kwds_defaults)
604
--> 605 return _read(filepath_or_buffer, kwds)
606
607
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
461
462 with parser:
--> 463 return parser.read(nrows)
464
465
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
1050 def read(self, nrows=None):
1051 nrows = validate_integer("nrows", nrows)
-> 1052 index, columns, col_dict = self._engine.read(nrows)
1053
1054 if index is None:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py in read(self, nrows)
2054 def read(self, nrows=None):
2055 try:
-> 2056 data = self._reader.read(nrows)
2057 except StopIteration:
2058 if self._first_chunk:
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.read()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()
pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._string_convert()
pandas\_libs\parsers.pyx in pandas._libs.parsers._string_box_utf8()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 83: invalid start byte
Upvotes: 1
Views: 1989
Reputation: 202292
Pandas seems to be confused by the Paramiko file-like object API: when given a Paramiko file-like object, it does not use its encoding argument.
A quick and dirty solution is to read the remote file into an in-memory file-like object and pass that to Pandas. The encoding argument is then used.
from io import BytesIO

# Download the whole remote file into memory, then rewind before parsing
flo = BytesIO()
sftp.getfo(path + file.filename, flo)
flo.seek(0)
pd.read_csv(flo, sep='|', encoding='iso-8859-1')
A more efficient approach might be to build a wrapper class on top of the Paramiko file-like object, exposing an API that Pandas can work with.
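One way such a wrapper could look is sketched below, using only the standard io module. It assumes nothing about the wrapped object beyond a read(n) method that returns bytes, which the Paramiko file object provides. The names ReadOnlyRawAdapter and as_text_stream are made up for this illustration. Decoding is done by io.TextIOWrapper, so Pandas receives already-decoded text and its own encoding handling is bypassed. I have not tested this against the Pandas version in the traceback.

```python
import io


class ReadOnlyRawAdapter(io.RawIOBase):
    """Expose any object with a read(n) method as a standard raw byte stream.

    Hypothetical helper: `source` is assumed to return bytes from read(n),
    as the object returned by sftp.open(...) does.
    """

    def __init__(self, source):
        super().__init__()
        self._source = source

    def readable(self):
        return True

    def readinto(self, buffer):
        # Pull at most len(buffer) bytes from the wrapped object.
        chunk = self._source.read(len(buffer))
        if not chunk:
            return 0  # end of file
        n = len(chunk)
        buffer[:n] = chunk
        return n


def as_text_stream(source, encoding="iso-8859-1"):
    """Return a text stream that decodes `source` lazily, chunk by chunk."""
    raw = ReadOnlyRawAdapter(source)
    return io.TextIOWrapper(io.BufferedReader(raw), encoding=encoding, newline="")
```

With the question's names, usage would be along the lines of pd.read_csv(as_text_stream(fp), sep='|') inside the with sftp.open(...) block. Unlike the BytesIO approach, this does not hold the whole file in memory, although throughput still depends on how fast the underlying SFTP reads are.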
Upvotes: 3