user3046211

Reputation: 696

Error in importing text file from a URL to a pandas dataframe

I'm trying to import the contents of a text file into a pandas dataframe from a web page.

However, when I run the code below and try to print the column names, I get an error.

import pandas as pd

df = pd.read_csv(
    "http://cs.joensuu.fi/sipu/datasets/s1.txt",
    index_col=None,
    sep=" "
)

This results in the following error:

File "/Users/user/Desktop/Folder/Src/spiral.py", line 8, in <module>
    df = pd.read_csv('http://cs.joensuu.fi/sipu/datasets/s1.txt', index_col=None, sep=" ")
  File "/Users/user/miniforge3/envs/test_venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/Users/user/miniforge3/envs/test_venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 468, in _read
    return parser.read(nrows)
  File "/Users/user/miniforge3/envs/test_venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1057, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/Users/user/miniforge3/envs/test_venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 2061, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 756, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 771, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 827, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1951, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 9 fields in line 1334, saw 10

How can I import the text file from the URL above into a pandas dataframe with separate columns?

Upvotes: 1

Views: 397

Answers (1)

Laurent

Reputation: 13478

This error message means that one line does not have the expected number of fields: the parser expected 9 (the count it found on the first line) but saw 10 on line 1334.
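A likely reason for the mismatch is that sep=" " treats every single space as a delimiter, so rows padded with a different number of spaces produce a different number of fields. The sketch below only illustrates that counting; the two sample rows are hypothetical and assume the file right-aligns its numbers with spaces, which the empty "Unnamed" columns in the output further down suggest.

```python
# Hypothetical rows, assumed to be right-aligned with spaces (not copied from the file).
row_a = "    664159    550946"  # six-digit first value
row_b = "     89604    572029"  # five-digit first value, so one extra padding space

# Splitting on a single space mimics what sep=" " does: each space is a delimiter,
# and consecutive spaces yield empty fields.
fields_a = row_a.split(" ")
fields_b = row_b.split(" ")

print(len(fields_a))  # 9
print(len(fields_b))  # 10
```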

As stated in the pandas documentation for read_csv(), you can skip the faulty rows by setting on_bad_lines to "skip" (available from pandas 1.3; older versions use error_bad_lines=False instead), like this:

import pandas as pd

df = pd.read_csv(
    "http://cs.joensuu.fi/sipu/datasets/s1.txt",
    index_col=None,
    sep=" ",
    on_bad_lines="skip",
)

print(df)
# Outputs
      Unnamed: 0  Unnamed: 1  Unnamed: 2  ...  Unnamed: 6  Unnamed: 7  550946
0            NaN         NaN         NaN  ...         NaN         NaN  557965
1            NaN         NaN         NaN  ...         NaN         NaN  575538
2            NaN         NaN         NaN  ...         NaN         NaN  551446
3            NaN         NaN         NaN  ...         NaN         NaN  608046
4            NaN         NaN         NaN  ...         NaN         NaN  557588
...          ...         ...         ...  ...         ...         ...     ...
4921         NaN         NaN         NaN  ...         NaN         NaN  853940
4922         NaN         NaN         NaN  ...         NaN         NaN  863963
4923         NaN         NaN         NaN  ...         NaN         NaN  861267
4924         NaN         NaN         NaN  ...         NaN         NaN  858702
4925         NaN         NaN         NaN  ...         NaN         NaN  842566
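Note that the output above is almost entirely NaN, and that the first data row was consumed as the header. If the goal is to get the values into separate columns without dropping rows, one option is to split on runs of whitespace rather than on a single space. This is a minimal sketch, assuming the file holds two space-padded integer columns and no header row; the sample rows and the column names "x" and "y" are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample imitating the assumed layout of the file (space-padded, no header).
sample = (
    "    664159    550946\n"
    "    665845    557965\n"
    "     89604    575538\n"
)

df = pd.read_csv(
    io.StringIO(sample),  # stand-in for the URL so the sketch runs offline
    sep=r"\s+",           # one or more whitespace characters act as a single delimiter
    header=None,          # the first row is data, not column names
    names=["x", "y"],     # illustrative column names
)

print(df.shape)  # (3, 2)
```

To try this on the real data, pass the URL in place of the io.StringIO buffer. With this separator, no rows should need to be skipped if the layout assumption holds.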

Upvotes: 1
