LotsOfQuestions94
LotsOfQuestions94

Reputation: 11

Pandas read_csv with incorrect header delimiter

I have a csv file, say test.csv and looks like this

Data 1 | Data 2 | Data 3 |
1      | 2      | 3      |      |      |    6

If I open up the csv in notepad++ or text, it shows up like this

Data 1, Data 2, Data 3
1,2,3,,,6

instead of

Data 1, Data 2, Data 3, ,,
1,2,3,,,6

I get an error using pandas.read_csv because my rows have more columns than declared in my header row. Are there ways to find the maximum number of columns of the entire CSV?

Upvotes: 0

Views: 2122

Answers (1)

Valdi_Bo
Valdi_Bo

Reputation: 30981

My proposition is to:

  • skip the title row,
  • specify your own list of column names.

So the code should be:

df = pd.read_csv(io.StringIO(txt), skiprows=1,
    names=['Data 1', 'Data 2', 'Data 3', 'Data 4', 'Data 5', 'Data 6'])

A detail to consider is how should be represented empty cells.

The default behaviour of read_csv is to treat them as NaN, but the downside is then that such columns are coerced to float.

Another possibility is to pass na_filter=False parameter and then:

  • such empty strings are represented "as is" (as an empty string),
  • but columns with these values are this time coerced to object, because string cells are intermixed with int.

Note also that the format of your title row is corrupted in one more detail. The source of problem are spaces after commas: They become the initial part of the next column name.

So even if you had just 3 columns of data, the column names would be:

['Data 1', ' Data 2', ' Data 3']

(note the leading spaces). So this is yet another reason to skip the title row.

Upvotes: 1

Related Questions