toom

Reputation: 13327

read_csv appends a .1 to column name

I just noticed a strange problem when reading a CSV file with pandas read_csv.

When I open my file in an editor, the header looks like this (there are a lot of columns, so I have replaced most of them with ... here):

tag_identifier,a,article,aside,b,body,br,button,circle,clippath,dd,defs,desc,div,dl,dt,em,fecolormatrix,...,uses_most_common_font_family,uses_most_common_font_size,uses_most_common_font_family_and_size,is_article

When I then run df = pd.read_csv("/path/to/csv-file.csv")

and then check the columns like this: print(df.columns.tolist())

I suddenly get this output:

['tag_identifier', 'a', 'article', 'aside', 'b', 'body', 'br', 'button' ..., 'title', 'tspan', 'ul', 'use', 'a.1', 'article.1', 'aside.1', 'b.1', 'body.1', 'br.1', 'button.1', ... ]

As you can see, the column names that correspond to an HTML tag are duplicated, with .1 appended to the copy.

For example, the body column is copied and the copy is named body.1.

So I end up with two columns, body and body.1, which I can access via df["body"] and df["body.1"] respectively.

Even stranger, this only happens to the HTML-tag column names. All other column names are unaffected.

Does anybody have an idea what could cause this issue?

Upvotes: 2

Views: 1426

Answers (1)

Robert Navado

Reputation: 1329

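This means your file has duplicate column names: the HTML-tag names appear more than once in the header, so pandas keeps every column and renames the repeated ones by appending .1. Either rename the columns in the file or, if they really hold the same data, remove the duplicates from the data. Alternatively, you can filter them out after loading using pandas tools.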
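As a rough sketch of how you could check this and filter the copies out: the snippet below uses a small made-up CSV string (csv_text is a hypothetical stand-in for your file, with a and body repeated in the header). It reads the raw header to find names that occur more than once, then drops a renamed .1 column only when its values are identical to the original column. It assumes each name is repeated at most once.

```python
import csv
import io
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the real file: "a" and "body" appear twice in the header.
csv_text = (
    "tag_identifier,a,body,is_article,a,body\n"
    "x1,1,2,0,1,2\n"
    "x2,3,4,1,3,4\n"
)

# Read only the raw header row and collect the names that occur more than once.
header = next(csv.reader(io.StringIO(csv_text)))
duplicated = [name for name, count in Counter(header).items() if count > 1]
print(duplicated)  # ['a', 'body']

# read_csv keeps every column and renames the repeats to 'a.1', 'body.1'.
df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())

# Drop a renamed copy only if it holds the same values as the original column.
for name in duplicated:
    copy = f"{name}.1"
    if copy in df.columns and df[name].equals(df[copy]):
        df = df.drop(columns=copy)

print(df.columns.tolist())  # ['tag_identifier', 'a', 'body', 'is_article']
```

For your real file, replace io.StringIO(csv_text) with an open file handle or the file path. If the copies turn out to hold different values, renaming them to something meaningful is probably safer than dropping them.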

Upvotes: 1
