Raúl Casado
Raúl Casado

Reputation: 445

Pandas data frame from csv. Columns with same name

I have a csv with lot of columns (1314):

ColumnA   ColumnA   ColumnA   ColumnB   ColumnC   ColumnB   ColumnM
      5         9         5         1         6         8         9
      5         1         3         5         8         6         8

I would like to group by column summarizing the values, but when I try to get a data frame from this csv, the columns change their names to:

ColumnA   ColumnA.1   ColumnA.2   ColumnB   ColumnC   ColumnB.1   ColumnM
      5           9           5         1         6           8         9
      5           1           3         5         8           6         8

So I can't group by columns...

Is there any way to create a data frame from this csv keeping the name of the columns?

Upvotes: 0

Views: 441

Answers (2)

jezrael
jezrael

Reputation: 863256

Use Series.str.split with indexing by str:

df.columns = df.columns.str.split('.').str[0]
print (df)
   ColumnA  ColumnA  ColumnA  ColumnB  ColumnC  ColumnB  ColumnM
0        5        9        5        1        6        8        9
1        5        1        3        5        8        6        8

If want use groupby then not necessary remove them:

df = df.groupby(lambda x: x.split('.')[0], axis=1).sum()
print (df)
  ColumnA  ColumnB  ColumnC  ColumnM
0       19        9        6        9
1        9       11        8        8

Upvotes: 2

Quang Hoang
Quang Hoang

Reputation: 150785

If you actual column name does not contain ., you can change the column name back with:

df = pd.read_csv(path_to_csv)
df.columns = df.columns.str.extract('^([^\.]*)')[0].values

Output:

   ColumnA  ColumnA  ColumnA  ColumnB  ColumnC  ColumnB  ColumnM
0        5        9        5        1        6        8        9
1        5        1        3        5        8        6        8

Upvotes: 1

Related Questions