HackXIt

Reputation: 542

Join columns which were mangled on import

When using pd.read_csv('myfile.csv', delimiter=';') on a CSV that contains duplicated column names, pandas mangles the duplicates by appending .1, .2, ..., .# (where # is the running number of the duplicate).

My example csv looks like this:

Data1;Data2;A;B;B;C;C
abc;;text1;text2;text3;text4;text5
def;456;text2;text4;text3;text5;text1

which, read as a table, is:

Data1  Data2  A      B      B      C      C
abc    NaN    text1  text2  text3  text4  text5
def    456    text2  text4  text3  text5  text1

After importing into a dataframe, the duplicated columns are mangled:

  Data1  Data2      A      B    B.1      C    C.1
0   abc    NaN  text1  text2  text3  text4  text5
1   def  456.0  text2  text4  text3  text5  text1

This output is expected.

But I wish to combine these duplicated columns, with their row values joined as comma-separated strings.

So the desired output would look like this (the order of the columns is not important):

Data1  Data2  A      B            C
abc    NaN    text1  text2,text3  text4,text5
def    456    text2  text4,text3  text5,text1

How can I achieve that with pandas in python?

I found the following question when searching for the problem:

Concatenate cells into a string with separator pandas python

But I don't know how to apply the answer from that question to only those columns which are mangled.
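
For reference, the mangled columns can be picked out by the numeric suffix pandas appends; a minimal sketch, assuming the imported dataframe is df:

import re

# columns that were mangled on import end in '.<number>'
mangled_cols = [c for c in df.columns if re.search(r'\.\d+$', c)]

# their original names, with the suffix stripped
base_names = {re.sub(r'\.\d+$', '', c) for c in mangled_cols}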

Upvotes: 0

Views: 132

Answers (3)

w-m

Reputation: 11232

Here's another solution, grouping the mangled columns with a regular expression:

# (.*?): capture the original column name at the beginning of the string
# potentially followed by a dot and at least one digit
shared_groupname = r"(.*?)(?:\.\d+)?$"

Let's see that in action:

>>> df.columns.str.extract(shared_groupname) 
       0
0  Data1
1  Data2
2      A
3      B
4      B
5      C
6      C

Then group by this original column name and apply the join:

grouped = df.groupby(df.columns.str.extract(shared_groupname, expand=False), axis=1)

res = grouped.apply(lambda x: x.dropna().astype(str).apply(', '.join, axis=1))
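
Note: recent pandas releases deprecate groupby(..., axis=1). If your version warns about it, grouping the transposed frame is an equivalent sketch (assuming df and shared_groupname from above):

shared = df.columns.str.extract(shared_groupname, expand=False)

res = (df.T
         .groupby(shared.values)                             # group the transposed rows by base column name
         .agg(lambda s: ', '.join(s.dropna().astype(str)))   # join the duplicated values per original row
         .T)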

Upvotes: 0

mozway

Reputation: 261900

You can rename your columns, stack, join, unstack:

df = pd.read_csv('filename.csv', sep=';')

# remove the '.x' in columns
df.columns = df.columns.map(lambda x: x.split('.')[0])

# reshaping
(df.set_index(['Data1', 'Data2']) # set those columns aside
   .stack()                       # columns to rows
   .groupby(level=[0,1,2])        # group by all
   .apply(','.join)               # join duplicates
   .unstack()                     # A/B/C back to columns
)

output:

                 A            B            C
Data1 Data2                                 
abc   def    text1  text2,text3  text4,text5
asd   fgh    text2  text4,text3  text5,text1
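
If you'd rather have Data1/Data2 back as ordinary columns instead of an index, appending .reset_index() to the chain should do it (a small addition, not part of the original answer):

(df.set_index(['Data1', 'Data2'])
   .stack()
   .groupby(level=[0,1,2])
   .apply(','.join)
   .unstack()
   .reset_index()               # Data1/Data2 become regular columns again
)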

Upvotes: 2

ChrisOram

Reputation: 1434

Taking inspiration from this SO thread:

import pandas as pd

df = pd.read_csv(r'./example.csv', delimiter=';')

def sjoin(x):
    # join a row's non-null values into a single ';'-separated string
    return ';'.join(x[x.notnull()].astype(str))

# group columns by their base name (dropping the '.1', '.2', ... suffix)
# and join each group's values row-wise
df = df.groupby(lambda col: col.split('.')[0], axis=1).apply(lambda x: x.apply(sjoin, axis=1))

Which results in:

       A            B            C Data1 Data2
0  text1  text2;text3  text4;text5   abc   def
1  text2  text4;text3  text5;text1   asd   fgh
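
Note that sjoin joins with ';'; for the comma-separated strings asked for in the question, the separator can simply be swapped:

def sjoin(x):
    # same helper, but comma-separated as requested in the question
    return ','.join(x[x.notnull()].astype(str))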

Upvotes: 0
