HackXIt

Reputation: 542

Join columns which were mangled on import

When using pd.read_csv('myfile.csv', delimiter=';') on a CSV that contains duplicated column names, pandas mangles the duplicates by appending .1, .2, ..., .# (where # is the running number of the duplicate).

My example csv looks like this:

Data1;Data2;A;B;B;C;C
abc;;text1;text2;text3;text4;text5
def;456;text2;text4;text3;text5;text1

which, read as a table, is:

Data1  Data2  A      B      B      C      C
abc    NaN    text1  text2  text3  text4  text5
def    456    text2  text4  text3  text5  text1

After importing into a dataframe, the duplicated columns are mangled:

  Data1  Data2      A      B    B.1      C    C.1
0   abc    NaN  text1  text2  text3  text4  text5
1   def  456.0  text2  text4  text3  text5  text1

This output is expected.

But I wish to combine these duplicated columns, with their row values joined as comma-separated strings.

So the desired output would look like this (the order of the columns is not important):

Data1  Data2  A      B            C
abc    NaN    text1  text2,text3  text4,text5
def    456    text2  text4,text3  text5,text1

How can I achieve that with pandas in python?

I found the following question when searching for the problem:

Concatenate cells into a string with separator pandas python

But I don't know how to apply the answer from that question to only those columns which are mangled.
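
For reference, the mangled columns can be picked out by the numeric suffix pandas appends; a minimal sketch, assuming the imported dataframe is df:

import re

# columns that were mangled on import end in '.<number>'
mangled_cols = [c for c in df.columns if re.search(r'\.\d+$', c)]

# their original names, with the suffix stripped
base_names = {re.sub(r'\.\d+$', '', c) for c in mangled_cols}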

Upvotes: 0

Views: 132

Answers (3)

w-m

Reputation: 11232

Here's another solution, grouping the mangled columns with a regular expression:

# (.*?): capture the original column name at the beginning of the string
# potentially followed by a dot and at least one digit
shared_groupname = r"(.*?)(?:\.\d+)?$"

Let's see that in action:

>>> df.columns.str.extract(shared_groupname) 
       0
0  Data1
1  Data2
2      A
3      B
4      B
5      C
6      C

Then group by this original column name and apply the join:

grouped = df.groupby(df.columns.str.extract(shared_groupname, expand=False), axis=1)

res = grouped.apply(lambda x: x.dropna().astype(str).apply(', '.join, axis=1))
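
Note: recent pandas releases deprecate groupby(..., axis=1). If your version warns about it, grouping the transposed frame is an equivalent sketch (assuming df and shared_groupname from above):

shared = df.columns.str.extract(shared_groupname, expand=False)

res = (df.T
         .groupby(shared.values)                             # group the transposed rows by base column name
         .agg(lambda s: ', '.join(s.dropna().astype(str)))   # join the duplicated values per original row
         .T)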

Upvotes: 0

mozway

Reputation: 261900

You can rename your columns, stack, join, unstack:

df = pd.read_csv('filename.csv', sep=';')

# remove the '.x' in columns
df.columns = df.columns.map(lambda x: x.split('.')[0])

# reshaping
(df.set_index(['Data1', 'Data2']) # set those columns aside
   .stack()                       # columns to rows
   .groupby(level=[0,1,2])        # group by all
   .apply(','.join)               # join duplicates
   .unstack()                     # A/B/C back to columns
)

output:

                 A            B            C
Data1 Data2                                 
abc   def    text1  text2,text3  text4,text5
asd   fgh    text2  text4,text3  text5,text1
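
If you'd rather have Data1/Data2 back as ordinary columns instead of an index, appending .reset_index() to the chain should do it (a small addition, not part of the original answer):

(df.set_index(['Data1', 'Data2'])
   .stack()
   .groupby(level=[0,1,2])
   .apply(','.join)
   .unstack()
   .reset_index()               # Data1/Data2 become regular columns again
)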

Upvotes: 2

ChrisOram

Reputation: 1434

Taking inspiration from this SO thread:

import pandas as pd

df = pd.read_csv(r'./example.csv', delimiter=';')

def sjoin(x):
    # join a row's non-null values into a single ';'-separated string
    return ';'.join(x[x.notnull()].astype(str))

# group columns by their base name (dropping the '.1', '.2', ... suffix)
# and join each group's values row-wise
df = df.groupby(lambda col: col.split('.')[0], axis=1).apply(lambda x: x.apply(sjoin, axis=1))

Which results in:

       A            B            C Data1 Data2
0  text1  text2;text3  text4;text5   abc   def
1  text2  text4;text3  text5;text1   asd   fgh
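
Note that sjoin joins with ';'; for the comma-separated strings asked for in the question, the separator can simply be swapped:

def sjoin(x):
    # same helper, but comma-separated as requested in the question
    return ','.join(x[x.notnull()].astype(str))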

Upvotes: 0
