Reputation: 542
When using `pd.read_csv('myfile.csv', delimiter=';')` on a CSV that has duplicated column names, pandas mangles the duplicates by appending `.1`, `.2`, …, `.#` (where # is the occurrence number of the duplicated column).
My example csv looks like this:
Data1 | Data2 | A | B | B | C | C |
---|---|---|---|---|---|---|
abc | NaN | text1 | text2 | text3 | text4 | text5 |
def | 456 | text2 | text4 | text3 | text5 | text1 |
Data1;Data2;A;B;B;C;C
abc;;text1;text2;text3;text4;text5
def;456;text2;text4;text3;text5;text1
After importing into a DataFrame, the duplicated columns get mangled:
This output is expected.
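For reference, a minimal reproduction of the mangled import (reading the CSV above from a string instead of a file; pandas de-duplicates repeated headers by appending `.1`, `.2`, …):

```python
import io

import pandas as pd

csv_text = """Data1;Data2;A;B;B;C;C
abc;;text1;text2;text3;text4;text5
def;456;text2;text4;text3;text5;text1"""

# the second occurrence of B and C gets a .1 suffix
df = pd.read_csv(io.StringIO(csv_text), delimiter=';')
print(list(df.columns))
# → ['Data1', 'Data2', 'A', 'B', 'B.1', 'C', 'C.1']
```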
But I wish to combine these duplicated columns and join their row values as comma-separated strings.
So the desired output would look like this (the order of columns is not important):
Data1 | Data2 | A | B | C |
---|---|---|---|---|
abc | NaN | text1 | text2,text3 | text4,text5 |
def | 456 | text2 | text4,text3 | text5,text1 |
How can I achieve that with pandas in python?
I found the following question when searching for the problem:
Concatenate cells into a string with separator pandas python
But I don't know how to apply the answer from that question to only those columns which are mangled.
Upvotes: 0
Views: 132
Reputation: 11232
Here's another solution, grouping the mangled columns with a regular expression:
# (.*?): capture the original column name at the beginning of the string
# potentially followed by a dot and at least one digit
shared_groupname = r"(.*?)(?:\.\d+)?$"
Let's see that in action:
>>> df.columns.str.extract(shared_groupname)
0
0 Data1
1 Data2
2 A
3 B
4 B
5 C
6 C
Then groupby this original column name, and apply the join:
grouped = df.groupby(df.columns.str.extract(shared_groupname, expand=False), axis=1)
# drop NaN per row, so a missing value in one row doesn't drop the whole row
res = grouped.apply(lambda x: x.apply(lambda row: ', '.join(row.dropna().astype(str)), axis=1))
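Put together, here is a self-contained sketch of this approach on the question's CSV. Two assumptions: `,` is used as the separator to match the desired output, and the grouping is done on the transposed frame rather than with `groupby(axis=1)`, which is deprecated in recent pandas; `dtype=str` keeps `456` from being parsed as the float `456.0`:

```python
import io

import pandas as pd

csv_text = """Data1;Data2;A;B;B;C;C
abc;;text1;text2;text3;text4;text5
def;456;text2;text4;text3;text5;text1"""

# dtype=str keeps all values as text; empty fields still become NaN
df = pd.read_csv(io.StringIO(csv_text), delimiter=';', dtype=str)

# (.*?): capture the original column name, potentially followed
# by a dot and at least one digit
shared_groupname = r"(.*?)(?:\.\d+)?$"
groups = df.columns.str.extract(shared_groupname, expand=False)

# transpose, group the rows (former columns) by their de-mangled
# name, join the non-null values, and transpose back
res = df.T.groupby(groups).agg(lambda col: ','.join(col.dropna())).T
print(res)
```

Note that the missing `Data2` value for the first row comes back as an empty string rather than NaN, since it is the join of zero values.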
Upvotes: 0
Reputation: 261900
You can rename your columns, stack, join, unstack:
df = pd.read_csv('filename.csv', sep=';')
# remove the '.x' in columns
df.columns = df.columns.map(lambda x: x.split('.')[0])
# reshaping
(df.set_index(['Data1', 'Data2']) # set those columns aside
.stack() # columns to rows
.groupby(level=[0,1,2]) # group by all
.apply(','.join) # join duplicates
.unstack() # A/B/C back to columns
)
output:
A B C
Data1 Data2
abc def text1 text2,text3 text4,text5
asd fgh text2 text4,text3 text5,text1
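As a runnable sketch: the CSV below is reconstructed from the answer's printed output (note it uses `abc;def` / `asd;fgh` in the first two columns, not the question's exact data, which has a missing `Data2` value):

```python
import io

import pandas as pd

csv_text = """Data1;Data2;A;B;B;C;C
abc;def;text1;text2;text3;text4;text5
asd;fgh;text2;text4;text3;text5;text1"""

df = pd.read_csv(io.StringIO(csv_text), sep=';')

# remove the '.x' suffix pandas added to duplicated columns
df.columns = df.columns.map(lambda x: x.split('.')[0])

res = (df.set_index(['Data1', 'Data2'])  # set those columns aside
         .stack()                        # columns to rows
         .groupby(level=[0, 1, 2])       # group by all index levels
         .apply(','.join)                # join duplicates
         .unstack())                     # A/B/C back to columns
print(res)
```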
Upvotes: 2
Reputation: 1434
Taking inspiration from this SO thread:
import pandas as pd
df = pd.read_csv(r'./example.csv', delimiter=';')
def sjoin(x):
    return ';'.join(x[x.notnull()].astype(str))
df = df.groupby(lambda col: col.split('.')[0], axis=1).apply(lambda x: x.apply(sjoin, axis=1))
Which results in:
A B C Data1 Data2
0 text1 text2;text3 text4;text5 abc def
1 text2 text4;text3 text5;text1 asd fgh
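The same idea as a self-contained sketch (the CSV is reconstructed from the printed output above; grouping via a transpose replaces `groupby(axis=1)`, which is deprecated in recent pandas but gives the same column-wise grouping):

```python
import io

import pandas as pd

csv_text = """Data1;Data2;A;B;B;C;C
abc;def;text1;text2;text3;text4;text5
asd;fgh;text2;text4;text3;text5;text1"""

df = pd.read_csv(io.StringIO(csv_text), delimiter=';')

def sjoin(x):
    # join the non-null values of a Series with ';'
    return ';'.join(x[x.notnull()].astype(str))

# transpose, group the rows (former columns) by their
# de-mangled name, aggregate with sjoin, transpose back
res = df.T.groupby(lambda label: label.split('.')[0]).agg(sjoin).T
print(res)
```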
Upvotes: 0