Reputation: 65

How to combine DataFrame columns of strings into a single column?

I have a DF with about 50 columns. 5 of them contain strings that I want to combine into a single column, separating the strings with commas but also keeping the spaces within each of the strings. Moreover, some values are missing (NaN). The last requirement would be to remove duplicates if they exist.

So I have something like this in my DF:

symptom_1	symptom_2	symptom_3	symptom_4	symptom 5
muscle pain	super headache	diarrhea	Sore throat	Fatigue
super rash	ulcera	super headache
diarrhea	super diarrhea
something awful	something awful

And I need something like this:

symptom_1	symptom_2	symptom_3	symptom_4	symptom 5	all_symptoms
muscle pain	super headache	diarrhea	Sore throat	Fatigue	muscle pain, super headache, diarrhea, Sore throat, Fatigue
super rash	ulcera	super headache			super rash, ulcera, headache
diarrhea	super diarrhea				diarrhea, super diarrhea
something awful	something awful				something awful

I wrote the following function and while it merges all the columns it does not respect the spaces within the original strings, which is a must.

def merge_columns_into_one(DataFrame, columns_to_combine, new_col_name, drop_originals = False):
    DataFrame[new_col_name] = DataFrame[columns_to_combine].apply(lambda x: ','.join(x.dropna().astype(str)),axis=1)
    return DataFrame

Thanks in advance for your help!

edit: when I'm writing this question the second markdown table appears just fine in the preview, but as soon as I post it the table loses it's format. I hope you get the idea of what I'm trying to do. Else I'd appreciate your feedback on how to fix the MD table.

Upvotes: 1

Answers (5)

Anurag Dabas

Reputation: 24314

Just use fillna() , apply() and rstrip() method:

df['all_symptoms']=df1.fillna('').apply(pd.unique,1).apply(','.join).str.rstrip(',')

Now if you print df you will get your desired output:

symptom_1	symptom_2	symptom_3	symptom_4	symptom 5	all_symptoms
muscle pain	super headache	diarrhea	Sore throat	Fatigue	muscle pain, super headache, diarrhea, Sore throat, Fatigue
super rash	ulcera	super headache			super rash, ulcera, headache
diarrhea	super diarrhea				diarrhea, super diarrhea
something awful	something awful				something awful

Upvotes: 2

Deepak

Reputation: 490

Hope below code might help you

import pandas as pd

data = {
        'symptom_1' : ["muscle pain", "super rash", "diarrhea", "something awful"], 
        'symptom_2' : ["super headache", "ulcera", "super diarrhea", "something awful"],
        'symptom_3' :["diarrhea", "super headache"],
        'symptom_4' : ["Sore throat"],
        'symptom_5' :["Fatigue"]
        }
df = pd.DataFrame (data, columns = ['symptom_1','symptom_2']) 
df1 = pd.DataFrame (data, columns = ['symptom_3'])
df2 = pd.DataFrame (data, columns = ['symptom_4','symptom_5'])

new = pd.concat([df, df1, df2], axis=1) 

new['symptom_6'] = new['symptom_1']+","+new['symptom_2']+","+new['symptom_3'].fillna('')+","+new['symptom_4'].fillna('')+","+new['symptom_5'].fillna('')

print(new)

Upvotes: 0

Ransaka Ravihara

Reputation: 1994

First, apply lambda function, then use set() to collect all the symptoms ignoring repetitions, finally, use simple list comprehension and then join the list elements with a comma using join().

df['all_symptoms'] = df.apply(lambda row: ",".join([x for x in set(row) if x is not None]),1)

This will return all the symptoms separated by a comma.

Upvotes: 0

sammywemmy

Reputation: 28644

You can use pandas str.cat, with some massaging:

(df
 .fillna("")
 .assign(all_symptoms = lambda df: df.iloc[:, 0]
                                    .str.cat(df.iloc[:, 1:], 
                                             sep=',')
                                    .str.strip(",")
                                    .str.split(",")
                                    .map(pd.unique)
                                    .str.join(","))
         )

         symptom_1        symptom_2       symptom_3    symptom_4 symptom 5                                       all_symptoms
0      muscle pain   super headache        diarrhea  Sore throat   Fatigue  muscle pain,super headache,diarrhea,Sore throa...
1       super rash           ulcera  super headache                                          super rash,ulcera,super headache
2         diarrhea   super diarrhea                                                                   diarrhea,super diarrhea
3  something awful  something awful                                                                           something awful

Alternatively, you could run the string operations within plain python, which is usually faster than pandas string methods (they are wrappers around python's string methods anyways):

df = df.fillna("")

_, strings = zip(*df.items())

strings = zip(*strings)

strings = map(pd.unique, strings)

strings = map(",".join, strings)

df['all_symptoms'] = [entry.strip(",") for entry in strings]

Upvotes: 2

Hadi Rohani

Reputation: 215

Here is an example that you can get an idea how to work around it:

import pandas as pd

df1 = pd.DataFrame(
     {
       "A": ["A0", "A1", "A2", "A3"],
       "B": ["B0", "B1", "B2", "B3"],
       "C": ["C0", "C1", "C2", "C3"],
       "D": ["D0", "D1", "D2", "D3"],
     },
index=[0, 1, 2, 3],)

df1["merged"] = df1["A"]+"," + df1["B"]+","+df1["C"]+","+df1["D"]

Upvotes: 0

How to combine DataFrame columns of strings into a single column?

Answers (5)

Related Questions