Luis
Luis

Reputation: 65

How to combine DataFrame columns of strings into a single column?

I have a DF with about 50 columns. 5 of them contain strings that I want to combine into a single column, separating the strings with commas but also keeping the spaces within each of the strings. Moreover, some values are missing (NaN). The last requirement would be to remove duplicates if they exist.

So I have something like this in my DF:

symptom_1 symptom_2 symptom_3 symptom_4 symptom 5
muscle pain super headache diarrhea Sore throat Fatigue
super rash ulcera super headache
diarrhea super diarrhea
something awful something awful

And I need something like this:

symptom_1 symptom_2 symptom_3 symptom_4 symptom 5 all_symptoms
muscle pain super headache diarrhea Sore throat Fatigue muscle pain, super headache, diarrhea, Sore throat, Fatigue
super rash ulcera super headache super rash, ulcera, headache
diarrhea super diarrhea diarrhea, super diarrhea
something awful something awful something awful

I wrote the following function and while it merges all the columns it does not respect the spaces within the original strings, which is a must.

def merge_columns_into_one(DataFrame, columns_to_combine, new_col_name, drop_originals = False):
    DataFrame[new_col_name] = DataFrame[columns_to_combine].apply(lambda x: ','.join(x.dropna().astype(str)),axis=1)
    return DataFrame

Thanks in advance for your help!

edit: when I'm writing this question the second markdown table appears just fine in the preview, but as soon as I post it the table loses it's format. I hope you get the idea of what I'm trying to do. Else I'd appreciate your feedback on how to fix the MD table.

Upvotes: 1

Views: 1035

Answers (5)

Anurag Dabas
Anurag Dabas

Reputation: 24314

Just use fillna() , apply() and rstrip() method:

df['all_symptoms']=df1.fillna('').apply(pd.unique,1).apply(','.join).str.rstrip(',')

Now if you print df you will get your desired output:

symptom_1 symptom_2 symptom_3 symptom_4 symptom 5 all_symptoms
muscle pain super headache diarrhea Sore throat Fatigue muscle pain, super headache, diarrhea, Sore throat, Fatigue
super rash ulcera super headache super rash, ulcera, headache
diarrhea super diarrhea diarrhea, super diarrhea
something awful something awful something awful

Upvotes: 2

Deepak
Deepak

Reputation: 490

Hope below code might help you

import pandas as pd

data = {
        'symptom_1' : ["muscle pain", "super rash", "diarrhea", "something awful"], 
        'symptom_2' : ["super headache", "ulcera", "super diarrhea", "something awful"],
        'symptom_3' :["diarrhea", "super headache"],
        'symptom_4' : ["Sore throat"],
        'symptom_5' :["Fatigue"]
        }
df = pd.DataFrame (data, columns = ['symptom_1','symptom_2']) 
df1 = pd.DataFrame (data, columns = ['symptom_3'])
df2 = pd.DataFrame (data, columns = ['symptom_4','symptom_5'])

new = pd.concat([df, df1, df2], axis=1) 

new['symptom_6'] = new['symptom_1']+","+new['symptom_2']+","+new['symptom_3'].fillna('')+","+new['symptom_4'].fillna('')+","+new['symptom_5'].fillna('')

print(new)

Upvotes: 0

Ransaka Ravihara
Ransaka Ravihara

Reputation: 1994

First, apply lambda function, then use set() to collect all the symptoms ignoring repetitions, finally, use simple list comprehension and then join the list elements with a comma using join().

df['all_symptoms'] = df.apply(lambda row: ",".join([x for x in set(row) if x is not None]),1)

This will return all the symptoms separated by a comma.

Upvotes: 0

sammywemmy
sammywemmy

Reputation: 28644

You can use pandas str.cat, with some massaging:

(df
 .fillna("")
 .assign(all_symptoms = lambda df: df.iloc[:, 0]
                                    .str.cat(df.iloc[:, 1:], 
                                             sep=',')
                                    .str.strip(",")
                                    .str.split(",")
                                    .map(pd.unique)
                                    .str.join(","))
         )

         symptom_1        symptom_2       symptom_3    symptom_4 symptom 5                                       all_symptoms
0      muscle pain   super headache        diarrhea  Sore throat   Fatigue  muscle pain,super headache,diarrhea,Sore throa...
1       super rash           ulcera  super headache                                          super rash,ulcera,super headache
2         diarrhea   super diarrhea                                                                   diarrhea,super diarrhea
3  something awful  something awful                                                                           something awful

Alternatively, you could run the string operations within plain python, which is usually faster than pandas string methods (they are wrappers around python's string methods anyways):

df = df.fillna("")

_, strings = zip(*df.items())

strings = zip(*strings)

strings = map(pd.unique, strings)

strings = map(",".join, strings)

df['all_symptoms'] = [entry.strip(",") for entry in strings]

Upvotes: 2

Hadi Rohani
Hadi Rohani

Reputation: 215

Here is an example that you can get an idea how to work around it:

import pandas as pd

df1 = pd.DataFrame(
     {
       "A": ["A0", "A1", "A2", "A3"],
       "B": ["B0", "B1", "B2", "B3"],
       "C": ["C0", "C1", "C2", "C3"],
       "D": ["D0", "D1", "D2", "D3"],
     },
index=[0, 1, 2, 3],)

df1["merged"] = df1["A"]+"," + df1["B"]+","+df1["C"]+","+df1["D"]

Upvotes: 0

Related Questions