Reputation: 65
I have a DF with about 50 columns. 5 of them contain strings that I want to combine into a single column, separating the strings with commas but also keeping the spaces within each of the strings. Moreover, some values are missing (NaN). The last requirement would be to remove duplicates if they exist.
So I have something like this in my DF:
symptom_1 | symptom_2 | symptom_3 | symptom_4 | symptom 5 |
---|---|---|---|---|
muscle pain | super headache | diarrhea | Sore throat | Fatigue |
super rash | ulcera | super headache | ||
diarrhea | super diarrhea | |||
something awful | something awful |
And I need something like this:
symptom_1 | symptom_2 | symptom_3 | symptom_4 | symptom 5 | all_symptoms |
---|---|---|---|---|---|
muscle pain | super headache | diarrhea | Sore throat | Fatigue | muscle pain, super headache, diarrhea, Sore throat, Fatigue |
super rash | ulcera | super headache | super rash, ulcera, headache | ||
diarrhea | super diarrhea | diarrhea, super diarrhea | |||
something awful | something awful | something awful |
I wrote the following function and while it merges all the columns it does not respect the spaces within the original strings, which is a must.
def merge_columns_into_one(DataFrame, columns_to_combine, new_col_name, drop_originals = False):
DataFrame[new_col_name] = DataFrame[columns_to_combine].apply(lambda x: ','.join(x.dropna().astype(str)),axis=1)
return DataFrame
Thanks in advance for your help!
edit: when I'm writing this question the second markdown table appears just fine in the preview, but as soon as I post it the table loses it's format. I hope you get the idea of what I'm trying to do. Else I'd appreciate your feedback on how to fix the MD table.
Upvotes: 1
Views: 1035
Reputation: 24314
Just use fillna()
, apply()
and rstrip()
method:
df['all_symptoms']=df1.fillna('').apply(pd.unique,1).apply(','.join).str.rstrip(',')
Now if you print df
you will get your desired output:
symptom_1 | symptom_2 | symptom_3 | symptom_4 | symptom 5 | all_symptoms |
---|---|---|---|---|---|
muscle pain | super headache | diarrhea | Sore throat | Fatigue | muscle pain, super headache, diarrhea, Sore throat, Fatigue |
super rash | ulcera | super headache | super rash, ulcera, headache | ||
diarrhea | super diarrhea | diarrhea, super diarrhea | |||
something awful | something awful | something awful |
Upvotes: 2
Reputation: 490
Hope below code might help you
import pandas as pd
data = {
'symptom_1' : ["muscle pain", "super rash", "diarrhea", "something awful"],
'symptom_2' : ["super headache", "ulcera", "super diarrhea", "something awful"],
'symptom_3' :["diarrhea", "super headache"],
'symptom_4' : ["Sore throat"],
'symptom_5' :["Fatigue"]
}
df = pd.DataFrame (data, columns = ['symptom_1','symptom_2'])
df1 = pd.DataFrame (data, columns = ['symptom_3'])
df2 = pd.DataFrame (data, columns = ['symptom_4','symptom_5'])
new = pd.concat([df, df1, df2], axis=1)
new['symptom_6'] = new['symptom_1']+","+new['symptom_2']+","+new['symptom_3'].fillna('')+","+new['symptom_4'].fillna('')+","+new['symptom_5'].fillna('')
print(new)
Upvotes: 0
Reputation: 1994
First, apply lambda
function, then use set()
to collect all the symptoms ignoring repetitions, finally, use simple list comprehension and then join the list elements with a comma using join()
.
df['all_symptoms'] = df.apply(lambda row: ",".join([x for x in set(row) if x is not None]),1)
This will return all the symptoms separated by a comma.
Upvotes: 0
Reputation: 28644
You can use pandas str.cat, with some massaging:
(df
.fillna("")
.assign(all_symptoms = lambda df: df.iloc[:, 0]
.str.cat(df.iloc[:, 1:],
sep=',')
.str.strip(",")
.str.split(",")
.map(pd.unique)
.str.join(","))
)
symptom_1 symptom_2 symptom_3 symptom_4 symptom 5 all_symptoms
0 muscle pain super headache diarrhea Sore throat Fatigue muscle pain,super headache,diarrhea,Sore throa...
1 super rash ulcera super headache super rash,ulcera,super headache
2 diarrhea super diarrhea diarrhea,super diarrhea
3 something awful something awful something awful
Alternatively, you could run the string operations within plain python, which is usually faster than pandas string methods (they are wrappers around python's string methods anyways):
df = df.fillna("")
_, strings = zip(*df.items())
strings = zip(*strings)
strings = map(pd.unique, strings)
strings = map(",".join, strings)
df['all_symptoms'] = [entry.strip(",") for entry in strings]
Upvotes: 2
Reputation: 215
Here is an example that you can get an idea how to work around it:
import pandas as pd
df1 = pd.DataFrame(
{
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
},
index=[0, 1, 2, 3],)
df1["merged"] = df1["A"]+"," + df1["B"]+","+df1["C"]+","+df1["D"]
Upvotes: 0