Vyas

Reputation: 57

process columns in pandas dataframe

I have a dataframe df:

  Col1 Col2 Col3
0   a1  NaN  NaN
1   a2   b1  NaN
2   a3   b3   c1
3   a4  NaN   c2
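For reproducibility, the sample frame above can be built like this (a sketch; the column names and values are taken from the table shown):

```python
import numpy as np
import pandas as pd

# rebuild the sample frame from the question
df = pd.DataFrame({
    "Col1": ["a1", "a2", "a3", "a4"],
    "Col2": [np.nan, "b1", "b3", np.nan],
    "Col3": [np.nan, np.nan, "c1", "c2"],
})
print(df)
```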

I have tried:

new_df = '[' + df + ']'

new_df['Col4']=new_df[new_df.columns[0:]].apply(lambda x:','.join(x.dropna().astype(str)),axis =1)

df_final = pd.concat([df, new_df['col4']], axis =1)

I am getting this:

(screenshot of the wrong dataframe)

I am looking for a robust solution that produces something like this:

(screenshot of the expected resultant frame)

I know there is no direct way to do this. The data frame will eventually have at least 20k rows, hence the question to fellow stack-people.

Thanks.

Let me know if you have any more questions and I can edit the question to add details.

Upvotes: 0

Views: 2723

Answers (3)

jezrael

Reputation: 862681

You can wrap every value except the first non-missing one in [], using the helper index i from enumerate:

def f(x):
    # keep only non-missing values, then bracket all but the first
    gen = (y for y in x if pd.notna(y))
    return ','.join(y if i == 0 else '['+y+']' for i, y in enumerate(gen))

# one-liner alternative (slower, see the timings below):
#f = lambda x: ','.join(y if i == 0 else '['+y+']' for i, y in enumerate(x.dropna()))
df['col4'] = df.apply(f, axis=1)
print(df)

  Col1 Col2 Col3 Col4          col4
0   a1  NaN   d8  NaN       a1,[d8]
1   a2   b1   d3  NaN  a2,[b1],[d3]
2  NaN   b3   c1  NaN       b3,[c1]
3   a4  NaN   c2  NaN       a4,[c2]
4  NaN  NaN   c6   d5       c6,[d5]

Performance test:

#test for 25k rows
df = pd.concat([df] * 5000, ignore_index=True)

f1 = lambda x: ','.join(y if i == 0 else '['+y+']' for i, y in enumerate(x.dropna()))
%timeit df.apply(f1, axis=1)
3.62 s ± 21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.apply(f, axis=1)
475 ms ± 3.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
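As a quick sanity check, applying f to the sample data from the question (rebuilt here for a self-contained run) gives:

```python
import numpy as np
import pandas as pd

# sample frame from the question
df = pd.DataFrame({
    "Col1": ["a1", "a2", "a3", "a4"],
    "Col2": [np.nan, "b1", "b3", np.nan],
    "Col3": [np.nan, np.nan, "c1", "c2"],
})

def f(x):
    # drop missing values, bracket everything after the first survivor
    gen = (y for y in x if pd.notna(y))
    return ','.join(y if i == 0 else '['+y+']' for i, y in enumerate(gen))

df['Col4'] = df.apply(f, axis=1)
print(df['Col4'].tolist())
# ['a1', 'a2,[b1]', 'a3,[b3],[c1]', 'a4,[c2]']
```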
    

Upvotes: 3

Enrico Gandini

Reputation: 1015

new_col = []
for idx, row in df.iterrows():

    val1 = row["Col1"]  # assumed never missing
    val2 = row["Col2"]
    val3 = row["Col3"]

    # bracket a value only if it is present
    new_val2 = f",[{val2}]" if pd.notna(val2) else ""
    new_val3 = f",[{val3}]" if pd.notna(val3) else ""

    val4 = f"{val1}{new_val2}{new_val3}"
    new_col.append(val4)

df["Col4"] = new_col

Maybe my answer is not the most "computationally efficient", but if your dataset is 20k rows, it will be fast enough! I think my answer is very easy to read, and it is also easy to adapt to different scenarios!
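A sketch of the same loop generalized to any number of columns (it assumes, like the original, that each row has at least one non-missing value):

```python
import numpy as np
import pandas as pd

# sample frame from the question
df = pd.DataFrame({
    "Col1": ["a1", "a2", "a3", "a4"],
    "Col2": [np.nan, "b1", "b3", np.nan],
    "Col3": [np.nan, np.nan, "c1", "c2"],
})

new_col = []
for _, row in df.iterrows():
    vals = [v for v in row if pd.notna(v)]            # drop NaNs first
    parts = [vals[0]] + [f"[{v}]" for v in vals[1:]]  # bracket all but the first
    new_col.append(",".join(parts))

df["Col4"] = new_col
print(df["Col4"].tolist())
# ['a1', 'a2,[b1]', 'a3,[b3],[c1]', 'a4,[c2]']
```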

Upvotes: 1

Lukas Schmid

Reputation: 1960

I'm not sure what your use case is, but here you go:

df['Col4'] = df.apply(
    lambda row: ", ".join(
        val if val[0] == 'a' else "[" + val + "]"
        for val in row if not pd.isna(val)
    ),
    axis=1,
)

It joins each row's values together with ", ".join, keeping only those that are not pd.isna, and wraps everything that does not begin with a in brackets.

Whatever you want to do with it, there is probably a better solution, though.
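On the question's sample data, the snippet above yields the following (note the space after each comma, and that the bracketing depends on the value starting with a):

```python
import numpy as np
import pandas as pd

# sample frame from the question
df = pd.DataFrame({
    "Col1": ["a1", "a2", "a3", "a4"],
    "Col2": [np.nan, "b1", "b3", np.nan],
    "Col3": [np.nan, np.nan, "c1", "c2"],
})

df['Col4'] = df.apply(
    lambda row: ", ".join(
        val if val[0] == 'a' else "[" + val + "]"
        for val in row if not pd.isna(val)
    ),
    axis=1,
)
print(df['Col4'].tolist())
# ['a1', 'a2, [b1]', 'a3, [b3], [c1]', 'a4, [c2]']
```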

Upvotes: 2
