rahul desai

Reputation: 59

for loop optimization for dataframe

Based on the row values of my original dataframe, I need to set values in the corresponding rows of another dataframe. The code below works, but its execution time is very high.

I tried several forms of loops and functions (iterrows, iteritems, apply), but none of them helped.

Here's my code:

%%timeit
for value in tqdm(range(len(data['DPS_NUM']))):
    for col_nm in ts_col:
        temp = data[col_nm][value]
        if temp != '':
            data2[temp][value] = 1

Original dataframe:

col1 col2 col3 col4
123  foo  bar  zoo
456  bar  foo
789  zoo  zoo

Expected dataframe:

col1 foo bar zoo
123   1   1   1
456   1   1
789           1

My code works, but it is slow and I need to optimize it.

Upvotes: 0

Views: 79

Answers (1)

jezrael

Reputation: 862406

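Use get_dummies, then take the max across columns that share the same name:

# If the first column is the index
# (the level argument of max was removed in pandas 2.0, so this form needs an older version)
df = pd.get_dummies(df, prefix='', prefix_sep='').max(axis=1, level=0)
print(df)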
      bar  foo  zoo
col1               
123     1    1    1
456     1    1    0
789     0    0    1


# If the first column is not the index, set it first
#df = pd.get_dummies(df.set_index('col1'), prefix='', prefix_sep='').max(axis=1, level=0)
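
On pandas 2.0 and later, where max no longer accepts level, the same idea can be written with a transpose and groupby. The following is only a sketch under assumptions: the sample frame is rebuilt from the question's table, blank cells are assumed to be empty strings (as the question's `temp != ''` check suggests), and the names `values`, `dummies` and `result` are illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question; blanks are assumed to be empty strings
df = pd.DataFrame({
    "col1": [123, 456, 789],
    "col2": ["foo", "bar", "zoo"],
    "col3": ["bar", "foo", "zoo"],
    "col4": ["zoo", "", ""],
})

values = df.set_index("col1")
# Turn empty strings into NaN so get_dummies does not create a column for them
values = values.mask(values == "")

# One indicator column per (source column, value) pair; names repeat across source columns
dummies = pd.get_dummies(values, prefix="", prefix_sep="")

# Collapse the repeated column names by taking the max within each name
result = dummies.T.groupby(level=0).max().T.astype(int)
print(result)
```

This should reproduce the 0/1 table shown above. The final astype(int) is there because recent pandas versions return boolean indicator columns from get_dummies.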

Upvotes: 1
