Reputation: 473

I wrote a script that builds a new dataframe based on a condition. How can I make it more efficient?

Here's my code:

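import numpy as np
import pandas as pd

df = ...      # some DataFrame with 'date' and 'color' columns
coin = ...    # some string
color = 'red'

events = pd.DataFrame()
events['date'] = df.date

# Walk the rows one by one: 1 where the color matches, NaN otherwise.
data_list = []
for i in range(len(df)):
    if df.iloc[i].color == color:
        data_list.append(1)
    else:
        data_list.append(np.nan)

# Assign the collected flags (the original assigned an undefined name `l`).
events['event'] = data_list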

For every row of the original dataframe where color == 'red', the corresponding date in the new dataframe (events) should get 1, otherwise NaN.

I know this can probably be done in one line, but I'm not sure how.

Bonus question: after this step I set the date column as the index, something I couldn't do beforehand because my iloc loop over range(len(df)) didn't work with it:

events = pd.DataFrame()
events[coin] = data_list
# The fresh frame has no 'date' column, so take the dates from df.
events = events.set_index(df['date'].values)

data = pd.DataFrame()
data[coin] = df.close
data = data.set_index(df['date'].values)
data = {'close': data}

Upvotes: 0

Answers (2)

jezrael

Reputation: 863166

You need numpy.where:

df['event'] = np.where(df.color == 'red', 1, np.nan)

Sample:

df = pd.DataFrame({'color' : ['red', 'blue'],
                    'd'  : ['a', 'b']})

print (df)
  color  d
0   red  a
1  blue  b

df['event'] = np.where(df.color == 'red', 1, np.nan)
print (df)
  color  d  event
0   red  a    1.0
1  blue  b    NaN
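
Applied to the layout in the question, the same call can fill a separate events frame that is indexed by date from the start. This is only a sketch: the sample rows, the `close` column values and the `coin` string are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's df; the values are made up.
df = pd.DataFrame({'date': ['3/10/17', '4/10/17', '5/10/17'],
                   'color': ['red', 'blue', 'red'],
                   'close': [10.0, 11.0, 12.0]})
coin = 'some_coin'  # placeholder name, not from the question
color = 'red'

# 1 where the color matches, NaN otherwise, with the dates as the index.
events = pd.DataFrame({coin: np.where(df['color'] == color, 1, np.nan)},
                      index=df['date'].values)
print(events)
```

Passing the index to the constructor avoids a separate set_index step afterwards.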

Another solution:

df.loc[df.color == 'red', 'event'] = 1
print (df)
  color  d  event
0   red  a    1.0
1  blue  b    NaN
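
One difference between the two solutions may matter if the code runs more than once: the loc assignment only touches the matching rows, so values already in the column stay put, while np.where rebuilds the whole column. A small sketch of that behaviour, reusing the sample frame above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue'], 'd': ['a', 'b']})

# First run: 'event' does not exist yet, so unmatched rows become NaN.
df.loc[df.color == 'red', 'event'] = 1

# Second run with another color: the earlier 1 for 'red' is still there.
df.loc[df.color == 'blue', 'event'] = 1
after_loc = df['event'].tolist()
print(after_loc)  # both rows now hold 1.0

# np.where recomputes every row, so no stale values survive.
df['event'] = np.where(df.color == 'blue', 1, np.nan)
print(df)
```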

Performance of the two solutions is similar (timed on 200,000 rows):

df = pd.DataFrame({'color' : ['red', 'blue'],
                    'd'  : ['a', 'b']})
df = pd.concat([df]*100000).reset_index(drop=True)
print (df)

In [31]: %timeit df['event1'] = np.where(df.color == 'red', 1, np.nan)
10 loops, best of 3: 23.6 ms per loop

In [32]: %timeit df.loc[df.color == 'red', 'event'] = 1
10 loops, best of 3: 25.4 ms per loop
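
To compare against the row-by-row loop from the question outside IPython, something like the following could be used. It is a rough sketch with the standard timeit module; the helper names `with_loop` and `with_where` are made up here, and the frame is kept much smaller than above so the loop finishes quickly.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue'], 'd': ['a', 'b']})
df = pd.concat([df] * 1000).reset_index(drop=True)  # 2,000 rows

def with_loop():
    # Row-by-row version, as in the question.
    out = []
    for i in range(len(df)):
        out.append(1 if df.iloc[i]['color'] == 'red' else np.nan)
    return out

def with_where():
    # Vectorized version from this answer.
    return np.where(df.color == 'red', 1, np.nan)

# Both approaches should agree before their speed is compared.
assert np.array_equal(with_loop(), with_where(), equal_nan=True)

print('loop    :', timeit.timeit(with_loop, number=3))
print('np.where:', timeit.timeit(with_where, number=3))
```

Absolute numbers depend on the machine and the pandas version, so only the relative difference is meaningful.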

Upvotes: 1

Woody Pride

Reputation: 13965

There are lots of different ways to do this, e.g. build a Series using a list comprehension:

import pandas as pd
import numpy as np
df = pd.DataFrame({'color' : ['red', 'blue', 'red'],
                   'date'  : ['3/10/17', '4/10/17', '5/10/17']})

color_bools = pd.Series([1 if val == 'red' else np.nan for val in df['color']], 
                         index = df['date'].values)
color_bools

Out[18]:
3/10/17    1.0
4/10/17    NaN
5/10/17    1.0
dtype: float64
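
For comparison, the same Series can probably be built without a Python-level loop by starting from a Series of ones and masking it with Series.where. This is a sketch of an alternative, not part of the approach above; the names `ones` and `color_flags` are made up.

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'red'],
                   'date': ['3/10/17', '4/10/17', '5/10/17']})

# Start from a Series of ones on the date index, then blank out
# every position where the color is not 'red'.
ones = pd.Series(1.0, index=df['date'].values)
color_flags = ones.where((df['color'] == 'red').values)
print(color_flags)
```

The boolean mask is passed as a plain array so that it is applied by position rather than aligned on the date index.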

Upvotes: 1
