llamallama101092030
llamallama101092030

Reputation: 81

Vectorizing a For Loop with a User Defined Function

I have a dataframe which, for simplicity, looks like so:

import pandas as pd
!pip install dfply
from dfply import *


stocks = {'Ticker': ['AA', 'AA', 'BB', 'BB'], 'Date': ['2019-01-01', '2019-01-03', '2012-01-01', '2012-01-03']}
stocks = pd.DataFrame.from_dict(stocks))
print(stocks)


Ticker    Date
AA        2019-01-01
AA        2019-01-03
BB        2012-01-01
BB        2012-01-03

One issue I have is that the dataset I have created has some stocks which became public at different dates than others, so I am trying to create a column which specifies the first date in which the ticker symbol appears. To do so, I use the following for loop:

def beginDate(stock):     #returns the first day that stock is on market in dataset
    return(min(
    (stocks >> mask(stocks['Ticker'] == stock))['Date']
    ))

begin_market = []
for i in range(len(stocks)):
    beginning_date = beginDate(stocks['Ticker'][i])
    begin_market.append(begin_market)

I aim to have results that would look like so:

Ticker    Date         begin_market
AA        2019-01-01   2019-01-01
AA        2019-01-03   2019-01-01
BB        2012-01-01   2012-01-01
BB        2012-01-03   2012-01-01

The issue I am running into is that I have a large dataframe, and so the for loop is taking forever to complete. I am reading up on vectorizing the for loop in order to gain computational speed, but I am confused in how exactly it works besides with if/else type statements. Would anyone know how you might vectorized this in order to boost its speed? If so, I would greatly appreciate it!

Upvotes: 2

Views: 60

Answers (1)

Andrej Kesely
Andrej Kesely

Reputation: 195408

If I understand you right, you want to groupby by "Ticker" and then select earliest date from that group:

stocks = {'Ticker': ['AA', 'AA', 'BB', 'BB'], 'Date': ['2019-01-01', '2019-01-03', '2012-01-01', '2012-01-03']}
stocks = pd.DataFrame.from_dict(stocks)
stocks['Date'] = pd.to_datetime(stocks['Date'])

stocks['First Appearance'] = stocks.groupby('Ticker')['Date'].transform('min')
print(stocks)

Prints:

  Ticker       Date First Appearance
0     AA 2019-01-01       2019-01-01
1     AA 2019-01-03       2019-01-01
2     BB 2012-01-01       2012-01-01
3     BB 2012-01-03       2012-01-01

Upvotes: 1

Related Questions