Francisco Cortes

Reputation: 1226

How to speed up currency conversion while getting historical exchange rates

I need some help thinking through this:

I have a dataset with 61K records of services. Each service gets renewed on a specific date, each service also has a cost and that cost amount is billed in one of 10 different currencies.

What I need to do for each service record is convert the service cost to CAD using the exchange rate for the date the service was renewed.

When I do this on a small sample dataset with 6 services it takes about 3 seconds, which implies that on the 61k record dataset it could take over 8 hours, which is way too long (I think I could do it faster in Excel or Google Sheets, which I don't want to do).

Is there a better way or approach to do this with pandas/python in Google Colab so it doesn't take that long?

thank you in advance

# setup (in Google Colab)
!pip install forex-python
import pandas as pd
from forex_python.converter import CurrencyRates

#sample dataset/df

dummy_data = {
        'siteid': ['11', '12', '13', '41', '42','51'],
        'userid': [0,0,0,0,0,0],
        'domain': ['A',  'B',  'C',  'E',  'F', 'G'],
        'currency':['MXN',  'CAD',  'USD',  'USD',  'AUD', 'HKD'],
        'servicecost': [2.5,  3.3,  1.3,  2.5,  2.5, 2.3],
        'date': ['2022-02-04',  '2022-03-05',  '2022-01-03',  '2021-04-06',  '2022-12-05', '2022-11-01']
        }
df = pd.DataFrame(dummy_data, columns = ['siteid', 'userid', 'domain','currency','servicecost','date'])

#ensure date is in the proper datatype
df['date'] = pd.to_datetime(df['date'],errors='coerce')

#go through df row by row, do the conversion and populate a new series
def convertServiceCostToCAD(currency,servicecost,date):
  return CurrencyRates().convert(currency, 'CAD', servicecost, date)
df['excrate']=list(map(convertServiceCostToCAD, df['currency'], df['servicecost'], df['date']))

Upvotes: 2

Views: 872

Answers (1)

EliasK93

Reputation: 3174

So if I understand this correctly, what this package provides is a fixed daily rate between two currencies (so one direction is the inverse of the other direction).

And what makes things so slow is very clearly the calls to the package's methods: for me, around ~4 seconds per call.

And you are always interested in the rate between some currency x and CAD.

The package has a method .get_rates() which seems to provide the same information used by the .convert() method, but for one currency against all others at once.

So what you can do is:

  1. Collect all unique dates in the DataFrame
  2. Call .get_rates() for each of those dates and save the result
  3. Use the results plus your amounts to calculate the required column

E.g. as follows:

import pandas as pd
from forex_python.converter import CurrencyRates
from tqdm import tqdm  # use 'pip install tqdm' before


df = pd.DataFrame({
    'siteid': ['11', '12', '13', '41', '42', '51'],
    'userid': [0, 0, 0, 0, 0, 0],
    'domain': ['A', 'B', 'C', 'E', 'F', 'G'],
    'currency': ['MXN', 'CAD', 'USD', 'USD', 'AUD', 'HKD'],
    'servicecost': [2.5, 3.3, 1.3, 2.5, 2.5, 2.3],
    'date': ['2022-02-04', '2022-03-05', '2022-01-03', '2021-04-06', '2022-12-05', '2022-11-01']
})

# get rates for all unique dates, added tqdm progress bar to see progress
rates_dict = {date: CurrencyRates().get_rates('CAD', date_obj=pd.to_datetime(date, errors='coerce'))
              for date in tqdm(df['date'].unique())}

# now use these rates to set the cost to servicecost * 1/(CAD-to-currency_x rate),
# except when the currency is CAD or servicecost is 0 - in those cases just keep servicecost
df['excrate'] = df.apply(lambda row: 1.0/rates_dict[row['date']][row['currency']]*row['servicecost'] if row['currency']!='CAD' and row['servicecost'] != 0 else row['servicecost'], axis=1)

print(df)
>   siteid  userid domain currency  servicecost        date   excrate
  0     11       0      A      MXN          2.5  2022-02-04  0.154553
  1     12       0      B      CAD          3.3  2022-03-05  3.300000
  2     13       0      C      USD          1.3  2022-01-03  1.670334
  3     41       0      E      USD          2.5  2021-04-06  3.140874
  4     42       0      F      AUD          2.5  2022-12-05  2.219252
  5     51       0      G      HKD          2.3  2022-11-01  0.380628

How much this speeds things up depends heavily on how many different dates there are in your data. But since you said the original DataFrame has ~61k rows, I assume a large number of dates occur multiple times. This code should take roughly ~4 seconds * the number of unique dates in your DataFrame to run.
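If the per-row `.apply` over 61k rows ever becomes a bottleneck too, step 3 can also be vectorized: look each row's rate up once, then do the arithmetic on whole columns. A minimal sketch, with hard-coded stand-in numbers in place of the cached `.get_rates('CAD', ...)` results (illustrative values, not real historical rates):

```python
import pandas as pd

df = pd.DataFrame({
    'currency': ['MXN', 'CAD', 'USD'],
    'servicecost': [2.5, 3.3, 1.3],
    'date': ['2022-02-04', '2022-03-05', '2022-01-03'],
})

# hard-coded stand-in for the cached .get_rates('CAD', date) results
rates_dict = {
    '2022-02-04': {'MXN': 16.176, 'USD': 0.787},
    '2022-03-05': {'MXN': 16.450, 'USD': 0.790},
    '2022-01-03': {'MXN': 16.200, 'USD': 0.778},
}

# per-row rate lookup; CAD itself is absent from the dicts, so it becomes NaN
rate = pd.Series([rates_dict[d].get(c) for d, c in zip(df['date'], df['currency'])],
                 index=df.index, dtype=float)

# one vectorized division; CAD rows (NaN rate) fall back to the original amount
df['excrate'] = (df['servicecost'] / rate).fillna(df['servicecost'])
```

The lookup itself is still Python-level, but the division and fallback run as single column operations instead of one lambda call per row.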

Upvotes: 2
