sanguineturtle
sanguineturtle

Reputation: 1455

Pandas: More Efficient .map() function or method?

I am using a rather large dataset of ~37 million data points that are hierarchically indexed into three categories country, productcode, year. The country variable (which is the countryname) is rather messy data consisting of items such as: 'Austral' which represents 'Australia'. I have built a simple guess_country() that matches letters to words, and returns a best guess and confidence interval from a known list of country_names. Given the length of the data and the nature of hierarchy it is very inefficient to use .map() to the Series: country. [The guess_country function takes ~2ms / request]

My question is: Is there a more efficient .map() which takes the Series and performs map on only unique values? (Given there are a LOT of repeated countrynames)

Upvotes: 0

Views: 3320

Answers (3)

HYRY
HYRY

Reputation: 97291

Call guess_country() on unique country names, and make a country_map Series object with the original name as the index, converted name as the value. Then you can use country_map[df.country] to do the conversion.

import pandas as pd
c = ["abc","abc","ade","ade","ccc","bdc","bxy","ccc","ccx","ccb","ccx"]
v = range(len(c))
df = pd.DataFrame({"country":c, "data":v})

def guess_country(c):
    return c[0]

uc = df.country.unique()
country_map = pd.Series(list(map(guess_country, uc)), index=uc)
df["country_id"] = country_map[df.country].values
print(df)

Upvotes: 0

sanguineturtle
sanguineturtle

Reputation: 1455

On Solution is to make use of the Hierarchical Indexing in DataFrame!

data = data.set_index(keys=['COUNTRY', 'PRODUCTCODE', 'YEAR'])
data.index.levels[0] = pd.Index(data.index.levels[0].map(lambda x: guess_country(x, country_names)[0])) 

This works well ... by replacing the data.index.levels[0] -> when COUNTRY is level 0 in the index, replacement then which propagates through the data model.

Upvotes: 0

BrenBarn
BrenBarn

Reputation: 251383

There isn't, but if you want to only apply to unique values, just do that yourself. Get mySeries.unique(), then use your function to pre-calculate the mapped alternatives for those unique values and create a dictionary with the resulting mappings. Then use pandas map with the dictionary. This should be about as fast as you can expect.

Upvotes: 3

Related Questions