Reputation: 1455
I am using a rather large dataset of ~37 million data points that are hierarchically indexed into three categories country, productcode, year. The country variable (which is the countryname) is rather messy data consisting of items such as: 'Austral' which represents 'Australia'. I have built a simple guess_country() that matches letters to words, and returns a best guess and confidence interval from a known list of country_names. Given the length of the data and the nature of hierarchy it is very inefficient to use .map() to the Series: country. [The guess_country function takes ~2ms / request]
My question is: Is there a more efficient .map() which takes the Series and performs map on only unique values? (Given there are a LOT of repeated countrynames)
Upvotes: 0
Views: 3320
Reputation: 97291
Call guess_country()
on unique country names, and make a country_map
Series object with the original name as the index, converted name as the value. Then you can use country_map[df.country]
to do the conversion.
import pandas as pd
c = ["abc","abc","ade","ade","ccc","bdc","bxy","ccc","ccx","ccb","ccx"]
v = range(len(c))
df = pd.DataFrame({"country":c, "data":v})
def guess_country(c):
return c[0]
uc = df.country.unique()
country_map = pd.Series(list(map(guess_country, uc)), index=uc)
df["country_id"] = country_map[df.country].values
print(df)
Upvotes: 0
Reputation: 1455
On Solution is to make use of the Hierarchical Indexing in DataFrame!
data = data.set_index(keys=['COUNTRY', 'PRODUCTCODE', 'YEAR'])
data.index.levels[0] = pd.Index(data.index.levels[0].map(lambda x: guess_country(x, country_names)[0]))
This works well ... by replacing the data.index.levels[0] -> when COUNTRY is level 0 in the index, replacement then which propagates through the data model.
Upvotes: 0
Reputation: 251383
There isn't, but if you want to only apply to unique values, just do that yourself. Get mySeries.unique()
, then use your function to pre-calculate the mapped alternatives for those unique values and create a dictionary with the resulting mappings. Then use pandas map
with the dictionary. This should be about as fast as you can expect.
Upvotes: 3