Reputation: 7265
I have multi-index mapping rules. Here are the rules:
Type A: Chicken, Beef, Goat
Type B: Fish, Shrimp
Type C: Chicken, Pork
And here is my dataframe, let's say this is a df, on which I want to do the multi-index mapping:
id Menu
1 Fried Chicken
2 Shrimp Chips
3 Pork with Cheese
4 Fish Spaghetti
5 Goat Sate
6 Beef Soup
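(For reference, a minimal sketch of how this frame could be constructed; whether id ends up as a plain column or the index does not change the join below.)
import pandas as pd

# Example data from the table above; a reconstruction for reproducibility, not the original code
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'Menu': ['Fried Chicken', 'Shrimp Chips', 'Pork with Cheese',
             'Fish Spaghetti', 'Goat Sate', 'Beef Soup']
})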
With pandas, you can use pd.MultiIndex and pd.Series.str.get_dummies. So the code will be:
import pandas as pd

m = {
    'Type A': ['Chicken', 'Beef', 'Goat'],
    'Type B': ['Fish', 'Shrimp'],
    'Type C': ['Chicken', 'Pork']
}

# Build a MultiIndex of (type, keyword) pairs
mux = pd.MultiIndex.from_tuples(
    [(k, v) for k, values in m.items() for v in values])

df = df.join(
    df.Menu.str.get_dummies(sep=' ')
      .reindex(columns=mux, level=1)
      .max(axis=1, level=0)
)
The output would be like this
id  Menu              Type A  Type B  Type C
1   Fried Chicken     1       0       1
2   Shrimp Chips      0       1       0
3   Pork with Cheese  0       0       1
4   Fish Spaghetti    0       1       0
5   Goat Sate         1       0       0
6   Beef Soup         1       0       0
How can I do this on a PySpark dataframe?
Upvotes: 1
Views: 1798
Reputation: 2477
There is no standard multi-indexing in Spark.
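(For reproducibility, a minimal sketch of the example data as a Spark DataFrame; the SparkSession setup here is an assumption, not part of the original answer.)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 'Fried Chicken'), (2, 'Shrimp Chips'), (3, 'Pork with Cheese'),
     (4, 'Fish Spaghetti'), (5, 'Goat Sate'), (6, 'Beef Soup')],
    ['id', 'Menu'])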
What you can do is check whether your string contains any of your elements:
from pyspark.sql import functions as F

index_dict = {
    'Type A': ['Chicken', 'Beef', 'Goat'],
    'Type B': ['Fish', 'Shrimp'],
    'Type C': ['Chicken', 'Pork']
}

# For each type, OR together one LIKE condition per keyword
for col_name, values in index_dict.items():
    col = F.col('Menu').like('%' + values[0] + '%')
    for value in values[1:]:
        col2 = F.col('Menu').like('%' + value + '%')
        col = col | col2
    df = df.withColumn(col_name, col)
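Note that like returns booleans rather than the 0/1 flags of the pandas output; if you need integers, one possible follow-up (an addition, not part of the original answer) is to cast each new column:
# Optional: cast the boolean flags to 0/1 to mirror the pandas output
for col_name in index_dict:
    df = df.withColumn(col_name, F.col(col_name).cast('int'))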
I think it could be faster not to scan the whole string when you know the separator. You could therefore use:
df = df.withColumn('tmp', F.split(df['Menu'], ' '))
then replace F.col('Menu').like('%'...'%') with F.array_contains(F.col('tmp'), ...), and finish with df = df.drop('tmp').
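Putting that suggestion together, a sketch of the split/array_contains variant (same index_dict as above; the loop structure is mine, not the original answer's):
# Split the menu text once, then test membership per keyword
df = df.withColumn('tmp', F.split(df['Menu'], ' '))
for col_name, values in index_dict.items():
    col = F.array_contains(F.col('tmp'), values[0])
    for value in values[1:]:
        col = col | F.array_contains(F.col('tmp'), value)
    df = df.withColumn(col_name, col)
df = df.drop('tmp')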
Upvotes: 1