Nabih Bawazir

Reputation: 7265

Multi-index categorization and encoding in PySpark

I have multi-index mapping rules; here are the rules:

Type A: Chicken, Beef, Goat
Type B: Fish, Shrimp
Type C: Chicken, Pork

Here's my dataframe (let's say it's called df), and I want to do multi-index mapping:

   id   Menu
   1    Fried Chicken
   2    Shrimp Chips
   3    Pork with Cheese
   4    Fish Spaghetti
   5    Goat Sate
   6    Beef Soup

With pandas, this can be done using pd.MultiIndex and pd.Series.str.get_dummies, so the code would be:

import pandas as pd

m = {
    'Type A': ['Chicken', 'Beef', 'Goat'],
    'Type B': ['Fish', 'Shrimp'],
    'Type C': ['Chicken', 'Pork']
}

# Build a (type, ingredient) MultiIndex from the mapping rules.
mux = pd.MultiIndex.from_tuples(
    [(k, v) for k, values in m.items() for v in values])

# One dummy column per word in Menu, reindexed against the MultiIndex
# (duplicating ingredients shared by several types), then collapsed to
# one flag per type. Note: the level argument of max() was removed in
# pandas 2.0, so this needs an older pandas.
df = df.join(
    df.Menu.str.get_dummies(sep=' ')
      .reindex(columns=mux, level=1)
      .max(axis=1, level=0)
)

The output would look like this:

   id   Menu                 Type A   Type B   Type C
   1    Fried Chicken        1        0        1
   2    Shrimp Chips         0        1        0
   3    Pork with Cheese     0        0        1
   4    Fish Spaghetti       0        1        0
   5    Goat Sate            1        0        0
   6    Beef Soup            1        0        0

How can I do this on a PySpark dataframe?

Upvotes: 1

Views: 1798

Answers (1)

Pierre Gourseaud

Reputation: 2477

There is no standard multi-indexing in Spark.

What you can do is check whether your string contains any of your elements:

from pyspark.sql import functions as F

index_dict = {
    'Type A': ['Chicken', 'Beef', 'Goat'],
    'Type B': ['Fish', 'Shrimp'],
    'Type C': ['Chicken', 'Pork']
}

# Build one boolean column per type: OR together a LIKE
# substring test for each ingredient in that type's list.
for col_name, values in index_dict.items():
    col = F.col('Menu').like('%' + values[0] + '%')
    for value in values[1:]:
        col2 = F.col('Menu').like('%' + value + '%')
        col = col | col2
    df = df.withColumn(col_name, col)
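
A small addition not in the original answer: these columns come out as booleans, while the pandas output uses 0/1. Assuming you want identical flags, one way is to cast them afterwards:

# Optional: cast each boolean flag column to 0/1 integers to match
# the pandas output (my addition, not part of the original answer).
for col_name in index_dict:
    df = df.withColumn(col_name, F.col(col_name).cast('int'))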

I think it could be faster if you don't check the whole string when you know the separator. You could therefore use:

df = df.withColumn('tmp', F.split(df['Menu'], ' '))

Replace F.col('Menu').like('%'...'%') with F.array_contains(F.col('tmp'), ...)

And finish with df = df.drop('tmp')
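
Putting those three steps together, here is a minimal sketch (reusing index_dict from above; the reduce over | is my phrasing, not the original answer's):

from functools import reduce
from pyspark.sql import functions as F

# Split Menu once on the known separator so each check is an exact
# token match rather than a substring scan.
df = df.withColumn('tmp', F.split(df['Menu'], ' '))

for col_name, values in index_dict.items():
    # OR together one array_contains test per ingredient.
    cond = reduce(lambda a, b: a | b,
                  [F.array_contains(F.col('tmp'), v) for v in values])
    df = df.withColumn(col_name, cond)

df = df.drop('tmp')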

Upvotes: 1
