Perform preprocessing operations from pandas on Spark dataframe

Question

I have a rather large CSV so I am using AWS EMR to read the data into a Spark dataframe to perform some operations. I have a pandas function that does some simple preprocessing:

def clean_census_data(df):
    """
    This function cleans the dataframe and drops columns that contain 70% NaN values
    """
    # Replace None or 0 with np.nan
    df = df.replace('None', np.nan)
    # Replace weird numbers
    df = df.replace(-666666666.0, np.nan)
    
    # Drop columns that contain 70% NaN or 0 values
    df = df.loc[:, df.isnull().mean() < .7]
    
    
    return df

I want to apply this function onto a Spark dataframe, but the functions are not the same. I am not familiar with Spark and performing these rather simple operations in pandas is not obvious to me how to perform the same operations in Spark. I know I can convert a Spark dataframe into pandas, but that does not seem very efficient.

KnutKiesel · Accepted Answer

First answer, so please be kind. This function should work with pyspark dataframes instead of pandas dataframes, and should give you similar results:

def clean_census_data(df):
    """
    This function cleans the dataframe and drops columns that contain 70% NaN values
    """
    # Replace None or 0 with np.nan
    df = df.replace('None', None)

    # Replace weird numbers
    df = df.replace(-666666666.0, None)

    # Drop columns that contain 70% NaN or 0 values
    selection_dict = df.select([(count(when(isnan(c) | col(c).isNull() | (col(c).cast('int') == 0), c))/count(c) > .7).alias(c) for c in df.columns]).first().asDict()
    columns_to_remove = [name for name, is_selected in selection_dict.items() if is_selected]
    df = df.drop(*columns_to_remove)

    return df

Attention: The resulting dataframe contains None instead of np.nan.

Perform preprocessing operations from pandas on Spark dataframe

Answers (2)

Related Questions