How can I insert a PySpark dataframe into a database with a snowflake schema?

Question

I'm computing a dataframe with PySpark. How can I append this dataframe to my database if the database has a snowflake schema?

How can I specify how to split my dataframe so that my CSV-like data fits into multiple joined tables?

My question is not specific to PySpark; the same question could be asked about pandas.

Chris · Accepted Answer

To append a dataframe extracted from a CSV to a database consisting of a snowflake schema:

1. Extract the data from the snowflake schema.
2. Extract the new data from the external datasource.
3. Combine the two data sets.
4. Transform the combination to a set of dimension and fact tables to match the snowflake schema.
5. Load the transformed dataframes to the database, overwriting the existing data.

For example, for a dataframe with the following schema, extracted from an external source:

from pyspark.sql.types import StringType, StructField, StructType

StructType([StructField('customer_name', StringType()),
            StructField('campaign_name', StringType())])

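from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import monotonically_increasing_id


def entrypoint(spark: SparkSession) -> None: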
  extracted_customer_campaigns = extract_from_external_source(spark)

  existing_customers_dim, existing_campaigns_dim, existing_facts = (
    extract_from_snowflake(spark))

  combined_customer_campaigns = combine(existing_campaigns_dim,
                                        existing_customers_dim,
                                        existing_facts,
                                        extracted_customer_campaigns)

  new_campaigns_dim, new_customers_dim, new_facts = transform_to_snowflake(
    combined_customer_campaigns)

  load_snowflake(new_campaigns_dim, new_customers_dim, new_facts)


def combine(campaigns_dimension: DataFrame,
            customers_dimension: DataFrame,
            facts: DataFrame,
            extracted_customer_campaigns: DataFrame) -> DataFrame:
  existing_customer_campaigns = facts.join(
    customers_dimension,
    on=['customer_id']).join(
    campaigns_dimension, on=['campaign_id']).select('customer_name',
                                                    'campaign_name')

  combined_customer_campaigns = extracted_customer_campaigns.union(
    existing_customer_campaigns).distinct()

  return combined_customer_campaigns


# Returns (campaigns_dim, customers_dim, facts).
def transform_to_snowflake(
    customer_campaigns: DataFrame) -> tuple[DataFrame, DataFrame, DataFrame]:
  customers_dim = customer_campaigns.select(
    'customer_name').distinct().withColumn(
    'customer_id', monotonically_increasing_id())

  campaigns_dim = customer_campaigns.select(
    'campaign_name').distinct().withColumn(
    'campaign_id', monotonically_increasing_id())

  facts = (
    customer_campaigns.join(customers_dim,
                            on=['customer_name']).join(
      campaigns_dim, on=[
        'campaign_name']).select('customer_id', 'campaign_id'))

  return campaigns_dim, customers_dim, facts

This is a simple functional approach. It may be possible to optimise it by writing only deltas, rather than regenerating the snowflake keys for every ETL batch.
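
As a rough illustration of the delta idea, here is a minimal plain-Python sketch (no Spark needed, since the logic is not library-specific). It assumes a dimension can be represented as a name-to-key mapping; `extend_dimension` and the sample data are hypothetical names, not part of the code above. In PySpark, the unseen names could be found with a `left_anti` join against the existing dimension.

```python
from typing import Dict, Iterable


def extend_dimension(existing: Dict[str, int],
                     names: Iterable[str]) -> Dict[str, int]:
    """Return only the NEW dimension rows (name -> surrogate key).

    Existing keys are left untouched; new keys continue after the
    largest key already in use.
    """
    next_id = max(existing.values(), default=-1) + 1
    delta: Dict[str, int] = {}
    # Sort so that key assignment is reproducible between runs.
    for name in sorted(set(names)):
        if name not in existing:
            delta[name] = next_id
            next_id += 1
    return delta


# Hypothetical sample data: two customers already in the dimension.
existing_customers = {"alice": 0, "bob": 1}
delta = extend_dimension(existing_customers,
                         ["bob", "carol", "dave", "carol"])
print(delta)  # {'carol': 2, 'dave': 3}
```

Only the rows in `delta` would then need to be appended to the dimension table, and the fact rows can be keyed against the union of the existing and new mappings.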

In addition, if a separate external CSV were supplied containing records for deletion, this could be similarly extracted, then subtracted from the combined dataframe before transformation, in order to remove those existing records.
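
The subtraction step can be sketched in plain Python with sets of `(customer_name, campaign_name)` tuples. `remove_deleted` and the sample rows are hypothetical. In PySpark, `DataFrame.subtract` plays the same role on dataframes with matching columns.

```python
from typing import Set, Tuple

Row = Tuple[str, str]  # (customer_name, campaign_name)


def remove_deleted(combined: Set[Row], deletions: Set[Row]) -> Set[Row]:
    """Drop every combined row that also appears in the deletion set."""
    return combined - deletions


# Hypothetical sample data.
combined = {("alice", "spring"), ("bob", "spring"), ("bob", "summer")}
deletions = {("bob", "spring"), ("zed", "winter")}  # unknown rows are ignored
remaining = remove_deleted(combined, deletions)
```

Here `remaining` keeps the `alice`/`spring` and `bob`/`summer` rows, and the transformation to dimension and fact tables then runs on that result.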

Finally, the question referred only to appending to a table. If merging/upserting were required, additional steps would need to be added manually, because Spark's built-in JDBC writer offers save modes such as append and overwrite but no merge/upsert.
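
One common way to add such a step is to have the job append to a staging table and then let the database merge it. The sketch below shows only the database side, using the standard-library `sqlite3` module as a stand-in for the real database. The table names, the `UNIQUE` constraint, and `merge_staging` are assumptions for illustration, and the merge statement would differ per database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers_dim (
        customer_id   INTEGER PRIMARY KEY,
        customer_name TEXT UNIQUE NOT NULL
    );
    CREATE TABLE customers_staging (customer_name TEXT NOT NULL);
    INSERT INTO customers_dim (customer_name) VALUES ('alice'), ('bob');
""")

# Stand-in for the Spark job appending its output to the staging table.
conn.executemany("INSERT INTO customers_staging VALUES (?)",
                 [("bob",), ("carol",)])


def merge_staging(connection: sqlite3.Connection) -> None:
    """Insert staged names that are not yet in the dimension, then clear staging."""
    with connection:  # one transaction: commit on success, roll back on error
        connection.execute("""
            INSERT OR IGNORE INTO customers_dim (customer_name)
            SELECT DISTINCT customer_name FROM customers_staging
        """)
        connection.execute("DELETE FROM customers_staging")


merge_staging(conn)
rows = conn.execute(
    "SELECT customer_id, customer_name FROM customers_dim "
    "ORDER BY customer_id").fetchall()
print(rows)  # [(1, 'alice'), (2, 'bob'), (3, 'carol')]
```

Existing rows keep their keys and only unseen names receive new ones, so fact rows that reference the old keys stay valid.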
