Generate UUID column with a UDF and then split into two dataframes with common UUID column

Question

I have a dataframe containing info about people with columns first_name, last_name, and email_addresses, where email_addresses is a comma separated list of email addresses. The data looks something like this:

first_name	last_name	email_addresses
Jane	Doe	jane.doe@example.com,doe.jane@example.com
John	Smith	john.smith@example.com,smith.john@example.com

I'm trying to split this into two dataframes by first adding a person_id column populated with UUIDs using a UDF, and then creating a new dataframe by doing a split and explode on the email_addresses column.

My code looks something like this:

from uuid import uuid4
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

uuid_udf = F.udf(lambda: str(uuid4()), StringType()).asNondeterministic()

df = ingest_data_from_csv() # Imagine this function populates a dataframe from a CSV

df_with_id = df.withColumn("person_id", uuid_udf())

person_df = df_with_id.drop("email_addresses")

email_df = (
    df_with_id.drop("first_name", "last_name")
    .withColumn("email_address", F.explode(F.split(F.col("email_addresses"), ",")))
    .drop("email_addresses")

I thought this would give me two data frames with a foreign key type of relationship that look something like this:

person_id	first_name	last_name
00000000-0000-0000-0000-000000000001	Jane	Doe
00000000-0000-0000-0000-000000000002	John	Smith

person_id	email_address
00000000-0000-0000-0000-000000000001	jane.doe@example.com
00000000-0000-0000-0000-000000000001	doe.jane@example.com
00000000-0000-0000-0000-000000000002	john.smith@example.com
00000000-0000-0000-0000-000000000002	smith.john@example.com

but what I end up with is different values in all the person_id columns, looking something like this:

person_id	first_name	last_name
00000000-0000-0000-0000-000000000001	Jane	Doe
00000000-0000-0000-0000-000000000002	John	Smith

person_id	email_address
00000000-0000-0000-0000-000000000003	jane.doe@example.com
00000000-0000-0000-0000-000000000004	doe.jane@example.com
00000000-0000-0000-0000-000000000005	john.smith@example.com
00000000-0000-0000-0000-000000000006	smith.john@example.com

Is there a way to implement this "foreign key" type of relationship with pyspark?

mck · Accepted Answer

cache the dataframe so that the results will not change when you call the Python functions again. Also you can call the uuid spark SQL function with F.expr('uuid()'):

df_with_id = df.withColumn("id", F.expr('uuid()'))
# or if you want to use your udf,
# df_with_id = df.withColumn("id", uuid_udf())

df_with_id.cache()   ### IMPORTANT!

person_df = df_with_id.drop("email_addresses")
email_df = (
    df_with_id.drop("first_name", "last_name")
    .withColumn("email_address", F.explode(F.split(F.col("email_addresses"), ",")))
    .drop("email_addresses")
)

email_df.show()
+--------------------+--------------------+
|                  id|       email_address|
+--------------------+--------------------+
|389fb85d-338f-469...|jane.doe@example.com|
|389fb85d-338f-469...|doe.jane@example.com|
|6600abe9-8087-420...|john.smith@exampl...|
|6600abe9-8087-420...|smith.john@exampl...|
+--------------------+--------------------+

person_df.show()
+----------+---------+--------------------+
|first_name|last_name|                  id|
+----------+---------+--------------------+
|      Jane|      Doe|389fb85d-338f-469...|
|      John|    Smith|6600abe9-8087-420...|
+----------+---------+--------------------+

Generate UUID column with a UDF and then split into two dataframes with common UUID column

Answers (2)

Related Questions

first_name	last_name	email_addresses
Jane	Doe	[email protected],[email protected]
John	Smith	[email protected],[email protected]

person_id	email_address
00000000-0000-0000-0000-000000000001	[email protected]
00000000-0000-0000-0000-000000000001	[email protected]
00000000-0000-0000-0000-000000000002	[email protected]
00000000-0000-0000-0000-000000000002	[email protected]