Dan

Reputation: 71

How to apply a schema from an existing DataFrame to another DataFrame with missing columns in PySpark

I have a JSON file with various levels of nested struct/array columns in one DataFrame, df_1. I have a smaller DataFrame, df_2, with fewer columns; some of its column names match column names in df_1, but it has none of the nested structure.

I want to apply the schema from df_1 to df_2 so that the two share the same schema, taking the existing columns in df_2 where possible and creating the columns/nested structure that exist in df_1 but not in df_2.

df_1

root
 |-- association_info: struct (nullable = true)
 |    |-- ancestry: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- doi: string (nullable = true)
 |    |-- gwas_catalog_id: string (nullable = true)
 |    |-- neg_log_pval: double (nullable = true)
 |    |-- study_id: string (nullable = true)
 |    |-- pubmed_id: string (nullable = true)
 |    |-- url: string (nullable = true)
 |-- gold_standard_info: struct (nullable = true)
 |    |-- evidence: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- class: string (nullable = true)
 |    |    |    |-- confidence: string (nullable = true)
 |    |    |    |-- curated_by: string (nullable = true)
 |    |    |    |-- description: string (nullable = true)
 |    |    |    |-- pubmed_id: string (nullable = true)
 |    |    |    |-- source: string (nullable = true)
 |    |-- gene_id: string (nullable = true)
 |    |-- highest_confidence: string (nullable = true)

df_2

root
 |-- study_id: string (nullable = true)
 |-- description: string (nullable = true)
 |-- gene_id: string (nullable = true)

The expected output would have the same schema as df_1, with any columns that don't exist in df_2 simply filled with null.

I have tried completely flattening the structure of df_1 in order to join the two DataFrames, but then I'm unsure how to change the result back into the original nested schema. All solutions I've attempted so far have been in PySpark. PySpark would be preferable for performance reasons, but if a solution requires converting to a pandas DataFrame, that's also feasible.
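For reference, a minimal sketch of the flattening I attempted (abridged to two leaf fields; the real select lists every leaf field in df_1's schema):

from pyspark.sql import functions as F

# Pull df_1's nested leaf fields up to top-level columns so the two
# DataFrames can be joined on the shared names.
df_1_flat = df_1.select(
    F.col('association_info.study_id').alias('study_id'),
    F.col('gold_standard_info.gene_id').alias('gene_id'),
    # ...one alias per remaining leaf field; turning this flat result
    # back into the original nested schema is the part I'm stuck on.
)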

Upvotes: 0

Views: 593

Answers (1)

M M Kamalraj

Reputation: 1

df_1.select('association_info.study_id',
            'gold_standard_info.evidence.description',
            'gold_standard_info.gene_id')

The above code will reach into df_1 and give you the requisite fields for df_2, and each field keeps its original type (note that description, sitting inside the evidence array, comes back as an array of strings).

Could you try the same?
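For the broader goal of making df_2 match df_1's full schema with nulls, here is a minimal sketch (not tested against the real data; field names and types are taken from the schemas in the question, and it assumes description belongs in a single-element evidence array):

from pyspark.sql import functions as F

# Rebuild df_1's two top-level structs by hand: take df_2's columns
# where they exist and use typed null literals everywhere else.
df_2_conformed = df_2.select(
    F.struct(
        F.lit(None).cast('array<string>').alias('ancestry'),
        F.lit(None).cast('string').alias('doi'),
        F.lit(None).cast('string').alias('gwas_catalog_id'),
        F.lit(None).cast('double').alias('neg_log_pval'),
        F.col('study_id'),
        F.lit(None).cast('string').alias('pubmed_id'),
        F.lit(None).cast('string').alias('url'),
    ).alias('association_info'),
    F.struct(
        # description lives inside df_1's evidence array of structs, so
        # wrap it in a one-element array with nulls for the siblings.
        F.array(F.struct(
            F.lit(None).cast('string').alias('class'),
            F.lit(None).cast('string').alias('confidence'),
            F.lit(None).cast('string').alias('curated_by'),
            F.col('description'),
            F.lit(None).cast('string').alias('pubmed_id'),
            F.lit(None).cast('string').alias('source'),
        )).alias('evidence'),
        F.col('gene_id'),
        F.lit(None).cast('string').alias('highest_confidence'),
    ).alias('gold_standard_info'),
)

Keeping the fields in the same order as df_1's schema matters if you later want to union the two DataFrames.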

Upvotes: 0
