user16815188

How can I extract columns from a dataframe according to an array, filling any column it does not find with null values? - pyspark

I have an array named "extractColumns" and a dataframe named "raw_data". I want to create a new dataframe from the dataframe using the columns listed in the array. If the "select" does not find a column in the dataframe, that column should still appear, filled with NULL values.

How can I do this?

Upvotes: 0

Views: 77

Answers (1)

Luiz Viola

Reputation: 2436

raw_data = spark.createDataFrame(
    [
        ('1', 20),
        ('2', 34),
        ('3', 12)
    ], ['foo', 'bar'])



# columns I want to extract from raw_data
extractColumns = ['refsnp_id', 'chr_name', 'chrom_start', 'chrom_end', 'version']


import pyspark.sql.functions as F

# start from raw_data and add every column from extractColumns that is missing as a NULL literal
new_raw_data = raw_data

for col in extractColumns:
    if col not in raw_data.columns:
        new_raw_data = new_raw_data.withColumn(col, F.lit(None))

new_raw_data.show()
+---+---+---------+--------+-----------+---------+-------+
|foo|bar|refsnp_id|chr_name|chrom_start|chrom_end|version|
+---+---+---------+--------+-----------+---------+-------+
|  1| 20|     null|    null|       null|     null|   null|
|  2| 34|     null|    null|       null|     null|   null|
|  3| 12|     null|    null|       null|     null|   null|
+---+---+---------+--------+-----------+---------+-------+
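
If the result should contain only the columns listed in extractColumns (dropping foo and bar), a select over the array can follow the loop. Casting the NULL literal to a concrete type such as string also avoids ending up with NullType columns in the schema. A minimal sketch along those lines, assuming the same raw_data and extractColumns as above:

import pyspark.sql.functions as F

new_raw_data = raw_data
for col in extractColumns:
    if col not in raw_data.columns:
        # cast so the new column has a usable type instead of NullType
        new_raw_data = new_raw_data.withColumn(col, F.lit(None).cast('string'))

# keep only the columns named in the array, in the array's order
new_raw_data = new_raw_data.select(extractColumns)
new_raw_data.show()

select accepts a list of column names, so passing extractColumns directly also reorders the output to match the array.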

Upvotes: 1
