Reputation:
I have an array named "extractColumns" and a dataframe named "raw_data". I want to create a new dataframe based on the array and the dataframe. If a column from the array is not found in the dataframe when doing the select, that column should still appear, filled with NULL.
How can I do this?
Upvotes: 0
Views: 77
Reputation: 2436
raw_data = spark.createDataFrame(
    [
        ('1', 20),
        ('2', 34),
        ('3', 12)
    ], ['foo', 'bar'])
# columns I want to extract from raw_data
extractColumns = ['refsnp_id', 'chr_name', 'chrom_start', 'chrom_end', 'version']
import pyspark.sql.functions as F
new_raw_data = raw_data
# add every column from extractColumns that is missing in raw_data as a NULL column
for col in extractColumns:
    if col not in raw_data.columns:
        new_raw_data = new_raw_data.withColumn(col, F.lit(None))

new_raw_data.show()
+---+---+---------+--------+-----------+---------+-------+
|foo|bar|refsnp_id|chr_name|chrom_start|chrom_end|version|
+---+---+---------+--------+-----------+---------+-------+
| 1| 20| null| null| null| null| null|
| 2| 34| null| null| null| null| null|
| 3| 12| null| null| null| null| null|
+---+---+---------+--------+-----------+---------+-------+
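If you want the result to contain only the columns in extractColumns (dropping foo and bar), a single select with a list comprehension also works. This is a minimal sketch; casting the NULL columns to string is just an assumption about the desired type:

import pyspark.sql.functions as F

# keep existing columns as-is, replace missing ones with a typed NULL literal
new_raw_data = raw_data.select([
    F.col(c) if c in raw_data.columns else F.lit(None).cast('string').alias(c)
    for c in extractColumns
])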
Upvotes: 1