Papun Mohanty
Papun Mohanty

Reputation: 1

PySpark: Is there any possibility of creating dynamic number of DataFrame based on the Non Null values

I have a PySpark DataFrame:

name age username password
joe 34 Null Null
alice 21 Null Null
Null Null user1 pass1
Null Null user2 pass2

from the above DataFrame, I want to create 2 DataFrame like this by finding the Null value columns, somehow:

name age
joe 34
alica 21
username password
user1 pass1
user2 pass2

Is there any way to achieve this?

Sample JSON files under "source" directory:

{
 "name": "joe",
 "age": 31
}

{
 "name": "alica",
 "age": 21
}

{
 "username": "user1",
 "password": "pass1"
}

{
 "username": "user2",
 "password": "pass2
}

Code:

conf = SparkConf().setMaster("local").setAppName("Test")
spark = SparkSession \
        .builder \
        .config(conf=conf) \
        .getOrCreate()

json_data = spark.read.json("source") 

Upvotes: 0

Views: 144

Answers (2)

blackbishop
blackbishop

Reputation: 32640

You can simply use select + dropna() :

df1 = df.select("name", "age").dropna()

df1.show()
#+-----+---+
#| name|age|
#+-----+---+
#|  joe| 34|
#|alice| 21|
#+-----+---+

df2 = df.select("username", "password").dropna()

df2.show()
#+--------+--------+
#|username|password|
#+--------+--------+
#|   user1|   pass1|
#|   user2|   pass2|
#+--------+--------+

Upvotes: 0

Alex Ortner
Alex Ortner

Reputation: 1238

If you have always the same fixed amount of columns I would just cover all cases with

import pyspark.sql.functions as f


df2=df.where(f.col("name").isNotNull() & f.col("age").isNotNull() & f.col("username").isNotNull() & f.col("password").isNull())

df3=df.where(f.col("name").isNotNull() & f.col("age").isNotNull() & f.col("username").isNull() & f.col("password").isNull())

df3=df.where(f.col("name").isNotNull() & f.col("age").isNull() & f.col("username").isNull() & f.col("password").isNull())

df4=df.where(f.col("name").isNull() & f.col("age").isNotNull() & f.col("username").isNotNull() & f.col("password").isNotNull())

df5=df.where(f.col("name").isNull() & f.col("age").isNull() & f.col("username").isNotNull() & f.col("password").isNotNull())

... and so on

Upvotes: 0

Related Questions