Reputation: 1
I have a PySpark DataFrame:
| name  | age  | username | password |
|-------|------|----------|----------|
| joe   | 34   | Null     | Null     |
| alice | 21   | Null     | Null     |
| Null  | Null | user1    | pass1    |
| Null  | Null | user2    | pass2    |
From the above DataFrame, I want to create two DataFrames like the ones below, by somehow detecting which columns are Null:
| name  | age |
|-------|-----|
| joe   | 34  |
| alice | 21  |

| username | password |
|----------|----------|
| user1    | pass1    |
| user2    | pass2    |
Is there any way to achieve this?
Sample JSON files under the "source" directory:
{
  "name": "joe",
  "age": 34
}
{
  "name": "alice",
  "age": 21
}
{
  "username": "user1",
  "password": "pass1"
}
{
  "username": "user2",
  "password": "pass2"
}
Code:
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setMaster("local").setAppName("Test")
spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .getOrCreate()

# every JSON file under "source" is read into one DataFrame;
# fields missing from a file show up as Null columns
json_data = spark.read.json("source")
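For reference, reading the mixed files this way yields a single DataFrame whose schema is the union of both JSON shapes, with the fields a file doesn't have coming back as Null. A quick sketch of checking that (the exact inferred types and field ordering may differ):
json_data.printSchema()
# root
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)
#  |-- password: string (nullable = true)
#  |-- username: string (nullable = true)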
Upvotes: 0
Views: 144
Reputation: 32640
You can simply use select + dropna():
df1 = df.select("name", "age").dropna()
df1.show()
#+-----+---+
#| name|age|
#+-----+---+
#| joe| 34|
#|alice| 21|
#+-----+---+
df2 = df.select("username", "password").dropna()
df2.show()
#+--------+--------+
#|username|password|
#+--------+--------+
#| user1| pass1|
#| user2| pass2|
#+--------+--------+
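Note that dropna() defaults to how="any", meaning a row is dropped as soon as any of the selected columns is null. With this data that is equivalent to how="all", because each source row fills either both or neither of the two columns, but the how argument matters if rows can be partially filled. A minimal sketch against the same df (the variable names are only for illustration):
df1_any = df.select("name", "age").dropna(how="any")  # default: drop rows missing either column
df1_all = df.select("name", "age").dropna(how="all")  # drop rows only if both columns are null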
Upvotes: 0
Reputation: 1238
If you always have the same fixed set of columns, I would just cover all the cases explicitly:
import pyspark.sql.functions as f
df2 = df.where(f.col("name").isNotNull() & f.col("age").isNotNull() & f.col("username").isNotNull() & f.col("password").isNull())
df3 = df.where(f.col("name").isNotNull() & f.col("age").isNotNull() & f.col("username").isNull() & f.col("password").isNull())
df4 = df.where(f.col("name").isNotNull() & f.col("age").isNull() & f.col("username").isNull() & f.col("password").isNull())
df5 = df.where(f.col("name").isNull() & f.col("age").isNotNull() & f.col("username").isNotNull() & f.col("password").isNotNull())
df6 = df.where(f.col("name").isNull() & f.col("age").isNull() & f.col("username").isNotNull() & f.col("password").isNotNull())
... and so on
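To get exactly the two DataFrames from the question, each filter can then be combined with a select of the matching columns. A minimal sketch assuming the original four-column df (the df_people / df_creds names are only for illustration):
import pyspark.sql.functions as f

# rows carrying name/age data, reduced to those two columns
df_people = df.where(f.col("name").isNotNull() & f.col("age").isNotNull()).select("name", "age")

# rows carrying username/password data, reduced to those two columns
df_creds = df.where(f.col("username").isNotNull() & f.col("password").isNotNull()).select("username", "password")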
Upvotes: 0