Reputation: 1
I have a PySpark DataFrame:
| name  | age  | username | password |
|-------|------|----------|----------|
| joe   | 34   | Null     | Null     |
| alice | 21   | Null     | Null     |
| Null  | Null | user1    | pass1    |
| Null  | Null | user2    | pass2    |
From the above DataFrame, I want to create two DataFrames like the ones below, by somehow detecting which columns are Null:
| name  | age |
|-------|-----|
| joe   | 34  |
| alice | 21  |

| username | password |
|----------|----------|
| user1    | pass1    |
| user2    | pass2    |
Is there any way to achieve this?
Sample JSON files under the "source" directory:
{
  "name": "joe",
  "age": 34
}
{
  "name": "alice",
  "age": 21
}
{
  "username": "user1",
  "password": "pass1"
}
{
  "username": "user2",
  "password": "pass2"
}
Code:
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setMaster("local").setAppName("Test")
spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .getOrCreate()

# every JSON file under "source" is read into one DataFrame;
# fields missing from a file show up as Null columns
json_data = spark.read.json("source")
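For reference, reading the mixed files this way yields a single DataFrame whose schema is the union of both JSON shapes, with the fields a file doesn't have coming back as Null. A quick sketch of checking that (the exact inferred types and field ordering may differ):
json_data.printSchema()
# root
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)
#  |-- password: string (nullable = true)
#  |-- username: string (nullable = true)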
Upvotes: 0
Views: 144
Reputation: 32640
You can simply use select + dropna():
df1 = df.select("name", "age").dropna()
df1.show()
#+-----+---+
#| name|age|
#+-----+---+
#| joe| 34|
#|alice| 21|
#+-----+---+
df2 = df.select("username", "password").dropna()
df2.show()
#+--------+--------+
#|username|password|
#+--------+--------+
#| user1| pass1|
#| user2| pass2|
#+--------+--------+
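Note that dropna() defaults to how="any", meaning a row is dropped as soon as any of the selected columns is null. With this data that is equivalent to how="all", because each source row fills either both or neither of the two columns, but the how argument matters if rows can be partially filled. A minimal sketch against the same df (the variable names are only for illustration):
df1_any = df.select("name", "age").dropna(how="any")  # default: drop rows missing either column
df1_all = df.select("name", "age").dropna(how="all")  # drop rows only if both columns are null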
Upvotes: 0
Reputation: 1238
If you always have the same fixed set of columns, I would just cover all the cases explicitly:
import pyspark.sql.functions as f
df2 = df.where(f.col("name").isNotNull() & f.col("age").isNotNull() & f.col("username").isNotNull() & f.col("password").isNull())
df3 = df.where(f.col("name").isNotNull() & f.col("age").isNotNull() & f.col("username").isNull() & f.col("password").isNull())
df4 = df.where(f.col("name").isNotNull() & f.col("age").isNull() & f.col("username").isNull() & f.col("password").isNull())
df5 = df.where(f.col("name").isNull() & f.col("age").isNotNull() & f.col("username").isNotNull() & f.col("password").isNotNull())
df6 = df.where(f.col("name").isNull() & f.col("age").isNull() & f.col("username").isNotNull() & f.col("password").isNotNull())
... and so on
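To get exactly the two DataFrames from the question, each filter can then be combined with a select of the matching columns. A minimal sketch assuming the original four-column df (the df_people / df_creds names are only for illustration):
import pyspark.sql.functions as f

# rows carrying name/age data, reduced to those two columns
df_people = df.where(f.col("name").isNotNull() & f.col("age").isNotNull()).select("name", "age")

# rows carrying username/password data, reduced to those two columns
df_creds = df.where(f.col("username").isNotNull() & f.col("password").isNotNull()).select("username", "password")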
Upvotes: 0