Reputation: 119
I have a DataFrame that sometimes contains null values. I would like to replace each null with a value drawn from a range (for example, a random integer from 0 to 100) rather than with the same constant every time.
The na.fill() function doesn't seem to allow this, and I can't find a good way to do it manually.
I am working with Python and Spark 2.2.
Upvotes: 2
Views: 265
Reputation: 119
I finally came up with the following solution. I hope it helps anyone with the same need I had, which was to remove the null values from a DataFrame column and replace them with random values:
    import random
    from pyspark.sql import Row
    from pyspark.sql.types import StructType, StructField, StringType

    def newRow(model, dataframe):
        # model is (column name, formula as a string), e.g. ("A", "random.uniform(40, 80)")
        column, formula = model
        columns = ["A", "B", "C", "D", "E"]
        schema = StructType([StructField(c, StringType(), True) for c in columns])
        rows = []
        # Collect the rows where the target column is null once, not once per iteration
        for x in dataframe.where(dataframe[column].isNull()).collect():
            values = {c: x[c] for c in columns}
            # Run the formula string for every row so each null gets its own value;
            # an explicit namespace makes exec() behave the same in Python 2 and 3
            namespace = {}
            exec("value = " + formula, globals(), namespace)
            # The schema is all strings, so store the generated value as a string
            values[column] = str(namespace["value"])
            rows.append(Row(*[values[c] for c in columns]))
        # sqlContext is assumed to exist already, as in the pyspark shell
        return sqlContext.createDataFrame(rows, schema)
How to call the function:
dfAmodel = newRow(("A","random.uniform(40, 80)"), df1)
df2 = df1.na.drop(subset=['A']).union(dfAmodel)
Here dfAmodel is a new DataFrame with as many rows as there are null values in column A of the input DataFrame df1. Only the null values in A are changed; the values in the other columns remain the same.
Misc:
exec() is pretty useful, as it executes code from a string.
The strength of this solution is that the function can be reused for as many DataFrames as you need: simply change the input DataFrame, specify the column to consider, and give the formula from which the new values will be taken.
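To make the "formula as a string" idea concrete, here is a minimal plain-Python sketch (no Spark involved) of evaluating such a string once per row, next to an alternative that passes a callable instead. The helper names values_from_formula and values_from_callable are hypothetical and only serve the illustration. Restricting the names the string can see limits accidents, but it is not a security boundary.

```python
import random

def values_from_formula(formula, count):
    """Evaluate a formula string `count` times, giving one fresh result per call."""
    # Compile once, evaluate many times; only `random` is visible to the formula.
    code = compile(formula, "<formula>", "eval")
    scope = {"random": random}
    return [eval(code, {"__builtins__": {}}, scope) for _ in range(count)]

def values_from_callable(make_value, count):
    """Same idea without code strings: call a function once per needed value."""
    return [make_value() for _ in range(count)]

from_string = values_from_formula("random.uniform(40, 80)", 3)
from_callable = values_from_callable(lambda: random.uniform(40, 80), 3)
print(from_string)
print(from_callable)
```

Both variants produce a new value on every call. The callable version avoids string evaluation entirely, which tends to be easier to debug when the formula grows.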
Upvotes: 1
Reputation: 6515
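Check each value in the column and, if it is null, replace it with a random number. Python's random module would be evaluated only once on the driver, so every null would receive the same number; Spark's rand() column function produces a different value for each row.
First, import pyspark.sql.functions as F. Then something like:
    lo, hi = 0, 100
    random_int = F.floor(F.rand() * (hi - lo + 1)) + lo
    df = df.withColumn("a", F.when(F.col("a").isNull(), random_int).otherwise(F.col("a")))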
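The difference between drawing one random number and drawing one per null cell can be seen in a small plain-Python sketch, where a list stands in for a column. The function names fill_constant and fill_per_cell are hypothetical, and this only illustrates the logic, not how Spark executes it.

```python
import random

def fill_constant(values, low, high, rng):
    """Draw one number up front: every null receives the same value."""
    fill = rng.randint(low, high)  # inclusive bounds, like randrange(low, high + 1)
    return [fill if v is None else v for v in values]

def fill_per_cell(values, low, high, rng):
    """Draw a new number for each null cell as it is encountered."""
    return [rng.randint(low, high) if v is None else v for v in values]

column = [7, None, 42, None, None]
rng = random.Random(0)  # seeded generator so runs are repeatable
same = fill_constant(column, 0, 100, rng)
varied = fill_per_cell(column, 0, 100, rng)
print(same)
print(varied)
```

In the first result all three former nulls share one number, which is the behaviour the question wants to avoid; in the second each null is drawn independently.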
Upvotes: 1