Reputation: 375
My code below works; however, it replaces all nulls in the dataframe with "NI". I only want to replace nulls in the columns that are being renamed, and I want to do this without hardcoding any column names.
df = datasetMatchedDomains
for i in TRUE_matchedAttributeName_List.keys():
    df = df.withColumnRenamed(i, TRUE_matchedAttributeName_List[i])
df_final = df.na.fill('NI')
display(df_final)
else:
    print("clean")
Upvotes: 2
Views: 140
Reputation: 1405
You can pass the subset of columns you want to fill to df.na.fill (or the equivalent df.fillna) through its subset argument. The PySpark documentation for fillna has more details.
Here is an example:
# sc is an existing SparkContext
df = sc.parallelize([
    ("portfolio1", None, "star1"),
    (None, "Lease", "star2"),
    ("portfolio2", None, "star3"),
]).toDF(["a", "b", "c"])
df.show()
+----------+-----+-----+
|         a|    b|    c|
+----------+-----+-----+
|portfolio1| null|star1|
|      null|Lease|star2|
|portfolio2| null|star3|
+----------+-----+-----+
TRUE_matchedAttributeName = {'a': 'a1'}
subset = []
for i in TRUE_matchedAttributeName.keys():
    # Collect each new column name so that only renamed columns are filled
    subset.append(TRUE_matchedAttributeName[i])
    df = df.withColumnRenamed(i, TRUE_matchedAttributeName[i])
df.fillna('source not implemented', subset=subset).show(truncate=False)
+----------------------+-----+-----+
|a1                    |b    |c    |
+----------------------+-----+-----+
|portfolio1            |null |star1|
|source not implemented|Lease|star2|
|portfolio2            |null |star3|
+----------------------+-----+-----+
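Applied to the code in the question, the same idea could be wrapped in a small helper. This is only a sketch: rename_and_fill and its parameter names are made up for illustration, and it assumes the dictionary maps old column names to new ones, as TRUE_matchedAttributeName_List appears to. It relies only on withColumnRenamed, fillna and the columns attribute of the DataFrame.

```python
def rename_and_fill(df, rename_map, fill_value="NI"):
    """Rename columns per rename_map, then fill nulls only in the renamed columns.

    rename_and_fill is a hypothetical helper, not part of PySpark.
    rename_map is assumed to be {old_name: new_name}.
    """
    for old_name, new_name in rename_map.items():
        df = df.withColumnRenamed(old_name, new_name)

    # Keep only new names that are present after renaming, so a key that
    # did not match any column never ends up in the subset.
    renamed = [name for name in rename_map.values() if name in df.columns]
    if not renamed:
        return df  # nothing was renamed, so nothing to fill

    # fillna with a subset leaves every other column untouched.
    return df.fillna(fill_value, subset=renamed)


# Usage with the names from the question (needs a running Spark session):
# df_final = rename_and_fill(datasetMatchedDomains, TRUE_matchedAttributeName_List)
# display(df_final)
```

Note that a string fill value only affects string columns; non-string columns listed in the subset are ignored by fillna, so numeric columns would need a separate call with a numeric value.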
Upvotes: 1