Ravali

Reputation: 79

PySpark: how to pass column names dynamically to countDistinct

I have a CSV file with the headers FileName, ColumnName, Rule and RuleDetails.

InputFile

The Rule column holds several rules, such as NotNull, Max and Min. The "Unique" rule can apply to multiple columns; I need to pass those columns to countDistinct.

Code

If I pass the values dynamically instead of hardcoding them, I get the error below:

ErrorMessage

AnalysisException: Column '`"SITEID", "ASSETNUM"`' does not exist. Did you mean one of the following? [spark_catalog.maximo_dq.Assets_new.ASSETNUM, spark_catalog.maximo_dq.Assets_new.HasLD, spark_catalog.maximo_dq.Assets_new.SITEID, spark_catalog.maximo_dq.Assets_new.Status, spark_catalog.maximo_dq.Assets_new.SerialNumber, spark_catalog.maximo_dq.Assets_new.Description, spark_catalog.maximo_dq.Assets_new.InstallDate, spark_catalog.maximo_dq.Assets_new.Classification, spark_catalog.maximo_dq.Assets_new.LongDescription];

Similarly, how can I get the count of records that do not match the specified date format?

Input which has Rules

I need to check how many records in INSTALLDATE are not in the format given in RuleDetails.

Upvotes: 0

Views: 71

Answers (1)

Equinox

Reputation: 6748

Use argument unpacking (the * operator) to pass each column name as a separate argument:

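from pyspark.sql.functions import countDistinct
UNIQUUECOLSString = ['a', 'b', 'c']  # keep the column names in a list, not in one string
df.select(countDistinct(*UNIQUUECOLSString))  # * expands the list into separate arguments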
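
The error in the question suggests the column names reach countDistinct as one string (`"SITEID", "ASSETNUM"`), so they would need to be split into a list before unpacking. Here is a minimal sketch, assuming the CSV cell holds a comma-separated, optionally quoted list of names. `parse_column_list` and `count_distinct_rows` are hypothetical helper names used for illustration, not part of PySpark.

```python
def parse_column_list(raw):
    """Split a string such as '"SITEID", "ASSETNUM"' into ['SITEID', 'ASSETNUM']."""
    # Assumption: names are comma-separated and may be wrapped in quotes.
    names = (part.strip().strip("\"'") for part in raw.split(","))
    return [name for name in names if name]


def count_distinct_rows(df, raw_columns):
    """Count distinct combinations of the columns named in raw_columns."""
    # Imported here so the parsing helper above can be used without Spark.
    from pyspark.sql.functions import countDistinct

    columns = parse_column_list(raw_columns)
    # The * operator passes each column name as a separate argument.
    return df.select(countDistinct(*columns).alias("distinct_count")).first()[0]
```

For example, `count_distinct_rows(df, '"SITEID", "ASSETNUM"')` should count the distinct (SITEID, ASSETNUM) pairs in `df`, provided both columns exist in the DataFrame.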

Upvotes: 1
