Reputation: 441
I am using Spark 1.6 with Scala.
I have created a DataFrame which looks like this:
DATA
SKU, MAKE, MODEL, GROUP, SUBCLS, IDENT
IM, AN4032X, ADH3M032, RM, 1011, 0
IM, A3M4936, MP3M4936, RM, 1011, 0
IM, AK116BC, 3M4936P, 05, ABC, 0
IM, A-116-B, 16ECAPS, RM, 1011, 0
I am doing data validation and want to capture, in a new DataFrame, any record that violates the rules.
Rule:
Column “GROUP” must be character
Column “SUBCLS” must be NUMERIC
Column “IDENT” must be 0
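The three rules above can be sketched as plain string predicates (hypothetical helper names, not Spark code), where "Y" marks a violation and "N" a pass, matching the audit-flag convention below:

```scala
// Hypothetical helpers mirroring the three rules; "Y" = violation, "N" = pass.
def auditGroup(s: String): String  = if (s.matches("^[A-Za-z]+$")) "N" else "Y" // alphabetic only
def auditSubcls(s: String): String = if (s.matches("^[0-9]+$")) "N" else "Y"    // numeric only
def auditIdent(s: String): String  = if (s == "0") "N" else "Y"                 // must be exactly 0

println(auditGroup("RM"))    // passes -> N
println(auditSubcls("ABC"))  // violates -> Y
println(auditIdent("0"))     // passes -> N
```

The same regexes can then be reused inside `rlike` on the DataFrame columns.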
The new DataFrame will look like this:
AUDIT TABLE
SKU, MAKE, AUDIT_SKU, AUDIT_MAKE, AUDIT_MODEL, AUDIT_GRP, AUDIT_SUBCLS, AUDIT_IDENT
IM, A-K12216BC, N, N, N, Y, Y, N
Y represents a rule violation and N represents a rule pass.
I have validated the rules using isNull or a regex. For example, checking the GROUP column with a regex:
df.where($"GROUP".rlike("^[A-Za-z]+$")).show
Could someone please help me do this in Spark SQL? Is it possible to create a DataFrame for the above scenario?
Thanks
Upvotes: 1
Views: 131
Reputation: 1892
You can use rlike, combining alternatives with | where needed, inside when/otherwise:
scala> val audited = df
  .withColumn("Group1",  when($"GROUP".rlike("^[A-Za-z]+$"), "N").otherwise("Y"))
  .withColumn("SUBCLS1", when($"SUBCLS".rlike("^[0-9]+$"), "N").otherwise("Y"))
  .withColumn("IDENT1",  when($"IDENT" === "0", "N").otherwise("Y"))
scala> audited.show()
+---+-------+--------+-----+------+-----+------+-------+------+
|SKU| MAKE| MODEL|GROUP|SUBCLS|IDENT|Group1|SUBCLS1|IDENT1|
+---+-------+--------+-----+------+-----+------+-------+------+
| IM|AN4032X|ADH3M032| RM| 1011| 0| N| N| N|
| IM|A3M4936|MP3M4936| 1RM| 1011| 0| Y| N| N|
| IM|AK116BC| 3M4936P| 05| ABC| 0| Y| Y| N|
| IM|A-116-B| 16ECAPS| RM1| 1011| 0| Y| N| N|
+---+-------+--------+-----+------+-----+------+-------+------+
I added a *1 version of each column for clarity only; you can overwrite the original columns instead. Let me know if you need any more help on this.
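Since the question asks for Spark SQL specifically, here is a sketch under the assumption that the DataFrame has been registered as a temp table named "products" (via registerTempTable, the Spark 1.6 API); the table name and output aliases are illustrative:

```scala
// Sketch only: assumes df.registerTempTable("products") has already been called.
// GROUP is a reserved word in SQL, so it is escaped with backticks.
// CASE WHEN emits "N" for a pass and "Y" for a violation, same as the audit table.
val auditSql = """
  SELECT SKU, MAKE,
         CASE WHEN `GROUP` RLIKE '^[A-Za-z]+$' THEN 'N' ELSE 'Y' END AS AUDIT_GRP,
         CASE WHEN SUBCLS  RLIKE '^[0-9]+$'    THEN 'N' ELSE 'Y' END AS AUDIT_SUBCLS,
         CASE WHEN IDENT = '0'                 THEN 'N' ELSE 'Y' END AS AUDIT_IDENT
  FROM products
"""
// val audit = sqlContext.sql(auditSql)
```

Running the query through sqlContext.sql gives the same flags as the withColumn version above.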
Upvotes: 1