Manish

Reputation: 33

Filtering out rows that don't fit the schema in PySpark

I have a file named employee.csv with two columns: empid (integer) and empname (string). I read the file into a dataframe d1 by defining the schema, and into another dataframe d2 as-is. employee.csv contains the following data:

01,A
02,B
3,C
D,d
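For reference, the two reads might look something like this (a minimal sketch; the file path and the toDF() column names are assumptions, since the CSV has no header row):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

# d1: read against an explicit schema; an empid that fails to parse becomes null
schema = StructType([
    StructField("empid", IntegerType(), True),
    StructField("empname", StringType(), True),
])
d1 = spark.read.csv("employee.csv", schema=schema)

# d2: read as-is; both columns come back as strings
d2 = spark.read.csv("employee.csv").toDF("empid", "empname")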

I want to list the rows where empid is not an integer. I converted the empid column in d2 to integer and used subtract to find the bad rows, but now the row D,d comes out as null,d in the result of subtract, because the cast turns the unparseable value D into null and the original value is lost. How do I get the desired output?

I also tried to filter out the rows that fail to cast to integer, but that doesn't seem to work either:

d3 = d2.filter(d2["empid"].cast("int").isNull())

Please let me know how we can achieve this.

Upvotes: 0

Views: 56

Answers (1)

iBeMeltin

Reputation: 1787

To return a list of the empid values that are not integers, you can use filter() with a regex test via rlike() (PySpark columns don't support pandas-style astype()/str.isdigit(), so a pattern match does the job):

out = [row['empid'] for row in d2.filter(~d2['empid'].rlike('^[0-9]+$')).collect()]
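Against the sample data above, and assuming d2 still holds the raw string read, this should yield ['D']. If you want the full offending rows rather than just the ids, skip the collect and keep the filtered DataFrame, e.g. d2.filter(~d2['empid'].rlike('^[0-9]+$')).show().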

Upvotes: 0
