Manish

Reputation: 33

Filtering out rows that don't fit the schema in PySpark

I have a file named employee.csv with two columns: empid (integer) and empname (string). I read the file into a dataframe d1 by defining the schema, and into another dataframe d2 as-is. employee.csv contains the following data:

01,A
02,B
3,C
D,d
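For reference, the two reads might look something like this (a minimal sketch; the file path and the toDF() column names are assumptions, since the CSV has no header row):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

# d1: read against an explicit schema; an empid that fails to parse becomes null
schema = StructType([
    StructField("empid", IntegerType(), True),
    StructField("empname", StringType(), True),
])
d1 = spark.read.csv("employee.csv", schema=schema)

# d2: read as-is; both columns come back as strings
d2 = spark.read.csv("employee.csv").toDF("empid", "empname")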

I want to list the rows where empid is not an integer. I converted the empid column in d2 to integer and used subtract to find the bad rows, but now the row D,d comes out as null,d in the result of subtract, because the cast turns the unparseable value D into null and the original value is lost. How do I get the desired output?

I also tried to filter out the rows that fail to cast to integer, but that doesn't seem to work either:

d3 = d2.filter(d2["empid"].cast("int").isNull())

Please let me know how we can achieve this.

Upvotes: 0

Views: 56

Answers (1)

iBeMeltin

Reputation: 1787

To return a list of the empid values that are not integers, you can use filter() with a regex test via rlike() (PySpark columns don't support pandas-style astype()/str.isdigit(), so a pattern match does the job):

out = [row['empid'] for row in d2.filter(~d2['empid'].rlike('^[0-9]+$')).collect()]
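Against the sample data above, and assuming d2 still holds the raw string read, this should yield ['D']. If you want the full offending rows rather than just the ids, skip the collect and keep the filtered DataFrame, e.g. d2.filter(~d2['empid'].rlike('^[0-9]+$')).show().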

Upvotes: 0
