jbly1309

Reputation: 41

PySpark: Filter out all lines that have more columns than the header line

df = spark.read.csv('input.csv', header=True, inferSchema=True)

Let's say that the 'input.csv' file contains the following data:

id, name, age
1, John, 20
2, Mike, 33
3, Phil, 19, 180, 78
4, Sean, 40

I would like to filter out rows that contain more columns than the header and save them to a different output, somewhat like this (illustratively):

df2 = df.filter(condition1)  # condition1: rows with more columns than the header
df = df.filter(condition2)   # condition2: rows with at most as many columns as the header
df.show()
df2.show()

So I would get output as follows:

+---+------+----+
| id|  name| age|
+---+------+----+
|  1|  John|  20|
|  2|  Mike|  33|
|  4|  Sean|  40|
+---+------+----+


+---+------+----+
| id|  name| age|
+---+------+----+
|  3|  Phil|  19|
+---+------+----+

So far I've found nothing that works: spark.read.csv simply truncates the longer rows to fit the header, with no way of recovering the dropped values. What can I do? Thanks

EDIT: The schema does not necessarily need to be "id", "name", "age". It should literally take anything, so the filtering cannot depend on a specific column. Moreover, the solution cannot be tied to a specific way of reading the data; the input will be whatever the user chooses, and the only things that can be modified are parameters and options.
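To pin down the rule independent of any schema, here is the intended split sketched in plain Python (illustrative only, not a Spark solution): a row is rejected exactly when it has more comma-separated fields than the header line, whatever the header happens to contain.

```python
# Illustrative plain-Python sketch of the desired rule:
# a row is filtered out when it has more fields than the header.
lines = [
    'id, name, age',
    '1, John, 20',
    '2, Mike, 33',
    '3, Phil, 19, 180, 78',
    '4, Sean, 40',
]

header_len = len(lines[0].split(','))  # 3 here, but never hardcoded

good = [line for line in lines[1:] if len(line.split(',')) <= header_len]
bad = [line for line in lines[1:] if len(line.split(',')) > header_len]

print(good)  # the three well-formed rows
print(bad)   # ['3, Phil, 19, 180, 78']
```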

Upvotes: 1

Views: 696

Answers (1)

mck

Reputation: 42382

You can read the file as plain text and find the number of columns in each row. Build a dataframe with the ids of the rows having more than 3 columns (the width of the header). Then do a semi or anti join against the dataframe obtained with read.csv, using the id.

text = spark.read.text('input.csv')
textdf = text.selectExpr(
    "value",
    "size(split(value, ',')) len",
    "split(value, ',')[0] id"
).filter('len > 3')

textdf.show()
+----------------+---+---+
|           value|len| id|
+----------------+---+---+
|3,Phil,19,180,78|  5|  3|
+----------------+---+---+


df = spark.read.csv('input.csv', header=True)

df1 = df.join(textdf, 'id', 'anti')
df2 = df.join(textdf, 'id', 'semi')

df1.show()
+---+----+---+
| id|name|age|
+---+----+---+
|  1|John| 20|
|  2|Mike| 33|
|  4|Sean| 40|
+---+----+---+

df2.show()
+---+----+---+
| id|name|age|
+---+----+---+
|  3|Phil| 19|
+---+----+---+

Upvotes: 2
