JanBennk

Reputation: 287

remove rows from pyspark dataframe by type

I have a large file with product items in a PySpark DataFrame; some of them are exact numbers and others contain strings. I want to remove all rows with numeric items from the DataFrame, in a computationally efficient way.

| Product-1 | Pj3f |
| Product-2 | 12   |
| Product-3 | Pj2F |
| Product-4 | 20   |

How do I filter rows by the type of an item in a column of a PySpark DataFrame? The PySpark filter function does not seem to have this functionality.


Upvotes: 0

Views: 2075

Answers (2)

notNull

Reputation: 31540

Cast the column to int, then keep only the rows where the cast result is null.

  • Or use the .rlike function.

Example:

from pyspark.sql.functions import col

df.show()
#+---------+-----+
#|  product|descr|
#+---------+-----+
#|Product-1| pj3f|
#|product-2|   12|
#+---------+-----+

# Option 1: rows whose descr can't be cast to int are kept
df.filter(col("descr").cast("int").isNull()).show()

# Option 2: drop rows that contain only digits (and whitespace)
df.filter(~col("descr").rlike(r"^([\s\d]+)$")).show()
#+---------+-----+
#|  product|descr|
#+---------+-----+
#|Product-1| pj3f|
#+---------+-----+
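
If the numeric rows can also carry a sign or decimals, the pattern needs to be broader; a minimal sketch, assuming the same product/descr columns as above:

# drop rows that are an optionally signed integer or decimal number
df.filter(~col("descr").rlike(r"^\s*[+-]?\d+([.,]\d+)?\s*$")).show()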

Upvotes: 2

Alfilercio

Reputation: 1118

Columns in Spark all have a single type. If you mix two columns of different types, for example with a union, Spark will try to convert to a type valid for both, usually String, and store the string representation of the values.

Examples:

  • A String column union a Float column results in a String column, with the floats rendered as strings using a dot for decimals. String + Float => String
  • An Integer column union a Float column converts all the integers to Floats. Integer + Float => Float (both coercions are sketched below)
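
A minimal PySpark sketch of both coercions; the tiny DataFrames here are just for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

ints = spark.createDataFrame([(1,), (2,)], "value long")
floats = spark.createDataFrame([(1.5,)], "value double")
strings = spark.createDataFrame([("a",)], "value string")

# Integer + Float => Float: the longs are widened to double
ints.union(floats).printSchema()     # value: double

# String + Float => String: the doubles become their string representation
strings.union(floats).printSchema()  # value: string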

For your case it will depend on whether the column is actually string or numeric; if it is a string column, I would go for regex filtering.

import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._

// keep rows where no numeric pattern is extracted
val stringsDF = df.filter(regexp_extract($"column", "[0-9]+([.,][0-9]+)?", 0) === "")

This keeps every row whose value does not contain an integer or float pattern anywhere in the string.
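
Since the question is about PySpark, a hedged equivalent of the same idea (the column name column is a placeholder):

from pyspark.sql.functions import regexp_extract

# keep rows where no integer/float pattern is found in the value
strings_df = df.filter(regexp_extract("column", r"[0-9]+([.,][0-9]+)?", 0) == "")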

Upvotes: 0
