Reputation: 287
I have a large file with product items in a PySpark DataFrame; some of the items are exact numbers and others contain strings. I want to remove all rows whose item is a number from the DataFrame, in a computationally efficient way.
|Product-1| Pj3f|
|Product-2| 12 |
|Product-3| Pj2F|
|Product-4| 20 |
How do I filter rows by the type of an item in a column of a PySpark DataFrame? The PySpark filter function does not seem to have functionality for this.
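For reference, a minimal sketch that reproduces the sample data above (the column names product and descr are just placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("Product-1", "Pj3f"), ("Product-2", "12"),
     ("Product-3", "Pj2F"), ("Product-4", "20")],
    ["product", "descr"],
)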
Upvotes: 0
Views: 2075
Reputation: 31540
Cast the column to int and then keep only the rows where the cast result is null, or use the .rlike function.

Example:
from pyspark.sql.functions import col

df.show()
#+---------+-----+
#| product|descr|
#+---------+-----+
#|Product-1| pj3f|
#|product-2| 12|
#+---------+-----+

# Option 1: cast to int; values that are not integers become null
df.filter(col("descr").cast("int").isNull()).show()

# Option 2: regex; drop rows that consist only of digits and whitespace
df.filter(~col("descr").rlike(r"^([\s\d]+)$")).show()
#+---------+-----+
#| product|descr|
#+---------+-----+
#|Product-1| pj3f|
#+---------+-----+
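If the numeric values can also contain decimals (e.g. "12.5"), a small variation of the cast approach (not in the original answer) is to cast to double instead of int:

from pyspark.sql.functions import col

# Decimal strings cast to double as well, so only genuinely
# non-numeric rows survive the isNull() filter
df.filter(col("descr").cast("double").isNull()).show()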
Upvotes: 2
Reputation: 1118
All values in a Spark column have the same type. If you mix two columns of different types, for example with a union, Spark will try to convert to a type that is valid for both, usually String, and put the string representation of the values there.
Examples:
- A String column unioned with a Float column results in a String column, with the floats represented as strings using a dot for decimals: String + Float => String
- An Integer column unioned with a Float column converts all the integers into Floats: Integer + Float => Float
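A quick sketch of the second case in PySpark (assuming a local SparkSession; Python ints and floats map to bigint and double here):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

ints = spark.createDataFrame([(1,), (2,)], ["value"])        # value: bigint
floats = spark.createDataFrame([(1.5,), (2.5,)], ["value"])  # value: double

# Spark widens the union result to a common type, here double
ints.union(floats).printSchema()
# root
#  |-- value: double (nullable = true)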
For your case it will depend, but if it is a typical mix of strings and numerics I would go for regex filtering.
val stringsDF = df.filter(regexp_extract($"column", "^[0-9]+([.,][0-9]+)?$", 0) === "")
This keeps all rows whose value does not look like an integer or a float.
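Since the question is about PySpark, a rough equivalent sketch of the same regex filter (using the descr column name from the other answer's example):

from pyspark.sql.functions import col, regexp_extract

# Keep rows where no full integer/decimal match is found (regexp_extract returns "")
strings_df = df.filter(regexp_extract(col("descr"), r"^[0-9]+([.,][0-9]+)?$", 0) == "")
strings_df.show()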
Upvotes: 0