Reputation: 1949
I have a PySpark dataframe like so, with rows that contain 'zodiac' and rows that contain 'nonzodiac':
from pyspark.sql import functions as f

df = spark.createDataFrame(
    [
        (1, '1234ESPNnonzodiac'),
        (2, '1234ESPNzodiac'),
        (3, '963CNNnonzodiac'),
        (4, '963CNNzodiac'),
    ],
    ['id', 'col1']
)
I can get all the nonzodiac rows like so:
nonzodiac_rows = df.where(f.col("col1").rlike("nonzodiac"))
But I'm having trouble returning only the rows with 'zodiac', since doing something similar returns both the zodiac and nonzodiac rows:
zodiac_rows = df.where(f.col("col1").rlike("zodiac"))
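To make the issue concrete, here is what I observe with the dataframe defined above:

df.where(f.col("col1").rlike("zodiac")).show()
# returns all four rows (both the 'zodiac' and 'nonzodiac' ones),
# but I only want ids 2 and 4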
Upvotes: 0
Views: 958
Reputation: 38335
It is because 'nonzodiac' contains the 'zodiac' substring. You need to write a stricter regexp. For example, add one more character:
zodiac_rows = df.where(f.col("col1").rlike("Nzodiac"))
Or rule out 'non' before 'zodiac' with a negative lookbehind:
zodiac_rows = df.where(f.col("col1").rlike("(?<!non)zodiac"))
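As a quick sketch, running the lookbehind version against the question's dataframe (with f being pyspark.sql.functions, as in the question) keeps only the rows you want:

# (?<!non) is a negative lookbehind: match 'zodiac' only when it is
# not immediately preceded by 'non'
zodiac_rows = df.where(f.col("col1").rlike("(?<!non)zodiac"))
zodiac_rows.show()
# keeps only ids 2 and 4: '1234ESPNzodiac' and '963CNNzodiac'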
Upvotes: 2
Reputation: 837
You can do the opposite of your "nonzodiac" where clause with the logical not operator ~:
zodiac_rows = df.where(~f.col("col1").rlike("nonzodiac"))
Besides, there is the filter function, which may help you in a different way:
zodiac_rows = df.filter(~f.col('col1').contains('nonzodiac')).collect()
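A short sketch of both variants against the question's dataframe (assuming f is pyspark.sql.functions, as in the question):

# where and filter are aliases; ~ negates the boolean condition
df.where(~f.col("col1").rlike("nonzodiac")).show()
# leaves only ids 2 and 4

# contains does a plain substring check (no regex); collect() returns
# a list of Row objects rather than a DataFrame
rows = df.filter(~f.col('col1').contains('nonzodiac')).collect()
# [Row(id=2, col1='1234ESPNzodiac'), Row(id=4, col1='963CNNzodiac')]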
Upvotes: 0