monkey intern

Reputation: 726

Create a dataframe only selecting rows that match condition

I have a big table in Hive (dozens to hundreds of millions of rows) from which I want to select only the rows that match a regex.

Currently I have a small example to try my code first:

columns = ['id', 'column']
vals = [
(1, "VAL_ID1 BD store"),
(2, "VAL_ID2 BD store"),
(3, "VAL_ID3 BD model"),
(4, "BAD WRONG")
]

df = spark.createDataFrame(vals, columns)

And then I have a tested regex that goes like this:

from pyspark.sql import functions as F

df_regex = df.withColumn('newColumn', F.regexp_extract(df['id'], r'^(([a-zA-Z]{2}[a-zA-Z0-9]{1})+(_[a-zA-Z]{2}[a-zA-Z0-9]{1})*)(\s|$)', 1))

As I said, this is a test dataframe. In the future I will make it "look" at a very large table. Is there any way to keep only the rows that match the regex, and thus create a much smaller dataframe?

As it is right now, I am reading every single row and then adding a column with withColumn that is empty for the rows that do not match the regex. Which makes sense, but I feel like there is a benefit in not reading this dataframe twice if I can avoid it.

Upvotes: 0

Views: 84

Answers (1)

Steven

Reputation: 15283

You probably want to use where.

df.where(
    F.regexp_extract(df['id'], r'^(([a-zA-Z]{2}[a-zA-Z0-9]{1})+(_[a-zA-Z]{2}[a-zA-Z0-9]{1})*)(\s|$)', 1) != F.lit('')
)

Actually, I tried your regex and it gives no results: it is applied to the id column, which contains only integers, so nothing ever matches. But as long as you understand the principle, I think you can use that solution.
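For example, here is a sketch that I would expect to actually return rows, assuming the regex was meant to run against the text column column rather than the integer id (that part is my guess):

from pyspark.sql import functions as F

# Keep only the rows where the extraction is non-empty,
# applying the regex to the text column instead of the integer id.
df_small = df.where(
    F.regexp_extract(
        df['column'],
        r'^(([a-zA-Z]{2}[a-zA-Z0-9]{1})+(_[a-zA-Z]{2}[a-zA-Z0-9]{1})*)(\s|$)',
        1
    ) != F.lit('')
)

df_small.show()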


EDIT:

I feel like there is benefit in not reading this dataframe two times if I can avoid it.

Spark will read your data only when you perform an "action". Transformations are lazy and are therefore evaluated only when an action is called, so there is no need to worry about Spark reading your data twice (or more).
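As a quick illustration of that laziness (my_db.my_table is just a placeholder name for your real Hive table): defining the filtered DataFrame reads nothing at all, and the table is scanned once, when the action runs.

from pyspark.sql import functions as F

# Nothing is read from Hive here: this only builds a query plan.
# 'my_db.my_table' is a placeholder for the real table name.
filtered = spark.table('my_db.my_table').where(
    F.regexp_extract(
        F.col('column'),
        r'^(([a-zA-Z]{2}[a-zA-Z0-9]{1})+(_[a-zA-Z]{2}[a-zA-Z0-9]{1})*)(\s|$)',
        1
    ) != ''
)

filtered.explain()             # the plan shows a single table scan with the filter on top of it
row_count = filtered.count()   # the action: the table is read once, filtering rows as they are read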

Upvotes: 2
