Reputation: 21
I have a spark dataframe with multiple columns and each column contains a string.
Eg Input:
+--------------+--------------+--------------+--------------+
| c1| c2 | c3| c4|
+--------------+--------------+--------------+--------------+
|11 - 12 - 1993| 4 | 4 | 2014 | 8 - 7 - 2013 | null |
|12 / 6 / 1965 | 8 - 6 - 2013 | date missing |11 - 12 - 1993|
|10 / 5 / 2001 | 7 - 11 - 2011| 4 | 5 | 2015 | 10 / 5 / 2001|
+--------------+--------------+--------------+--------------+
I need to return the columns where all the values match a particular regex pattern.
In this example, I have to return all the columns in which every value is a valid date. That is, columns c1 and c2, where every value is a valid date (irrespective of its format), should be returned.
Eg Output
+--------------+--------------+
| c1| c2 |
+--------------+--------------+
|11 - 12 - 1993| 4 | 4 | 2014 |
|12 / 6 / 1965 | 8 - 6 - 2013 |
|10 / 5 / 2001 | 7 - 11 - 2011|
+--------------+--------------+
I already have the regex. What would be the most efficient way to do this?
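For reference, the sample data above can be recreated roughly like this (just a sketch of the test data; all columns are strings):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# toy data matching the input table above; None stands for the null cell
df = spark.createDataFrame([
    ('11 - 12 - 1993', '4 | 4 | 2014', '8 - 7 - 2013', None),
    ('12 / 6 / 1965', '8 - 6 - 2013', 'date missing', '11 - 12 - 1993'),
    ('10 / 5 / 2001', '7 - 11 - 2011', '4 | 5 | 2015', '10 / 5 / 2001')
], 'c1 string, c2 string, c3 string, c4 string')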
Upvotes: 2
Views: 1323
Reputation: 13998
One way without using a regex:
from pyspark.sql.functions import coalesce, to_date
# list of all expected date formats
fmts = ['MM - d - yyyy', 'MM / d / yyyy', 'MM | d | yyyy']
# columns to check
cols = df.columns
# convert each column to a date: try to_date with every format and take the first non-null result via coalesce
# cast the result back to StringType (df.summary() doesn't work on DateType)
df1 = df.select([coalesce(*[to_date(c, format=f) for f in fmts]).astype('string').alias(c) for c in cols])
df1.show()
+----------+----------+----------+----------+
| c1| c2| c3| c4|
+----------+----------+----------+----------+
|1993-11-12|2014-04-04|2013-08-07| null|
|1965-12-06|2013-08-06| null|1993-11-12|
|2001-10-05|2011-07-11|2015-04-05|2001-10-05|
+----------+----------+----------+----------+
Now the task becomes checking which columns contain any null values.
# total number of rows in df
N = df.count()
# 3
# use df.summary('count') to find the number of non-null rows for each column
df1.summary('count').show()
+-------+---+---+---+---+
|summary| c1| c2| c3| c4|
+-------+---+---+---+---+
| count| 3| 3| 2| 2|
+-------+---+---+---+---+
Find the column names having count == N:
cols_keep = [ c for c,v in df1.summary('count').select(cols).first().asDict().items() if int(v) == N ]
# ['c1', 'c2']
df_new = df.select(cols_keep)
If you want to use your regex to handle this task:
from pyspark.sql.functions import regexp_extract
# a real pattern would need to be much stricter to properly validate dates; this one is just for testing
ptn = r'\d+ [-/|] \d+ [-/|] \d+'
df1 = df.select([ regexp_extract(c, ptn, 0).alias(c) for c in cols ]).replace('', None)
df1.show()
+--------------+-------------+------------+--------------+
| c1| c2| c3| c4|
+--------------+-------------+------------+--------------+
|11 - 12 - 1993| 4 | 4 | 2014|8 - 7 - 2013| null|
| 12 / 6 / 1965| 8 - 6 - 2013| null|11 - 12 - 1993|
| 10 / 5 / 2001|7 - 11 - 2011|4 | 5 | 2015| 10 / 5 / 2001|
+--------------+-------------+------------+--------------+
Then the task becomes the same as above: find and exclude the columns that contain null values.
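For example, reusing the summary('count') check from the first method on this df1:
cols_keep = [ c for c,v in df1.summary('count').select(cols).first().asDict().items() if int(v) == N ]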
You can also do this in one pass without using df.summary():
from pyspark.sql.functions import coalesce, lit, regexp_extract, sum, when
# treat null cells as empty strings so they are also counted as non-matches
df1 = df.select([ sum(when(regexp_extract(coalesce(c, lit('')), ptn, 0) == '', 1).otherwise(0)).alias(c) for c in cols ])
df1.show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| 0| 0| 1| 1|
+---+---+---+---+
cols_keep = [ c for c,v in df1.first().asDict().items() if not v ]
Then select the kept columns, similar to the first method without the regex.
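That is:
df_new = df.select(cols_keep)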
Upvotes: 1