Reputation: 23
I have around 275 columns and I would like to search 25 of them for the regex string "^D(410|412)". If this search string is present in any of the 25 columns, I would like to add true to MyNewColumn.
Using the code below I could do it for 2 columns. Is there any way to pass a variable number of columns?
The code below works for 2 columns:
import re
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def moreThanTwoArgs(col1, col2):
    return bool(re.search("^D(410|412)", col1) or re.search("^D(410|412)", col2))

twoUDF = udf(moreThanTwoArgs, BooleanType())
df = df.withColumn("MyNewColumn", twoUDF(df["X1"], df["X2"]))
Upvotes: 2
Views: 3622
Reputation: 4420
I tried something similar. Here is some sample code; try it and adapt it to your case:
df1 = sc.parallelize(
[
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
]).toDF(['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10'])
df1.show()
+---+---+---+---+---+---+---+---+---+---+
| c1| c2| c3| c4| c5| c6| c7| c8| c9|c10|
+---+---+---+---+---+---+---+---+---+---+
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10|
+---+---+---+---+---+---+---+---+---+---+
import pyspark.sql.functions as F
import pyspark.sql.types as T
import re
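def booleanFindFunc(*args):
    # *args accepts any number of columns; sum() works on numbers and on Column objects
    return sum(args)
# UDF wrapper (the columns hold integers, so it returns a long, not a string).
# The calls below use booleanFindFunc directly on Column objects; use udfBoolean(...) instead to run it as a UDF.
udfBoolean = F.udf(booleanFindFunc, T.LongType())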
# Sum of three columns (c1 + c2 + c2)
df1.withColumn("MyNewColumn", booleanFindFunc(F.col("c1"), F.col("c2"), F.col("c2"))).show()
+---+---+---+---+---+---+---+---+---+---+-----------+
| c1| c2| c3| c4| c5| c6| c7| c8| c9|c10|MyNewColumn|
+---+---+---+---+---+---+---+---+---+---+-----------+
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 5|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 5|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 5|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 5|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 5|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 5|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 5|
+---+---+---+---+---+---+---+---+---+---+-----------+
# Sum of all columns (c1 + c2 + c3 + ... + c10)
df1.withColumn("MyNewColumn", booleanFindFunc(*[F.col(i) for i in df1.columns])).show()
+---+---+---+---+---+---+---+---+---+---+-----------+
| c1| c2| c3| c4| c5| c6| c7| c8| c9|c10|MyNewColumn|
+---+---+---+---+---+---+---+---+---+---+-----------+
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 55|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 55|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 55|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 55|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 55|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 55|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 55|
+---+---+---+---+---+---+---+---+---+---+-----------+
# Sum of all odd-numbered columns (c1 + c3 + c5 + ... + c9)
df1.withColumn("MyNewColumn", booleanFindFunc(*[F.col(i) for i in df1.columns if int(i[1:])%2])).show()
+---+---+---+---+---+---+---+---+---+---+-----------+
| c1| c2| c3| c4| c5| c6| c7| c8| c9|c10|MyNewColumn|
+---+---+---+---+---+---+---+---+---+---+-----------+
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 25|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 25|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 25|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 25|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 25|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 25|
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 25|
+---+---+---+---+---+---+---+---+---+---+-----------+
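Applied to the regex search in the question, the same *args pattern might look like the sketch below. It is only an illustration under a few assumptions: the helper names any_match and add_match_flag are made up here, values are converted with str() before matching, and nulls are treated as non-matching. The Spark imports sit inside the helper so the matching logic can be tried without a cluster.

```python
import re

# Pattern taken from the question
PATTERN = r"^D(410|412)"

def any_match(*values):
    """Return True if any non-null value matches PATTERN at its start."""
    return any(
        v is not None and re.search(PATTERN, str(v)) is not None
        for v in values
    )

def add_match_flag(df, columns, new_column="MyNewColumn"):
    """Add a boolean column that is true when any of `columns` matches.

    Hypothetical helper: `df` is a Spark DataFrame, `columns` a list of names.
    """
    # Imported here so any_match can be exercised without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql.types import BooleanType

    match_udf = F.udf(any_match, BooleanType())
    # Unpack the list so each column becomes one positional UDF argument
    return df.withColumn(new_column, match_udf(*[F.col(c) for c in columns]))
```

With the question's DataFrame this would be called as add_match_flag(df, ["X1", "X2"]), extending the list to all 25 column names.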
Hope this solves your problem.
Upvotes: 5