Reputation: 143
I use Spark 2 from PySpark. The dataframe looks like
a = [('n_a xxxx 1111',0), ('n_A xxsssxx 1211',0),('n_a 1111',0),('n_c xxxx 1111',0)]
a = spark.createDataFrame(a, ['des', 'id'])
a.show(10,False)
I would like to select the row which des_column starts
with 'n_a'(case insensitive) and get the first 4 digits number to build a new column the result should looks like
b = [('n_a 1111', ), ('n_A 1211', ),('n_a 1111', )]
b = spark.createDataFrame(b, ['new_column'])
b.show(10, False)
Upvotes: 0
Views: 4063
Reputation: 35229
You can use regexp_extract
from pyspark.sql.functions import *
r = ("(?i)" # Case insensitive
"^(n_a)" # Leading n_a
"(?:\s\S+\s|\s)" # Either whitespace string whitespace or whitespace
"([0-9]{4})") # Four digit number
a.select("id", concat_ws(
" ",
regexp_extract("des", r, 1), # n_a prefix
regexp_extract("des", r, 2) # number
).alias("new_column")).where(trim(col("new_column")) != "")
which gives:
+---+----------+
| id|new_column|
+---+----------+
| 0| n_a 1111|
| 0| n_A 1211|
| 0| n_a 1111|
+---+----------+
Upvotes: 3
Reputation: 427
Try the code snippet below:
from pyspark.sql.functions import concat_ws
from pyspark.sql.functions import regexp_extract
list = ['n_a', 'n_A']
a.where(a.des.substr(1, 3).isin(list)).select(concat_ws(' ', regexp_extract('des', '(\w\_\w).*(\d\d\d\d).*', 1), regexp_extract('des', '(\w\_\w).*(\d\d\d\d).*', 2)).alias('new_column')).show(10, False)
it gives you:
+----------+
|new_column|
+----------+
|n_a 1111 |
|n_A 1211 |
|n_a 1111 |
+----------+
Hope this solved your problem.
It mainly uses two functions:
where(a.des.substr(1, 3).isin(list))
for the reqirement "select column value by start with special string"
Then the concat:
concat_ws(' ', regexp_extract('des', '(\w\_\w).*(\d\d\d\d).*', 1), regexp_extract('des', '(\w\_\w).*(\d\d\d\d).*', 2)).alias('new_column')
for "get the first 4 digits number to build a new column", which is using regex:
(\w\_\w).*(\d\d\d\d).*
Upvotes: 1
Reputation: 2696
Try this:
import pyspark.sql.functions as f
a=a.withColumn('des_clean', f.regexp_replace('des', "\s([^\s]+)\s", " "))
a=a.withColumn('split',f.split(f.col('des_clean'), "\s"))
a=a.withColumn('first', f.col('split').getItem(0))
a=a.withColumn('second', f.col('split').getItem(1))
a=a.filter("first in ('n_a', 'n_A')")
a.select('des_clean').show(10)
Upvotes: 0