KEXIN WANG
KEXIN WANG

Reputation: 143

Pyspark select column value by start with special string

I use Spark 2 from PySpark. The dataframe looks like

a = [('n_a xxxx 1111',0), ('n_A xxsssxx 1211',0),('n_a 1111',0),('n_c xxxx 1111',0)]
a = spark.createDataFrame(a, ['des', 'id'])
a.show(10,False)

I would like to select the row which des_column starts with 'n_a'(case insensitive) and get the first 4 digits number to build a new column the result should looks like

b = [('n_a 1111', ), ('n_A 1211', ),('n_a 1111', )]
b = spark.createDataFrame(b, ['new_column'])
b.show(10, False)

Upvotes: 0

Views: 4063

Answers (3)

Alper t. Turker
Alper t. Turker

Reputation: 35229

You can use regexp_extract

from pyspark.sql.functions import *

r = ("(?i)"            # Case insensitive
     "^(n_a)"          # Leading n_a
     "(?:\s\S+\s|\s)"  # Either whitespace string whitespace or whitespace
     "([0-9]{4})")     # Four digit number 

a.select("id", concat_ws(
    " ", 
    regexp_extract("des", r, 1),  #  n_a prefix
    regexp_extract("des", r, 2)   #  number
).alias("new_column")).where(trim(col("new_column")) != "")

which gives:

+---+----------+
| id|new_column|
+---+----------+
|  0|  n_a 1111|
|  0|  n_A 1211|
|  0|  n_a 1111|
+---+----------+

Upvotes: 3

chutium
chutium

Reputation: 427

Try the code snippet below:

from pyspark.sql.functions import concat_ws
from pyspark.sql.functions import regexp_extract
list = ['n_a', 'n_A']
a.where(a.des.substr(1, 3).isin(list)).select(concat_ws(' ', regexp_extract('des', '(\w\_\w).*(\d\d\d\d).*', 1), regexp_extract('des', '(\w\_\w).*(\d\d\d\d).*', 2)).alias('new_column')).show(10, False)

it gives you:

+----------+
|new_column|
+----------+
|n_a 1111  |
|n_A 1211  |
|n_a 1111  |
+----------+

Hope this solved your problem.

It mainly uses two functions:

where(a.des.substr(1, 3).isin(list))

for the reqirement "select column value by start with special string"

Then the concat:

concat_ws(' ', regexp_extract('des', '(\w\_\w).*(\d\d\d\d).*', 1), regexp_extract('des', '(\w\_\w).*(\d\d\d\d).*', 2)).alias('new_column')

for "get the first 4 digits number to build a new column", which is using regex:

(\w\_\w).*(\d\d\d\d).*

Upvotes: 1

ags29
ags29

Reputation: 2696

Try this:

import pyspark.sql.functions as f

a=a.withColumn('des_clean', f.regexp_replace('des', "\s([^\s]+)\s", " "))
a=a.withColumn('split',f.split(f.col('des_clean'), "\s"))
a=a.withColumn('first', f.col('split').getItem(0))
a=a.withColumn('second', f.col('split').getItem(1))
a=a.filter("first in ('n_a', 'n_A')")
a.select('des_clean').show(10)

Upvotes: 0

Related Questions