Abhishek Tyagi

Reputation: 54

How to use the split function in Scala with an iteration on a column?

There will be one or more pairs of values in the column1 attribute, separated by semicolons. For each pair, take the value to the right of the hyphen and generate a record for that CompanyCode. For example, if column1 = GT01-5636;GT02-7212, records would be generated for CompanyCode 5636 and 7212. All attributes other than CompanyCode are duplicated.

+---------------------------------------+
|unparsed_data                          |
+---------------------------------------+
|GT01-5636;GT02-7212                    |
|Gx01-5626;GY01-1112;GL01-4336;GT09-0012|
+---------------------------------------+

I want my data in this form:

+-----------+
|CompanyCode|
+-----------+
|5636       |
|7212       |
|5626       |
|1112       |
|4336       |
|0012       |
+-----------+
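The transformation described above is pure string parsing: split on `;`, then keep the part after the hyphen. As a quick sanity check outside Spark, the rule can be sketched in plain Python (the function name extract_codes is illustrative, not part of any Spark API):

```python
def extract_codes(unparsed: str) -> list:
    """Split on ';' into pairs, then keep the part after the hyphen."""
    return [pair.split('-')[1] for pair in unparsed.split(';')]

rows = ["GT01-5636;GT02-7212",
        "Gx01-5626;GY01-1112;GL01-4336;GT09-0012"]

# Flatten all rows into one list of company codes
codes = [code for row in rows for code in extract_codes(row)]
print(codes)  # → ['5636', '7212', '5626', '1112', '4336', '0012']
```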

Upvotes: 0

Views: 158

Answers (2)

samkart

Reputation: 6654

A non-regex way using split and explode. I'm writing it in pyspark, but it'd be fairly similar in Scala.

from pyspark.sql import functions as func

data_sdf. \
    withColumn('data_split', func.split('unparsed_data', ';')). \
    selectExpr('explode(data_split) as split_data'). \
    withColumn('company_code', func.split('split_data', '-')[1]). \
    show()

# +----------+------------+
# |split_data|company_code|
# +----------+------------+
# | GT01-5636|        5636|
# | GT02-7212|        7212|
# | Gx01-5626|        5626|
# | GY01-1112|        1112|
# | GL01-4336|        4336|
# | GT09-0012|        0012|
# +----------+------------+

Upvotes: 0

过过招

Reputation: 4244

You can use the regexp_extract_all function to extract all matching substrings, and then use the explode function to expand them into rows.

val df1 = df.selectExpr("explode(regexp_extract_all(unparsed_data, '(.?+)-(\\\\d+)', 2)) as CompanyCode")
df1.show()
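Note that regexp_extract_all uses Java regex syntax, where the possessive quantifier in (.?+) is legal; either way, it is group 2 (the digits after the hyphen) that is extracted. What group 2 captures can be checked in plain Python with a non-possessive stand-in pattern (the sketch below uses `([^;-]+)-(\d+)` as an illustrative equivalent, not the exact pattern from the answer):

```python
import re

# Non-possessive stand-in for the Java pattern used above;
# group 2 captures the digits after each hyphen.
pattern = re.compile(r'([^;-]+)-(\d+)')

rows = ["GT01-5636;GT02-7212",
        "Gx01-5626;GY01-1112;GL01-4336;GT09-0012"]

codes = [m.group(2) for row in rows for m in pattern.finditer(row)]
print(codes)  # → ['5636', '7212', '5626', '1112', '4336', '0012']
```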

Upvotes: 1
