Amanda C

Reputation: 153

Python spark extract characters from dataframe

I have a dataframe in spark, something like this:

ID     | Column
------ | ----
1      | STRINGOFLETTERS
2      | SOMEOTHERCHARACTERS
3      | ANOTHERSTRING
4      | EXAMPLEEXAMPLE

What I would like to do is extract the first 5 characters of the column plus the 8th character, joined by an underscore, and put the result in a new column, something like this:

ID     | New Column
------ | ------
1      | STRIN_F
2      | SOMEO_E
3      | ANOTH_S
4      | EXAMP_E

I can't use the following code, because the values in the column differ, and I don't want to split on a specific character but on the 6th character:

import pyspark
split_col = pyspark.sql.functions.split(DF['column'], ' ')
newDF = DF.withColumn('new_column', split_col.getItem(0))

Thanks all!

Upvotes: 14

Views: 45747

Answers (2)

Vijay_Shinde

Reputation: 1352

Here is a solution using Spark 3.4.0 and Python 3.11:

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, lit, substring

# Create SparkSession
spark = SparkSession.builder.getOrCreate()

# Create the dataframe with sample data
data = spark.createDataFrame(
    [(1, "STRINGOFLETTERS"),
     (2, "SOMEOTHERCHARACTERS"),
     (3, "ANOTHERSTRING"),
     (4, "EXAMPLEEXAMPLE")],
    ["id", "column"]
)

data.show()
#+---+-------------------+
#| id|             column|
#+---+-------------------+
#|  1|    STRINGOFLETTERS|
#|  2|SOMEOTHERCHARACTERS|
#|  3|      ANOTHERSTRING|
#|  4|     EXAMPLEEXAMPLE|
#+---+-------------------+

# Add a new column: first 5 characters, an underscore, then the 8th character
# (substring positions are 1-based)
df2 = data.withColumn(
    "new_column",
    concat(substring("column", 1, 5), lit("_"), substring("column", 8, 1)),
)

df2.select("id","new_column").show()
#+---+----------+
#| id|new_column|
#+---+----------+
#|  1|   STRIN_F|
#|  2|   SOMEO_E|
#|  3|   ANOTH_S|
#|  4|   EXAMP_E|
#+---+----------+

Upvotes: 0

Thiago Baldim

Reputation: 7742

Use something like this:

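from pyspark.sql.functions import concat, lit

# withColumn returns a new dataframe, so keep the result
new_df = df.withColumn('new_column', concat(df.Column.substr(1, 5),
                                            lit('_'),
                                            df.Column.substr(8, 1)))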

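This uses the column method substr together with the function concat. substr takes a 1-based start position and a length, so substr(1, 5) gives the first five characters and substr(8, 1) gives the 8th.

Those two functions will solve your problem.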
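
If you want to sanity-check the positions without starting Spark, here is a minimal plain-Python sketch. It assumes the 1-based (start position, length) semantics of substr for positions >= 1. The helper names spark_substr and new_value are made up for illustration and are not part of PySpark.

```python
def spark_substr(s: str, pos: int, length: int) -> str:
    """Hypothetical helper: mimic a 1-based substr(pos, length), for pos >= 1."""
    start = pos - 1  # convert the 1-based position to a 0-based Python index
    return s[start:start + length]


def new_value(s: str) -> str:
    """First 5 characters, an underscore, then the 8th character."""
    return spark_substr(s, 1, 5) + "_" + spark_substr(s, 8, 1)


# Sample rows taken from the question
rows = [
    (1, "STRINGOFLETTERS"),
    (2, "SOMEOTHERCHARACTERS"),
    (3, "ANOTHERSTRING"),
    (4, "EXAMPLEEXAMPLE"),
]

result = [(row_id, new_value(value)) for row_id, value in rows]
for row_id, value in result:
    print(row_id, value)
```

In this sketch, a value shorter than 8 characters just gets nothing after the underscore (for example "ABC" becomes "ABC_"). As far as I know Spark's substring also returns an empty string when the position is past the end, but check that on your own data if short values matter.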

Upvotes: 21
