Reputation: 153
I have a DataFrame in Spark, something like this:
ID | Column
------ | ----
1 | STRINGOFLETTERS
2 | SOMEOTHERCHARACTERS
3 | ANOTHERSTRING
4 | EXAMPLEEXAMPLE
What I would like to do is take the first 5 characters of the column plus the 8th character, joined by an underscore, and put the result in a new column, something like this:
ID | New Column
------ | ------
1 | STRIN_F
2 | SOMEO_E
3 | ANOTH_S
4 | EXAMP_E
I can't use the following code, because the values in the column differ and I don't want to split on a specific character, but at a fixed position (the 6th character):
import pyspark
split_col = pyspark.sql.functions.split(DF['column'], ' ')
newDF = DF.withColumn('new_column', split_col.getItem(0))
Thanks all!
Upvotes: 14
Views: 45747
Reputation: 1352
Here is a solution using Spark 3.4.0 and Python 3.11:
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, lit, substring
# Create SparkSession
spark = SparkSession.builder.getOrCreate()
# Create the dataframe with sample data
data=spark.createDataFrame(
[(1,"STRINGOFLETTERS"),
(2,"SOMEOTHERCHARACTERS"),
(3,"ANOTHERSTRING"),
(4,"EXAMPLEEXAMPLE")],
["id","column"]
)
data.show()
#+---+-------------------+
#| id| column|
#+---+-------------------+
#| 1| STRINGOFLETTERS|
#| 2|SOMEOTHERCHARACTERS|
#| 3| ANOTHERSTRING|
#| 4| EXAMPLEEXAMPLE|
#+---+-------------------+
# Add a new column: first 5 characters + '_' + the 8th character (positions are 1-based)
df2 = data.withColumn(
    "new_column",
    concat(substring("column", 1, 5), lit('_'), substring("column", 8, 1))
)
df2.select("id","new_column").show()
#+---+----------+
#| id|new_column|
#+---+----------+
#| 1| STRIN_F|
#| 2| SOMEO_E|
#| 3| ANOTH_S|
#| 4| EXAMP_E|
#+---+----------+
Upvotes: 0
Reputation: 7742
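Use something like this:
from pyspark.sql.functions import concat, lit
# Column.substr(startPos, length) uses 1-based positions
df.withColumn('new_column', concat(df.Column.substr(1, 5),
                                   lit('_'),
                                   df.Column.substr(8, 1)))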
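This uses the Column method substr and the function concat: substr(1, 5) takes the first five characters, substr(8, 1) takes the 8th character, and concat joins them with the literal '_'.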
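If you want to sanity-check the positions without starting a Spark session, here is a rough plain-Python sketch of the same logic. The helpers substr_1based and make_key are hypothetical names made up for this illustration, not part of PySpark, and the sketch only mirrors the 1-based (start, length) convention for start positions of 1 or more.

```python
def substr_1based(s: str, start_pos: int, length: int) -> str:
    """Hypothetical stand-in for a 1-based (start, length) substring.

    Only meant for start_pos >= 1; it does not try to reproduce how
    Spark treats zero or negative start positions.
    """
    begin = start_pos - 1  # convert 1-based position to 0-based index
    return s[begin:begin + length]


def make_key(s: str) -> str:
    # First 5 characters, an underscore, then the 8th character
    return substr_1based(s, 1, 5) + "_" + substr_1based(s, 8, 1)


samples = ["STRINGOFLETTERS", "SOMEOTHERCHARACTERS", "ANOTHERSTRING", "EXAMPLEEXAMPLE"]
for value in samples:
    print(value, "->", make_key(value))
```

Note that a value shorter than 8 characters just gets an empty piece after the underscore in this sketch (for example "ABC" becomes "ABC_"). As far as I know, Spark's substring likewise returns an empty string in that case rather than raising an error, but check that on your own data.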
Upvotes: 21