Reputation: 539
I have a dataframe as below:
>>> df
Name id
0 Tom 103
1 Jack 109
2 nick 9518
3 juli 1890
I want to create a new column, super_id, defined as follows: (i) if id has 3 digits, super_id is a zero followed by the first digit; (ii) if id has 4 digits, super_id is the first two digits.
>>> df
Name id super_id
0 Tom 103 01
1 Jack 109 01
2 nick 9518 95
3 juli 1890 18
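Both rules appear to reduce to a single operation: left-pad the id with zeros to 4 characters, then keep the first two. Below is a minimal plain-Python sketch of that rule, assuming ids never have more than 4 digits; the helper name `super_id` is only illustrative.

```python
def super_id(id_value: int) -> str:
    """Left-pad the id with zeros to 4 characters, then keep the first two."""
    # Assumption: ids have at most 4 digits, as in the sample data.
    return str(id_value).zfill(4)[:2]


ids = [103, 109, 9518, 1890]
result = [super_id(i) for i in ids]
print(result)  # ['01', '01', '95', '18']
```
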
I have the pandas code below for this, but I'm not sure how to convert it to PySpark.
import pandas as pd
# initialise data of lists.
data = {'Name':['Tom', 'Jack', 'nick', 'juli'],
'id':[103, 109, 9518, 1890]}
# Creates pandas DataFrame.
df = pd.DataFrame(data)
#Create super id
df['super_id'] = df.id.astype('int').astype('str').str.zfill(4).str[0:2]
My attempt in PySpark, which raises an error:
df= df.withColumn('super_id', df['id'].astype('int').astype('str').str.zfill(4).str[0:2])
Upvotes: 0
Views: 102
Reputation: 15258
You need to use Spark SQL functions for that:
from pyspark.sql import functions as F
df.withColumn("super_id", F.substring(F.lpad("id", 4, "0"), 1, 2)).show()
+-----+----+--------+
| name| id|super_id|
+-----+----+--------+
| Tom| 103| 01|
| jack| 109| 01|
| nick|9518| 95|
|julie|1890| 18|
+-----+----+--------+
Upvotes: 1