Reputation: 1944
I have a DataFrame with a column foo that contains values such as:
foo
abcdef_zh
abcdf_grtyu_zt
pqlmn@xl
From these values I want to create two columns, split at the last delimiter:
Part 1        Part 2
abcdef        zh
abcdf_grtyu   zt
pqlmn         xl
The code I am using for this is:
data = data.withColumn("Part 1",split(data["foo"],substring(data["foo"],-3,1))).get_item(0)
data = data.withColumn("Part 2",split(data["foo"],substring(data["foo"],-3,1))).get_item(1)
However, I am getting the error "Column is not iterable".
Upvotes: 0
Views: 2649
Reputation: 1
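A PySpark UDF that splits on the last occurrence of a delimiter:
import pyspark.sql.functions as F
import pyspark.sql.types as T

@F.udf(returnType=T.ArrayType(T.StringType()))
def split_by_last_delm(value, delimiter):
    # Split once from the right, so only the last delimiter separates the two parts.
    if value is None:
        return None
    return value.rsplit(delimiter, 1)

# F.lit wraps the delimiter so it is passed as a literal value, not read as a column name.
data = data.withColumn("Part 1", split_by_last_delm(data["foo"], F.lit("_")).getItem(0))
data2 = data.withColumn("Part 2", split_by_last_delm(data["foo"], F.lit("_")).getItem(1))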
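The core of the UDF above is `str.rsplit(delimiter, 1)`, which can be checked in plain Python without a Spark session. The sketch below is only an illustration: the helper name `split_by_last_delimiter` is hypothetical, and the sample values are the ones from the question.

```python
def split_by_last_delimiter(value, delimiter):
    """Split value once, at the last occurrence of delimiter."""
    if value is None:
        return None
    # maxsplit=1 with rsplit means only the right-most delimiter is used.
    return value.rsplit(delimiter, 1)


# Sample values taken from the question.
for sample in ["abcdef_zh", "abcdf_grtyu_zt", "pqlmn@xl"]:
    print(sample, "->", split_by_last_delimiter(sample, "_"))
```

Note that "pqlmn@xl" contains no "_", so the result has a single element and there is no second part for that row. Handling the "@" case would need a different delimiter for that value.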
Upvotes: 0
Reputation: 11587
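The following should work, assuming the last part is always two characters long and is preceded by a single delimiter character: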
>>> from pyspark.sql import Row
>>> from pyspark.sql.functions import expr
>>> df = sc.parallelize(['abcdef_zh', 'abcdfgrtyu_zt', 'pqlmn@xl']).map(lambda x: Row(x)).toDF(["col1"])
>>> df.show()
+-------------+
|         col1|
+-------------+
|    abcdef_zh|
|abcdfgrtyu_zt|
|     pqlmn@xl|
+-------------+
>>> df.withColumn('part2',df.col1.substr(-2, 3)).withColumn('part1', expr('substr(col1, 1, length(col1)-3)')).select('part1', 'part2').show()
+----------+-----+
|     part1|part2|
+----------+-----+
|    abcdef|   zh|
|abcdfgrtyu|   zt|
|     pqlmn|   xl|
+----------+-----+
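The positional logic behind the substr calls above can be sketched in plain Python as a sanity check. This is not Spark code: the function name `split_fixed_suffix` and the parameter `suffix_len` are hypothetical, and the sketch assumes a fixed-length suffix preceded by exactly one delimiter character.

```python
def split_fixed_suffix(value, suffix_len=2):
    """Return (part1, part2), where part2 is the last suffix_len characters
    and the single character just before it is treated as the delimiter."""
    if value is None or len(value) < suffix_len + 1:
        return None
    part1 = value[: -(suffix_len + 1)]  # everything before the delimiter
    part2 = value[-suffix_len:]         # the fixed-length suffix
    return part1, part2


# Sample values taken from the question.
for sample in ["abcdef_zh", "abcdf_grtyu_zt", "pqlmn@xl"]:
    print(sample, "->", split_fixed_suffix(sample))
```

Because the split is purely positional, it works for both "_" and "@" here, but it would give wrong results for suffixes that are not exactly two characters long.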
Upvotes: 1