Reputation: 45
I have a Spark DataFrame with a column 'X'. The column contains elements of the form:
u'[23,4,77,890,455,................]'
How can I convert this unicode string to a list? That is, my output should be
[23,4,77,890,455...................]
I need to apply this to each element in the 'X' column.
I tried df.withColumn("X_new", ast.literal_eval(x)) and got the error
"Malformed String".
I also tried
df.withColumn("X_new", json.loads(x)) and got the error "Expected String or Buffer",
and
df.withColumn("X_new", json.dumps(x)), which says the object is not JSON serializable,
and also
df_2 = df.rdd.map(lambda x: x.encode('utf-8')), which says the rdd has no attribute 'encode'.
I don't want to use collect() or toPandas() because they are memory-consuming (but if that's the only way, please do tell). I am using PySpark.
Update: cph_sto gave an answer using a UDF. Though it worked well, I find it slow. Can somebody suggest another method?
Upvotes: 1
Views: 10210
Reputation: 83
Since it's a string, you could remove the first and last characters:
from '[23,4,77,890,455]'
to '23,4,77,890,455'.
Then apply the split() function, taking ',' as the delimiter, to generate an array.
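The strip-then-split logic above can be sketched in plain Python (a minimal illustration of the string manipulation; the Spark equivalent would use the built-in regexp_replace and split column functions, which is an assumption not shown in this answer):

```python
s = u'[23,4,77,890,455]'

# Remove the first and last characters (the square brackets).
inner = s[1:-1]

# Split on ',' to get the individual values.
parts = inner.split(',')

print(inner)  # 23,4,77,890,455
print(parts)  # ['23', '4', '77', '890', '455']
```

Note that the result is an array of strings, not numbers; a cast (e.g. to float) would still be needed afterwards.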
Upvotes: 1
Reputation: 7585
import ast
from pyspark.sql.functions import udf, col
values = [(u'[23,4,77,890.455]',10),(u'[11,2,50,1.11]',20),(u'[10.05,1,22.04]',30)]
df = sqlContext.createDataFrame(values,['list','A'])
df.show()
+-----------------+---+
| list| A|
+-----------------+---+
|[23,4,77,890.455]| 10|
| [11,2,50,1.11]| 20|
| [10.05,1,22.04]| 30|
+-----------------+---+
# Creating a UDF to convert the string list to proper list
string_list_to_list = udf(lambda row: ast.literal_eval(row))
df = df.withColumn('list',string_list_to_list(col('list')))
df.show()
+--------------------+---+
| list| A|
+--------------------+---+
|[23, 4, 77, 890.455]| 10|
| [11, 2, 50, 1.11]| 20|
| [10.05, 1, 22.04]| 30|
+--------------------+---+
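The UDF above simply wraps ast.literal_eval, which safely parses the string into a real Python list; in plain Python the per-row conversion looks like this (using one of the example values from the DataFrame above):

```python
import ast

s = u'[23,4,77,890.455]'

# literal_eval parses Python literals (lists, numbers, ...) from a string.
parsed = ast.literal_eval(s)

print(parsed)       # [23, 4, 77, 890.455]
print(len(parsed))  # 4
```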
Extension of the question, as asked by the OP:
# Creating a UDF to find length of resulting list.
length_list = udf(lambda row: len(row))
df = df.withColumn('length_list',length_list(col('list')))
df.show()
+--------------------+---+-----------+
| list| A|length_list|
+--------------------+---+-----------+
|[23, 4, 77, 890.455]| 10| 4|
| [11, 2, 50, 1.11]| 20| 4|
| [10.05, 1, 22.04]| 30| 3|
+--------------------+---+-----------+
Upvotes: 2
Reputation: 1
Please use the code below to ignore non-ASCII (unicode) characters:
df.rdd.map(lambda x: x.encode("ascii","ignore"))
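Note that encode("ascii", "ignore") works on individual strings, not on Row objects, which is why the OP's earlier df.rdd.map attempt raised "rdd has no attribute encode"; you would need to pull the string field out of the row first (the field name 'X' below is taken from the question). In plain Python the call itself behaves like this:

```python
s = u'[23,4,77,890,455]'

# Drops any characters outside ASCII and returns bytes (Python 3).
cleaned = s.encode('ascii', 'ignore')

print(cleaned)  # b'[23,4,77,890,455]'

# In Spark, the map would have to target the column value, e.g.:
# df.rdd.map(lambda row: row['X'].encode('ascii', 'ignore'))
```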
Upvotes: -2