Reputation: 1323
I have the following source file. It contains a name, "john", that I want to split into the list ['j','o','h','n']. The person file is as follows.
Source File:
id,name,class,start_data,end_date
1,john,xii,20170909,20210909
Code:
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("PersonProcessing").getOrCreate()
    df = spark.read.csv('person.txt', header=True)
    nameList = [x['name'] for x in df.rdd.collect()]
    print(list(nameList))
    df.show()

if __name__ == '__main__':
    main()
Actual Output:
[u'john']
Desired Output:
['j','o','h','n']
Upvotes: 2
Views: 362
Reputation: 67
If you are doing this in Spark with Scala (Spark 2.3.1, Scala 2.11.8), the code below works. Splitting on the empty string produces an extra record with a blank name, hence the filter.
import org.apache.spark.sql.functions._
import spark.implicits._

val classDF = spark.sparkContext
  .parallelize(Seq((1, "John", "Xii", "20170909", "20210909")))
  .toDF("ID", "Name", "Class", "Start_Date", "End_Date")

classDF.withColumn("Name", explode(split(trim(col("Name")), "")))
  .withColumn("Start_Date", to_date(col("Start_Date"), "yyyyMMdd"))
  .withColumn("End_Date", to_date(col("End_Date"), "yyyyMMdd"))
  .filter(col("Name") =!= "")
  .show
Upvotes: 0
Reputation: 1497
If you want to do it in plain Python:
nameList = [c for x in df.rdd.collect() for c in x['name']]
or, if you want to do it in Spark:
from pyspark.sql import functions as F
df.withColumn('name', F.split(F.col('name'), '')).show()
Result:
+---+--------------+-----+----------+--------+
| id| name|class|start_data|end_date|
+---+--------------+-----+----------+--------+
| 1|[j, o, h, n, ]| xii| 20170909|20210909|
+---+--------------+-----+----------+--------+
Upvotes: 5
Reputation: 805
.tolist() turns a pandas Series into a Python list (this applies to a pandas DataFrame, e.g. after calling df.toPandas() on the Spark DataFrame), so create a list from the column first and loop over it:
namelist = df['name'].tolist()
for x in namelist:
    print(x)
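Once the names are in a plain Python list, getting the desired character list is a one-line comprehension. A minimal sketch, with the names hard-coded in place of the DataFrame lookup:

```python
# assumed: names already collected into a Python list,
# e.g. via namelist = df.toPandas()['name'].tolist()
namelist = ['john']

# iterating over a string yields its characters
chars = [c for name in namelist for c in name]
print(chars)  # ['j', 'o', 'h', 'n']
```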
Upvotes: 0