Reputation: 1
I have a list of around 17k string elements. I have to create new columns in a dataframe, one per element in the list, where each column is named after its element and filled with the integer 0. How do I do this?
Example list
['V1045','71752','31231']
Format required:
ID  V1045  71752  31231
1   0      0      0
2   0      0      0
3   0      0      0
4   0      0      0
The dataframe has around 700,000 rows.
Upvotes: 0
Views: 1029
Reputation: 43494
If you already have a dataframe, the easiest way to add columns is to use withColumn(). You can add the value 0 to every row using pyspark.sql.functions.lit().
For example:
import pyspark.sql.functions as f

l = ['V1045', '71752', '31231']

# add one literal-zero column per name in the list
for new_col in l:
    df = df.withColumn(new_col, f.lit(0))
df.show(n=5)
#+---+-----+-----+-----+
#| ID|V1045|71752|31231|
#+---+-----+-----+-----+
#| 0| 0| 0| 0|
#| 1| 0| 0| 0|
#| 2| 0| 0| 0|
#| 3| 0| 0| 0|
#| 4| 0| 0| 0|
#+---+-----+-----+-----+
#only showing top 5 rows
Remember that Spark is lazy, so these operations are not actually executed one loop iteration at a time; they collapse into a single projection:
df.explain()
#== Physical Plan ==
#*Project [ID#111L, 0 AS V1045#114, 0 AS 71752#118, 0 AS 31231#123]
#+- Scan ExistingRDD[ID#111L]
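That said, with all ~17k columns, chaining that many withColumn() calls can produce a very large plan that is slow to analyze. A single select() builds the same projection in one step (a minimal sketch, assuming the same df and list l as above):

import pyspark.sql.functions as f

# one projection: keep the existing columns and append a lit(0) per name
df = df.select('*', *[f.lit(0).alias(c) for c in l])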
You probably shouldn't use sc.parallelize(range()), especially if you're using Python 2, as explained in this post.
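If you don't already have a dataframe and just need the ID column, spark.range() avoids sc.parallelize(range()) entirely (a sketch, assuming a SparkSession named spark):

import pyspark.sql.functions as f

l = ['V1045', '71752', '31231']

# spark.range() produces a single-column DataFrame named "id"
df = spark.range(700000).withColumnRenamed('id', 'ID')
df = df.select('*', *[f.lit(0).alias(c) for c in l])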
Upvotes: 0
Reputation: 45309
You can easily generate that data:
This list will be used for column names:
l = ['ID', 'V1045','71752','31231']
Then a range with the required indices is created, with static zeroes used as values:
# each row is its index followed by a zero for every value column
df = sc.parallelize(range(700000))\
    .map(lambda i: [i, 0, 0, 0])\
    .toDF(l)
When you call .show(), it returns something like:
+---+-----+-----+-----+
| ID|V1045|71752|31231|
+---+-----+-----+-----+
| 0| 0| 0| 0|
| 1| 0| 0| 0|
| 2| 0| 0| 0|
| 3| 0| 0| 0|
| 4| 0| 0| 0|
+---+-----+-----+-----+
only showing top 5 rows
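The hard-coded [i, 0, 0, 0] only covers three value columns; for all ~17k names in the question, the row width can be derived from the list instead (a sketch, reusing the same l and sc):

# one zero per non-ID column keeps the row width in sync with the header
df = sc.parallelize(range(700000))\
    .map(lambda i: [i] + [0] * (len(l) - 1))\
    .toDF(l)

Note that shipping 17k Python ints per row through parallelize and map is heavy; building the frame with literal columns, as in the other answer, will generally be cheaper.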
Upvotes: 1