Reputation: 1

Creating columns in a PySpark Dataframe having names of the elements in a list

I have a list of around 17k strings. For each element in the list, I need to add a new column to a dataframe, named after that element and holding the integer 0 in every row.

How do I do this?

Example list

['V1045','71752','31231']

Format required:

ID    V1045   71752    31231
1     0       0        0
2     0       0        0
3     0       0        0
4     0       0        0

The dataframe has around 700,000 rows.

Upvotes: 0

Answers (2)

pault

Reputation: 43494

If you already have a dataframe, the easiest way to add columns is withColumn(). You can set the value 0 in every row using pyspark.sql.functions.lit().

For example:

import pyspark.sql.functions as f

l = ['V1045','71752','31231']
# add one literal-zero column per name in the list
for new_col in l:
    df = df.withColumn(new_col, f.lit(0))

df.show(n=5)
#+---+-----+-----+-----+
#| ID|V1045|71752|31231|
#+---+-----+-----+-----+
#|  0|    0|    0|    0|
#|  1|    0|    0|    0|
#|  2|    0|    0|    0|
#|  3|    0|    0|    0|
#|  4|    0|    0|    0|
#+---+-----+-----+-----+
#only showing top 5 rows

Remember that Spark is lazy, so these operations are not executed one at a time as the loop suggests. The physical plan shows all the new columns being added in a single projection:

df.explain()
#== Physical Plan ==
#*Project [ID#111L, 0 AS V1045#114, 0 AS 71752#118, 0 AS 31231#123]
#+- Scan ExistingRDD[ID#111L]

You probably shouldn't use sc.parallelize(range()), especially if you're using Python 2, as explained in this post.

Upvotes: 0

ernest_k

Reputation: 45309

You can easily generate that data:

This list will be used for column names:

l = ['ID', 'V1045','71752','31231']

Then a range with required indices is created, with static zeroes used as values:

# one row per index: the ID, then a zero for every remaining column name in l
df = sc.parallelize(range(700000))\
       .map(lambda i: [i] + [0] * (len(l) - 1))\
       .toDF(l)

Calling df.show(n=5) prints something like:

+---+-----+-----+-----+
| ID|V1045|71752|31231|
+---+-----+-----+-----+
|  0|    0|    0|    0|
|  1|    0|    0|    0|
|  2|    0|    0|    0|
|  3|    0|    0|    0|
|  4|    0|    0|    0|
+---+-----+-----+-----+
only showing top 5 rows

Upvotes: 1
