Morello

Reputation: 258

PySpark Dataframe extract column as an array

Is it possible to extract all of the values of a specific column into an array?

I want to be able to extract it and then reshape it as an array. Currently, the column I am trying to extract is a UDT (user-defined type).

I tried to use

my_array = df.select(df['my_col'])

but this is not correct, as it gives me a DataFrame rather than an array.

Upvotes: 1

Views: 3275

Answers (1)

greenie

Reputation: 444

collect_list() gives you an array of values.

A. If you want to collect all the values of one column (say c2) grouped by another column (say c1), you can group by c1 and collect the values of c2 with collect_list().

from pyspark.sql.functions import collect_list

df = spark.createDataFrame([
    ('emma', 'math'),
    ('emma', 'english'),
    ('mia', 'english'),
    ('mia', 'science'),
    ('mona', 'math'),
    ('mona', 'geography')
], ["student", "subject"])

# One row per student, with that student's subjects gathered into an array
df1 = df.groupBy('student').agg(collect_list('subject'))
df1.show()
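
With the sample data above, df1.show() prints something like this (the order of rows, and of values within each array, is not guaranteed):

+-------+---------------------+
|student|collect_list(subject)|
+-------+---------------------+
|   emma|      [math, english]|
|    mia|   [english, science]|
|   mona|    [math, geography]|
+-------+---------------------+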

B. If you want all values of c2 irrespective of any other column, you can group by a literal:

from pyspark.sql.functions import lit

# Grouping by a constant puts every row into a single group,
# so collect_list gathers the entire column into one array
df1 = df.groupBy(lit(1)).agg(collect_list('subject'))
df1.show()
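
To get the values back on the driver as a plain Python list, which you can then reshape, collect the single aggregated row. A minimal sketch, assuming the df above and that NumPy is available for the reshape step:

# The aggregation produces exactly one row; its first field
# holds the array of all 'subject' values
values = df1.collect()[0][0]

# Equivalent without collect_list: collect the column and flatten it
values = [row['subject'] for row in df.select('subject').collect()]

import numpy as np
arr = np.array(values).reshape(-1, 1)  # reshape however you need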

Upvotes: 2
