Reputation: 9892
I have a DataFrame with two columns, and calling df.collect() gives me the list below.
array = [Row(name=u'Alice', age=10), Row(name=u'Bob', age=15)]
Now I want to get an output list like the one below.
new_array = ['Alice', 'Bob']
Could anyone please tell me how to extract this output using PySpark? Any help would be appreciated.
Thanks
Upvotes: 1
Views: 7190
Reputation: 7585
# Creating the base dataframe.
values = [('Alice',10),('Bob',15)]
df = sqlContext.createDataFrame(values,['name','age'])
df.show()
+-----+---+
| name|age|
+-----+---+
|Alice| 10|
|  Bob| 15|
+-----+---+
df.collect()
[Row(name='Alice', age=10), Row(name='Bob', age=15)]
# Use list comprehensions to create a list.
new_list = [row.name for row in df.collect()]
print(new_list)
['Alice', 'Bob']
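Since a Row is a tuple subclass with named fields, the comprehension above can be tried out without a Spark session. This is only a minimal sketch: it assumes a collections.namedtuple as a stand-in for pyspark.sql.Row, and the names rows, names_by_attr and names_by_index are illustrative, not part of the original answer.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row, so the sketch runs without Spark.
Row = namedtuple("Row", ["name", "age"])

# Shaped like the result of df.collect() in the question.
rows = [Row(name="Alice", age=10), Row(name="Bob", age=15)]

# Attribute access, as in the list comprehension above.
names_by_attr = [row.name for row in rows]

# Positional access; "name" is the first field of each row.
names_by_index = [row[0] for row in rows]

print(names_by_attr)
print(names_by_index)
```

With a real DataFrame, calling df.select("name").collect() before the comprehension should avoid bringing the age column back to the driver.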
Upvotes: 3
Reputation: 1588
I see two columns, name and age, in the DataFrame. If you only want the name
column to be displayed, you can select it like this:
df.select("name").show()
This will show you only the names.
Tip: you can also use df.show() instead of df.collect(). It prints the data in tabular form instead of as Row(...) objects.
Upvotes: 0