Reputation: 350
I am trying to convert a PySpark DataFrame column with approximately 90 million rows into a NumPy array.
I need the array as an input for the scipy.optimize.minimize function.
I have tried both converting to pandas and using collect(), but these methods are very time consuming.
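Roughly, this is what those attempts look like (a simplified sketch of my code):
import numpy as np

# Attempt 1: pull the column to the driver via pandas, then take the values
arr = df.select("Adolescent").toPandas()["Adolescent"].values

# Attempt 2: collect() the rows to the driver and build the array from them
arr = np.array(df.select("Adolescent").collect()).reshape(-1)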
I am new to PySpark. If there is a faster or better approach to do this, please help.
Thanks.
This is what my dataframe looks like:
+----------+
|Adolescent|
+----------+
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
+----------+
Upvotes: 11
Views: 39302
Reputation: 10921
Another way is to convert the selected column to an RDD, flatten it by extracting the value of each Row (you can abuse .keys() for this), and then convert it to a NumPy array:
import numpy as np

x = df.select("colname").rdd.map(lambda r: r[0]).collect()  # python list
np.array(x)  # numpy array
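If you prefer the .keys() shortcut mentioned above, a minimal sketch (under the same assumptions: a DataFrame df with a single selected column colname) would be:
# RDD.keys() maps each element to element[0]; a Row supports positional
# indexing, so this extracts the single selected field of each Row
x = df.select("colname").rdd.keys().collect()  # python list
np.array(x)  # numpy array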
Upvotes: 1
Reputation: 7399
You will have to call .collect() one way or another. To create a NumPy array from the PySpark DataFrame, you can use:
import numpy as np

adoles = np.array(df.select("Adolescent").collect())  # .reshape(-1) for 1-D array
Alternatively, you can convert it to a pandas DataFrame using toPandas() and then convert that to a NumPy array using .values:
pdf = df.toPandas()
adoles = pdf["Adolescent"].values
Or simply:
adoles = df.select("Adolescent").toPandas().values  # .reshape(-1) for 1-D array
For distributed arrays, you can try Dask arrays.
I haven't tested this, but I assume it would work the same as NumPy (it might have inconsistencies):
import dask.array as da
adoles = da.array(df.select("Adolescent").collect())  # .reshape(-1) for 1-D array
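As an untested follow-up to the sketch above: a Dask array is lazy, so you would eventually call .compute() to materialize it as a regular NumPy array before handing it to scipy.optimize.minimize:
adoles_np = adoles.reshape(-1).compute()  # materializes the lazy Dask array as a 1-D numpy array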
Upvotes: 27