Reputation: 237
I have a dataframe that has a column with values like below -
[[3. , 2., 1.],[3. , 1., 2.]]
I am reading this value and passing it to a udf as a pandas Series. Below is what the values of the series look like, where the type of s is <class 'pandas.core.series.Series'>
s.values = [array([array([3. , 2., 1.]),
array([3. , 1., 2.])], dtype=object)]
The shape of this shows as (1,). I want it to be of shape 1 x 2 x 3, but both of the ways I tried below give errors -
#gives error - ValueError: cannot reshape array of size 1 into shape (1,2,3)
s.values.reshape(1,2,3)
#gives error - ValueError: cannot reshape array of size 2 into shape (1,2,3)
s_array = np.array([s.tolist()])
s_array.reshape(1,2,3)
Added below is sample code where I need to reshape. It doesn't work completely, but executing it will give an idea of the problem.
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import pandas_udf
spark = (
    SparkSession
    .builder
    .config("spark.sql.execution.arrow.enabled", "true")
    .getOrCreate()
)
l = [['s1',[[3. , 2., 1.],[3. , 1., 2.]]], ['s2',[[4. , 2., 1.],[4. , 1., 2.]]]]
df = pd.DataFrame(l, columns = ['name','lst'])
sparkDF = spark.createDataFrame(df)
S_TYPE = ArrayType(ArrayType(DoubleType()))
def test(s):
    s_array = np.array([s.tolist()])
    #s_array.shape = (1, 1, 2)
    #ValueError: cannot reshape array of size 2 into shape (1,2,3)
    s_array.reshape(1,2,3)
    return s
test_udf = pandas_udf(test, S_TYPE)
df1 = sparkDF.withColumn("output", test_udf(sparkDF.lst))
I think I might have to flatten the values and then reshape. Any ideas how to achieve that? Thanks.
Upvotes: 0
Views: 1800
Reputation: 231530
Working with just the pandas part of your code:
In [138]: l = [['s1',[[3. , 2., 1.],[3. , 1., 2.]]], ['s2',[[4. , 2., 1.],[4. , 1., 2.]]]]
In [139]: df = pd.DataFrame(l, columns = ['name','lst'])
In [140]: df
Out[140]:
  name                                 lst
0   s1  [[3.0, 2.0, 1.0], [3.0, 1.0, 2.0]]
1   s2  [[4.0, 2.0, 1.0], [4.0, 1.0, 2.0]]
A Series with 2 elements:
In [141]: df['lst']
Out[141]:
0 [[3.0, 2.0, 1.0], [3.0, 1.0, 2.0]]
1 [[4.0, 2.0, 1.0], [4.0, 1.0, 2.0]]
Name: lst, dtype: object
to_numpy makes a 2 element object dtype array; one element per element of the Series:
In [142]: df['lst'].to_numpy()
Out[142]:
array([list([[3.0, 2.0, 1.0], [3.0, 1.0, 2.0]]),
       list([[4.0, 2.0, 1.0], [4.0, 1.0, 2.0]])], dtype=object)
In [143]: _.shape
Out[143]: (2,)
Or we can make a nested list from the Series:
In [144]: df['lst'].to_list()
Out[144]: [[[3.0, 2.0, 1.0], [3.0, 1.0, 2.0]], [[4.0, 2.0, 1.0], [4.0, 1.0, 2.0]]]
Making an array from that list is easy (especially if the nesting of the sublists is all the same):
In [145]: np.array(df['lst'].to_list())
Out[145]:
array([[[3., 2., 1.],
        [3., 1., 2.]],

       [[4., 2., 1.],
        [4., 1., 2.]]])
In [146]: _.shape
Out[146]: (2, 2, 3)
The to_numpy array, being 1d, can also be passed to stack:
In [147]: np.stack(df['lst'].to_numpy())
Out[147]:
array([[[3., 2., 1.],
        [3., 1., 2.]],

       [[4., 2., 1.],
        [4., 1., 2.]]])
np.stack is a version of concatenate that joins the lists (or lists made into arrays) on a new axis. By default it behaves a lot like np.array; here it is better at 'flattening' the nesting, because it converts each element of the object array separately before joining them.
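To make that difference concrete, here is a minimal sketch that rebuilds the same frame and compares the conversions side by side. The variable names (obj_arr, rewrapped, from_list, stacked) are only illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd

l = [['s1', [[3., 2., 1.], [3., 1., 2.]]],
     ['s2', [[4., 2., 1.], [4., 1., 2.]]]]
df = pd.DataFrame(l, columns=['name', 'lst'])

# 1d object array: one nested list per row of the Series
obj_arr = df['lst'].to_numpy()

# np.array on something that is already an array does not unpack its elements
rewrapped = np.array(obj_arr)

# np.array on a plain nested list builds a numeric array directly
from_list = np.array(df['lst'].to_list())

# np.stack converts each element to an array, then joins them on a new axis
stacked = np.stack(obj_arr)

print(obj_arr.shape, rewrapped.shape, from_list.shape, stacked.shape)
```

The two object arrays stay 1d with two elements each, while both numeric results are 3d.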
Most of this also works if l contains arrays instead of nested lists.
To make something closer to your initial s.values:
In [174]: alist = [np.empty(2, object)]
In [175]: alist[0][:] = [np.array([3,2,1]),np.array([3,1,2])]
In [176]: alist
Out[176]: [array([array([3, 2, 1]), array([3, 1, 2])], dtype=object)]
stack of the list doesn't change much (just makes a (1,2) array):
In [177]: np.stack(alist)
Out[177]: array([[array([3, 2, 1]), array([3, 1, 2])]], dtype=object)
but a stack of that one element in the list gives a 2d numeric array:
In [178]: np.stack(alist[0])
Out[178]:
array([[3, 2, 1],
       [3, 1, 2]])
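Putting those two observations together, one way to get the (1, 2, 3) array asked for is to stack each row first and then stack the rows. This is only a sketch: the nested object arrays are rebuilt by hand from the s.values printed in the question, which is an assumption about what the udf actually receives, and the names inner, values, rows and result are made up for the example.

```python
import numpy as np

# Assumed structure, copied from the printed s.values in the question:
# a (1,) object array holding a (2,) object array holding two (3,) float arrays.
inner = np.empty(2, dtype=object)
inner[0] = np.array([3., 2., 1.])
inner[1] = np.array([3., 1., 2.])
values = np.empty(1, dtype=object)
values[0] = inner

# Stack the innermost arrays of each row into a (2, 3) numeric array ...
rows = [np.stack(row) for row in values]
# ... then stack the rows on a new leading axis.
result = np.stack(rows)

print(result.shape, result.dtype)
```

A plain reshape cannot do this, because the outer array really has only one element; the nesting has to be unpacked level by level.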
Sometimes, if the nesting of lists and arrays is complicated, we have to try several things. Pay close attention to the distinction between list and array, and to the len and/or shape at each level.
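A small helper can make that inspection less tedious. The describe function below is hypothetical (it is not a numpy or pandas function); it walks down the first element at each level and reports the container type with its len or shape. It assumes all elements at a level look like the first one.

```python
import numpy as np

def describe(obj, depth=0, max_depth=4):
    """Return one line per nesting level, following the first element down."""
    pad = "  " * depth
    lines = []
    if isinstance(obj, np.ndarray):
        lines.append(f"{pad}ndarray shape={obj.shape} dtype={obj.dtype}")
        # Only object arrays can hide further containers inside them
        if obj.dtype == object and obj.size and depth < max_depth:
            lines.extend(describe(obj.flat[0], depth + 1, max_depth))
    elif isinstance(obj, (list, tuple)):
        lines.append(f"{pad}{type(obj).__name__} len={len(obj)}")
        if obj and depth < max_depth:
            lines.extend(describe(obj[0], depth + 1, max_depth))
    else:
        lines.append(f"{pad}{type(obj).__name__}")
    return lines

# Example input shaped like the s.values shown in the question (assumed structure)
inner = np.empty(2, dtype=object)
inner[0] = np.array([3., 2., 1.])
inner[1] = np.array([3., 1., 2.])
values = np.empty(1, dtype=object)
values[0] = inner

report = describe(values)
print("\n".join(report))
```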
Let's look at how the initial shape of an object array affects the 'stack' unpacking.
In [278]: df
Out[278]:
  name                                 lst
0   s1  [[3.0, 2.0, 1.0], [3.0, 1.0, 2.0]]
1   s2  [[4.0, 2.0, 1.0], [4.0, 1.0, 2.0]]
If I select a dataframe column by name I get a Series:
In [279]: df['lst']
Out[279]:
0 [[3.0, 2.0, 1.0], [3.0, 1.0, 2.0]]
1 [[4.0, 2.0, 1.0], [4.0, 1.0, 2.0]]
Name: lst, dtype: object
The numpy rendition is a 1d array:
In [280]: df['lst'].to_numpy()
Out[280]:
array([list([array([3., 2., 1.]), array([3., 1., 2.])]),
       array([[4., 2., 1.],
       [4., 1., 2.]])], dtype=object)
In [281]: _.shape
Out[281]: (2,)
If instead I select a column by list, I get a dataframe:
In [282]: df[['lst']]
Out[282]:
                                  lst
0  [[3.0, 2.0, 1.0], [3.0, 1.0, 2.0]]
1  [[4.0, 2.0, 1.0], [4.0, 1.0, 2.0]]
The numpy rendition of this is 2d:
In [283]: df[['lst']].to_numpy()
Out[283]:
array([[list([array([3., 2., 1.]), array([3., 1., 2.])])],
       [array([[4., 2., 1.],
       [4., 1., 2.]])]], dtype=object)
In [284]: _.shape
Out[284]: (2, 1)
stack of the 1d array unpacks it and creates a 3d array - one dimension from the outer array, and two from the inner ones:
In [285]: np.stack(_280)
Out[285]:
array([[[3., 2., 1.],
        [3., 1., 2.]],

       [[4., 2., 1.],
        [4., 1., 2.]]])
but stack of the 2d array doesn't change anything:
In [286]: np.stack(_283)
Out[286]:
array([[list([array([3., 2., 1.]), array([3., 1., 2.])])],
       [array([[4., 2., 1.],
       [4., 1., 2.]])]], dtype=object)
We have to first make it 1d, either with ravel, reshape, or indexing:
In [287]: np.stack(_283.ravel())
Out[287]:
array([[[3., 2., 1.],
        [3., 1., 2.]],

       [[4., 2., 1.],
        [4., 1., 2.]]])
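The three ways of making it 1d should be interchangeable here. As a rough check, the sketch below builds a (2, 1) object array by hand (standing in for df[['lst']].to_numpy(), which is an assumption made to keep the example free of pandas) and stacks it after ravel, reshape and indexing.

```python
import numpy as np

# Hand-built stand-in for the (2, 1) object array from a one-column dataframe
col2d = np.empty((2, 1), dtype=object)
col2d[0, 0] = [[3., 2., 1.], [3., 1., 2.]]
col2d[1, 0] = [[4., 2., 1.], [4., 1., 2.]]

unchanged = np.stack(col2d)               # rows are still object arrays
via_ravel = np.stack(col2d.ravel())       # flatten to 1d, then unpack
via_reshape = np.stack(col2d.reshape(-1))
via_index = np.stack(col2d[:, 0])         # pick the single column

print(unchanged.shape, via_ravel.shape, via_reshape.shape, via_index.shape)
```

Only the 1d versions give stack something it can unpack into a numeric 3d array.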
I haven't followed your code in enough detail to say exactly what's going on, but hopefully this gives you an idea of what to watch out for. You need a clear idea of the shape and dtype of an array, and same for any nested arrays.
Upvotes: 1