NITS

Reshape a pandas Series

I have a dataframe that has a column with values like below -

[[3. , 2., 1.],[3. , 1., 2.]]

I read this value and pass it to a UDF as a pandas Series. The values of the series look like this, where type(s) is <class 'pandas.core.series.Series'>:

s.values = [array([array([3. , 2., 1.]),
       array([3. , 1., 2.])], dtype=object)]

Its shape shows as (1,). I want it to have shape 1 x 2 x 3, but both of the attempts below raise errors:

#gives error - ValueError: cannot reshape array of size 1 into shape (1,2,3)
s.values.reshape(1,2,3)

#gives error - ValueError: cannot reshape array of size 2 into shape (1,2,3)
s_array = np.array([s.tolist()])
s_array.reshape(1,2,3)

Added: below is sample code where I need to do the reshape. It does not run to completion, but executing it gives an idea of the problem.


import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import pandas_udf

spark = (
    SparkSession
    .builder
    .config("spark.sql.execution.arrow.enabled", "true")
    .getOrCreate()
    )

l = [['s1',[[3. , 2., 1.],[3. , 1., 2.]]], ['s2',[[4. , 2., 1.],[4. , 1., 2.]]]]
df = pd.DataFrame(l, columns = ['name','lst']) 

sparkDF =  spark.createDataFrame(df)

S_TYPE = ArrayType(ArrayType(DoubleType()))
def test(s):
   s_array = np.array([s.tolist()])
   # s_array.shape is (1, 1, 2) here, so the reshape below raises
   # ValueError: cannot reshape array of size 2 into shape (1,2,3)
   s_array.reshape(1,2,3)
   return s

test_udf = pandas_udf(test, S_TYPE)

df1 = sparkDF.withColumn("output", test_udf(sparkDF.lst))

I think I might have to flatten the values and then reshape. Any ideas how to achieve that? Thanks.

Answers (1)

hpaulj

Working with just the pandas part of your code:

In [138]: l = [['s1',[[3. , 2., 1.],[3. , 1., 2.]]], ['s2',[[4. , 2., 1.],[4. , 1., 2.]]]]           
In [139]: df = pd.DataFrame(l, columns = ['name','lst'])                                             
In [140]: df                                                                                         
Out[140]: 
  name                                 lst
0   s1  [[3.0, 2.0, 1.0], [3.0, 1.0, 2.0]]
1   s2  [[4.0, 2.0, 1.0], [4.0, 1.0, 2.0]]

A Series with 2 elements:

In [141]: df['lst']                                                                                  
Out[141]: 
0    [[3.0, 2.0, 1.0], [3.0, 1.0, 2.0]]
1    [[4.0, 2.0, 1.0], [4.0, 1.0, 2.0]]
Name: lst, dtype: object

to_numpy makes a 2 element object dtype array; one element per element of the Series:

In [142]: df['lst'].to_numpy()                                                                       
Out[142]: 
array([list([[3.0, 2.0, 1.0], [3.0, 1.0, 2.0]]),
       list([[4.0, 2.0, 1.0], [4.0, 1.0, 2.0]])], dtype=object)
In [143]: _.shape                                                                                    
Out[143]: (2,)

Or we can make a nested list from the Series:

In [144]: df['lst'].to_list()                                                                        
Out[144]: [[[3.0, 2.0, 1.0], [3.0, 1.0, 2.0]], [[4.0, 2.0, 1.0], [4.0, 1.0, 2.0]]]

Making an array from that list is easy (especially if the nesting of the sublists is all the same):

In [145]: np.array(df['lst'].to_list())                                                              
Out[145]: 
array([[[3., 2., 1.],
        [3., 1., 2.]],

       [[4., 2., 1.],
        [4., 1., 2.]]])
In [146]: _.shape                                                                                    
Out[146]: (2, 2, 3)

The to_numpy array, being 1d, can also be stacked:

In [147]: np.stack(df['lst'].to_numpy())                                                             
Out[147]: 
array([[[3., 2., 1.],
        [3., 1., 2.]],

       [[4., 2., 1.],
        [4., 1., 2.]]])

np.stack is a version of concatenate that joins the lists (or lists made into arrays) on a new axis. By default it behaves a lot like np.array; here it is better at 'flattening' the nesting.
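As a rough illustration of the "new axis" point, the sketch below rebuilds the same df and compares np.array on the nested list with np.stack on the 1d object array. The names cells, a, b and c are mine, and axis=-1 is only there to show that the new axis can go somewhere other than the front.

```python
import numpy as np
import pandas as pd

l = [['s1', [[3., 2., 1.], [3., 1., 2.]]],
     ['s2', [[4., 2., 1.], [4., 1., 2.]]]]
df = pd.DataFrame(l, columns=['name', 'lst'])

cells = df['lst'].to_numpy()        # 1d object array, one cell per row
a = np.array(df['lst'].to_list())   # nested list -> (2, 2, 3) float array
b = np.stack(cells)                 # default axis=0: same result as `a`
c = np.stack(cells, axis=-1)        # new axis last: shape (2, 3, 2)

print(a.shape, b.shape, c.shape)
print(np.array_equal(a, b))
```

With axis=-1 the Series index becomes the last dimension, so c[i, j, k] is element [i][j] of row k.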

Most of this would also work if l contained arrays instead of nested lists.

other

To make something closer to your initial s.values:

In [174]: alist = [np.empty(2, object)]                                                              
In [175]: alist[0][:] = [np.array([3,2,1]),np.array([3,1,2])]                                        
In [176]: alist                                                                                      
Out[176]: [array([array([3, 2, 1]), array([3, 1, 2])], dtype=object)]

stack of the list doesn't change much (just makes a (1,2) array):

In [177]: np.stack(alist)                                                                            
Out[177]: array([[array([3, 2, 1]), array([3, 1, 2])]], dtype=object)

but a stack of that one element of the list gives a 2d numeric array:

In [178]: np.stack(alist[0])                                                                         
Out[178]: 
array([[3, 2, 1],
       [3, 1, 2]])

Sometimes, if the nesting of lists and arrays is complicated, we have to try several things. Pay close attention to the distinction between list and array, and to the len and/or shape at each level.
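Applied to the layout shown for s.values in the question (a length-1 container whose single element is an object array of two float arrays), the same idea might look like the sketch below. The structure is rebuilt by hand with numpy rather than taken from Spark, so it is an assumption about what the UDF actually receives, and the names inner, values and arr are only illustrative.

```python
import numpy as np

# Rebuild something shaped like the question's s.values:
# a 1-element object array whose element is an object array of two float arrays.
inner = np.empty(2, dtype=object)
inner[0] = np.array([3., 2., 1.])
inner[1] = np.array([3., 1., 2.])
values = np.empty(1, dtype=object)
values[0] = inner

print(values.shape)  # (1,)

# Stack each element first (two (3,) arrays -> one (2, 3) array),
# then stack those results on a new leading axis.
arr = np.stack([np.stack(v) for v in values])
print(arr.shape, arr.dtype)  # (1, 2, 3) float64
```

A plain reshape cannot do this, because reshape only rearranges the one object element of the outer array; it never looks inside it.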

edit

Let's look at how the initial shape of an object array affects the 'stack' unpacking.

In [278]: df                                                                                         
Out[278]: 
  name                                 lst
0   s1  [[3.0, 2.0, 1.0], [3.0, 1.0, 2.0]]
1   s2  [[4.0, 2.0, 1.0], [4.0, 1.0, 2.0]]

If I select a dataframe column by name I get a Series:

In [279]: df['lst']                                                                                  
Out[279]: 
0    [[3.0, 2.0, 1.0], [3.0, 1.0, 2.0]]
1    [[4.0, 2.0, 1.0], [4.0, 1.0, 2.0]]
Name: lst, dtype: object

The numpy rendition is a 1d array:

In [280]: df['lst'].to_numpy()                                                                       
Out[280]: 
array([list([array([3., 2., 1.]), array([3., 1., 2.])]),
       array([[4., 2., 1.],
       [4., 1., 2.]])], dtype=object)
In [281]: _.shape                                                                                    
Out[281]: (2,)

If instead I select a column by list, I get a dataframe:

In [282]: df[['lst']]                                                                                
Out[282]: 
                                  lst
0  [[3.0, 2.0, 1.0], [3.0, 1.0, 2.0]]
1  [[4.0, 2.0, 1.0], [4.0, 1.0, 2.0]]

Its numpy rendition is 2d:

In [283]: df[['lst']].to_numpy()                                                                     
Out[283]: 
array([[list([array([3., 2., 1.]), array([3., 1., 2.])])],
       [array([[4., 2., 1.],
       [4., 1., 2.]])]], dtype=object)
In [284]: _.shape                                                                                    
Out[284]: (2, 1)

stack of the 1d array unpacks it and creates a 3d array - one dimension from the outer array, and two from the inner ones:

In [285]: np.stack(_280)                                                                             
Out[285]: 
array([[[3., 2., 1.],
        [3., 1., 2.]],

       [[4., 2., 1.],
        [4., 1., 2.]]])

but stack of the 2d array doesn't change anything:

In [286]: np.stack(_283)                                                                             
Out[286]: 
array([[list([array([3., 2., 1.]), array([3., 1., 2.])])],
       [array([[4., 2., 1.],
       [4., 1., 2.]])]], dtype=object)

We have to first make it 1d, either with ravel, reshape, or indexing:

In [287]: np.stack(_283.ravel())                                                                     
Out[287]: 
array([[[3., 2., 1.],
        [3., 1., 2.]],

       [[4., 2., 1.],
        [4., 1., 2.]]])
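The three ways of getting to 1d before stacking can be compared side by side. This is a small sketch that builds a (2, 1) object array by hand instead of going through df[['lst']].to_numpy(); the variable names are mine.

```python
import numpy as np

# A (2, 1) object array whose cells hold nested lists.
cells = np.empty((2, 1), dtype=object)
cells[0, 0] = [[3., 2., 1.], [3., 1., 2.]]
cells[1, 0] = [[4., 2., 1.], [4., 1., 2.]]

unchanged = np.stack(cells)                  # still (2, 1), dtype object
via_ravel = np.stack(cells.ravel())          # (2, 2, 3) float
via_reshape = np.stack(cells.reshape(-1))    # (2, 2, 3) float
via_index = np.stack(cells[:, 0])            # (2, 2, 3) float

print(unchanged.shape, unchanged.dtype)
print(via_ravel.shape, via_ravel.dtype)
```

All three 1d routes give the same numeric array; which one reads best is mostly a matter of taste.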

I haven't followed your code in enough detail to say exactly what's going on, but hopefully this gives you an idea of what to watch out for. You need a clear idea of the shape and dtype of an array, and the same for any nested arrays.
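One way to get that picture is a small helper that reports the type, len or shape, and dtype at each level of nesting. describe_nesting is a hypothetical name, not a numpy or pandas function, and this sketch only follows the first element at each level, so it assumes the nesting is uniform.

```python
import numpy as np

def describe_nesting(obj, depth=0, max_depth=4):
    """Return one line per nesting level, following only the first element."""
    pad = "  " * depth
    lines = []
    if isinstance(obj, np.ndarray):
        lines.append(f"{pad}ndarray shape={obj.shape} dtype={obj.dtype}")
        # Only object arrays can hide further lists or arrays inside them.
        if obj.dtype == object and obj.size and depth < max_depth:
            lines += describe_nesting(obj.flat[0], depth + 1, max_depth)
    elif isinstance(obj, (list, tuple)):
        lines.append(f"{pad}{type(obj).__name__} len={len(obj)}")
        if obj and depth < max_depth:
            lines += describe_nesting(obj[0], depth + 1, max_depth)
    else:
        lines.append(f"{pad}{type(obj).__name__}")
    return lines

# Same structure as the alist example: a list holding one object array.
inner = np.empty(2, dtype=object)
inner[0] = np.array([3., 2., 1.])
inner[1] = np.array([3., 1., 2.])
alist = [inner]

print("\n".join(describe_nesting(alist)))
# list len=1
#   ndarray shape=(2,) dtype=object
#     ndarray shape=(3,) dtype=float64
```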
