Reputation: 3782
I have the following initial PySpark DataFrame:
+----------+--------------------------------+
|product_PK|                        products|
+----------+--------------------------------+
|       686|          [[686,520.70],[645,2]]|
|       685|[[685,45.556],[678,23],[655,21]]|
|       693|                              []|
+----------+--------------------------------+
df = sqlCtx.createDataFrame(
    [(686, [[686, 520.70], [645, 2]]),
     (685, [[685, 45.556], [678, 23], [655, 21]]),
     (693, [])],
    ["product_PK", "products"]
)
The column products contains nested data. I need to extract the second value in each pair of values. I am running this code:
temp_dataframe = df.withColumn("exploded", explode(col("products"))) \
                   .withColumn("score", col("exploded").getItem("_2"))
It works well with a particular DataFrame. However, I want to put this code into a function and run it on different DataFrames. All of my DataFrames have the same structure. The only difference is that the sub-column "_2" might be named differently in some DataFrames, e.g. "col1" or "col2".
For example:
DataFrame content
root
|-- product_PK: long (nullable = true)
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: long (nullable = true)
| | |-- _2: double (nullable = true)
|-- exploded: struct (nullable = true)
| |-- _1: long (nullable = true)
| |-- _2: double (nullable = true)
DataFrame content
root
|-- product_PK: long (nullable = true)
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- product_PK: long (nullable = true)
| | |-- col2: integer (nullable = true)
|-- exploded: struct (nullable = true)
| |-- product_PK: long (nullable = true)
| |-- col2: integer (nullable = true)
I tried to use an index like getItem(1), but it says that the name of a column must be provided.
Is there any way to avoid specifying the column name, or to somehow generalize this part of the code?
My goal is for exploded to contain the second value of each pair in the nested data, i.e. _2, col1 or col2.
Upvotes: 4
Views: 6696
Reputation: 41957
Given that your exploded column is a struct, as in
|-- exploded: struct (nullable = true)
| |-- _1: integer (nullable = true)
| |-- col2: float (nullable = true)
you can use the following logic to get the second element without knowing its name:
from pyspark.sql import functions as F
temp_dataframe = df.withColumn("exploded", F.explode(F.col("products")))
temp_dataframe.withColumn("score", F.col("exploded." + temp_dataframe.select(F.col("exploded.*")).columns[1])) \
    .show(truncate=False)
You should get the following output:
+----------+--------------------------------------+------------+------+
|product_PK|products |exploded |score |
+----------+--------------------------------------+------------+------+
|686 |[[686,520.7], [645,2.0]] |[686,520.7] |520.7 |
|686 |[[686,520.7], [645,2.0]] |[645,2.0] |2.0 |
|685 |[[685,45.556], [678,23.0], [655,21.0]]|[685,45.556]|45.556|
|685 |[[685,45.556], [678,23.0], [655,21.0]]|[678,23.0] |23.0 |
|685 |[[685,45.556], [678,23.0], [655,21.0]]|[655,21.0] |21.0 |
+----------+--------------------------------------+------------+------+
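If you need this for several DataFrames, the same idea can be wrapped in a small helper, for example (a minimal sketch; extract_score and value_pos are just assumed names, with the position of the value inside the struct taken as a parameter):
from pyspark.sql import functions as F

def extract_score(dataframe, value_pos=1):
    # Sketch: explode "products", then look up the name of the struct field
    # at position `value_pos` instead of hard-coding "_2"/"col1"/"col2"
    exploded = dataframe.withColumn("exploded", F.explode(F.col("products")))
    field_name = exploded.select(F.col("exploded.*")).columns[value_pos]
    return exploded.withColumn("score", F.col("exploded." + field_name))
e.g. extract_score(df).show(truncate=False) should print the same table as above.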
Upvotes: 0
Reputation: 210842
Is that what you want?
>>> df.show(10, False)
+----------+-----------------------------------------------------------------------+
|product_PK|products |
+----------+-----------------------------------------------------------------------+
|686 |[WrappedArray(686, null), WrappedArray(645, 2)] |
|685 |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|
|693 |[] |
+----------+-----------------------------------------------------------------------+
>>> import pyspark.sql.functions as F
>>> df.withColumn("exploded", F.explode("products")) \
... .withColumn("exploded", F.col("exploded").getItem(1)) \
... .show(10,False)
+----------+-----------------------------------------------------------------------+--------+
|product_PK|products |exploded|
+----------+-----------------------------------------------------------------------+--------+
|686 |[WrappedArray(686, null), WrappedArray(645, 2)] |null |
|686 |[WrappedArray(686, null), WrappedArray(645, 2)] |2 |
|685 |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|null |
|685 |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|23 |
|685 |[WrappedArray(685, null), WrappedArray(678, 23), WrappedArray(655, 21)]|21 |
+----------+-----------------------------------------------------------------------+--------+
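For completeness, getItem(1) works positionally here because products ended up as an array of arrays rather than an array of structs. A minimal sketch that reproduces such a frame (the schema below is an assumption chosen to match the WrappedArray output above):
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, LongType, StructField, StructType

# Assumed schema: each product entry is itself an array of longs,
# so its elements are addressed by position rather than by field name
schema = StructType([
    StructField("product_PK", LongType()),
    StructField("products", ArrayType(ArrayType(LongType())))
])
df = sqlCtx.createDataFrame(
    [(686, [[686, None], [645, 2]]),
     (685, [[685, None], [678, 23], [655, 21]]),
     (693, [])],
    schema
)
df.withColumn("exploded", F.explode("products")) \
  .withColumn("exploded", F.col("exploded").getItem(1)) \
  .show(10, False)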
Upvotes: 1
Reputation: 8513
It sounds like you were on the right track. I think the way to accomplish this is to read the schema to determine the name of the field you want to explode on. Instead of schema.names, though, you need to use schema.fields to find the struct field, and then use its properties to figure out the fields in the struct. Here is an example:
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Setup the test dataframe
data = [
    (686, [(686, 520.70), (645, 2.)]),
    (685, [(685, 45.556), (678, 23.), (655, 21.)]),
    (693, [])
]
schema = StructType([
    StructField("product_PK", StringType()),
    StructField("products",
        ArrayType(StructType([
            StructField("_1", IntegerType()),
            StructField("col2", FloatType())
        ]))
    )
])
df = sqlCtx.createDataFrame(data, schema)
# Find the products field in the schema, then find the name of the 2nd field
productsField = next(f for f in df.schema.fields if f.name == 'products')
target_field = productsField.dataType.elementType.names[1]
# Do your explode using the field name
temp_dataframe = df.withColumn("exploded", explode(col("products"))) \
                   .withColumn("score", col("exploded").getItem(target_field))
Now, if you examine the result, you get this:
>>> temp_dataframe.printSchema()
root
|-- product_PK: string (nullable = true)
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = true)
| | |-- col2: float (nullable = true)
|-- exploded: struct (nullable = true)
| |-- _1: integer (nullable = true)
| |-- col2: float (nullable = true)
|-- score: float (nullable = true)
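If you want to reuse this across DataFrames whose second sub-column is named differently, the same schema lookup can be folded into a small helper, e.g. (a sketch; explode_second_value, array_col and index are just placeholder names):
from pyspark.sql.functions import col, explode

def explode_second_value(dataframe, array_col="products", index=1):
    # Sketch: read the struct field name at `index` from the array column's
    # element type, then explode and extract it by that name
    array_field = next(f for f in dataframe.schema.fields if f.name == array_col)
    target_field = array_field.dataType.elementType.names[index]
    return (dataframe
            .withColumn("exploded", explode(col(array_col)))
            .withColumn("score", col("exploded").getItem(target_field)))
Calling explode_second_value(df) should give the same schema as printed above, regardless of whether the second field is named _2, col1 or col2.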
Upvotes: 2