Jay Shankar Gupta

Reputation: 6088

Spark SQL Select Distinct record from hive map data type column

I have a Hive table with a column of map type, and I get errors while running the Spark SQL queries below:

Table Structure:

Column_name         datatype
----------------------------------------
name                string
details             map<string,string>

df = spark.sql("""select distinct name, details from table_name""")

AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column details is map<string,string>;

df = spark.sql("""select name, details 
                  from table_name
                  group by name, details""")

AnalysisException: expression table_name.details cannot be used as a grouping expression because its data type map<string,string> is not an orderable data type.;

df = spark.sql("""
            WITH cte_row_num AS (
                SELECT name
                       ,details
                       ,ROW_NUMBER() OVER (
                              PARTITION BY name
                                          ,details 
                              ORDER BY name) as row_num 
                FROM table_name) 
            SELECT name
                  ,details 
            FROM cte_row_num 
            WHERE row_num = 1
           """)

java.lang.IllegalStateException: grouping/join/window partition keys cannot be map type.

Upvotes: 2

Views: 2217

Answers (2)

samkart

Reputation: 6644

It seems you can convert the map column to an array of structs with map_entries, sort it with sort_array so that equal maps produce identical arrays, take a distinct, and then convert it back to a map column with map_from_entries.
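This works because an array of structs, unlike a map, is an orderable type in Spark, so DISTINCT (and GROUP BY) are allowed on it. A minimal illustration of what the conversion produces (assuming Spark 3.x for the output formatting; map_entries and map_from_entries exist since Spark 2.4):

spark.sql("select sort_array(map_entries(map('b', '2', 'a', '1'))) as entries"). \
    show(truncate=False)

# +----------------+
# |entries         |
# +----------------+
# |[{a, 1}, {b, 2}]|
# +----------------+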

Here's a working example:

data_sdf = spark.createDataFrame([('n', {'m': '1'}), ('n', {'m': '1'})], ['name', 'details'])
data_sdf.show()

# +----+--------+
# |name| details|
# +----+--------+
# |   n|{m -> 1}|
# |   n|{m -> 1}|
# +----+--------+

data_sdf.createOrReplaceTempView('data_tbl')

spark.sql('''
    select name, map_from_entries(details_entries) as details
    from (
        select distinct name, sort_array(map_entries(details)) as details_entries
        from data_tbl)
    '''). \
    show()

# +----+--------+
# |name| details|
# +----+--------+
# |   n|{m -> 1}|
# +----+--------+
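The same thing can be written with the DataFrame API instead of SQL (a sketch under the same assumptions; dropDuplicates on the sorted array-of-structs column plays the role of select distinct):

from pyspark.sql import functions as F

# dedupe on the orderable array-of-structs column, then rebuild the map
data_sdf. \
    withColumn('details_entries', F.sort_array(F.map_entries('details'))). \
    dropDuplicates(['name', 'details_entries']). \
    select('name', F.map_from_entries('details_entries').alias('details')). \
    show()

# +----+--------+
# |name| details|
# +----+--------+
# |   n|{m -> 1}|
# +----+--------+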

Upvotes: 3

Jay Shankar Gupta

Reputation: 6088

You can first use the ROW_NUMBER() window function to enumerate the rows within each partition, and then keep only the rows where ROW_NUMBER() returns 1. Since map columns cannot be used as window partition keys, partition by sort_array(map_keys(details)) and sort_array(map_values(details)) instead; arrays are orderable types, so the restriction does not apply to them.

Example input:

df = spark.createDataFrame([('n', {'m': '1'}), ('n', {'m': '1'})], ['name', 'details'])
df.createOrReplaceTempView("table_name")
df.show()
# +----+--------+
# |name| details|
# +----+--------+
# |   n|{m -> 1}|
# |   n|{m -> 1}|
# +----+--------+

Extracting only distinct records:

df_row_num = spark.sql("""
    WITH cte_row_num AS (
        SELECT name
              ,details 
              ,ROW_NUMBER() OVER (
                  PARTITION BY name
                              ,sort_array(map_keys(details))
                              ,sort_array(map_values(details))
                  ORDER BY name) as row_num
        FROM table_name)
    SELECT name
          ,details 
    FROM cte_row_num
    WHERE row_num = 1
""")

df_row_num.show()
# +----+--------+
# |name| details|
# +----+--------+
# |   n|{m -> 1}|
# +----+--------+
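For reference, a DataFrame API version of the same idea (a sketch; Window.partitionBy accepts the same sort_array(map_keys(...)) and sort_array(map_values(...)) expressions). One caveat that also applies to the SQL above: two maps count as duplicates whenever their sorted keys and sorted values match independently, which is slightly weaker than true map equality:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# partition by sorted keys and values, since map columns are rejected as partition keys
w = Window.partitionBy(
        'name',
        F.sort_array(F.map_keys('details')),
        F.sort_array(F.map_values('details'))
    ).orderBy('name')

df.withColumn('row_num', F.row_number().over(w)). \
    filter('row_num = 1'). \
    drop('row_num'). \
    show()

# +----+--------+
# |name| details|
# +----+--------+
# |   n|{m -> 1}|
# +----+--------+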

Upvotes: 2
