Jay Shankar Gupta

Reputation: 6088

Spark SQL Select Distinct record from hive map data type column

I have a Hive table with a column of map type, and I get errors while running the Spark SQL queries below:

Table Structure:

Column_name         datatype
----------------------------------------
name                string
details             map<string,string>

df = spark.sql("""select distinct name, details from table_name""")

AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column details is map<string,string>;

df = spark.sql("""select name, details 
                  from table_name
                  group by name, details""")

AnalysisException: expression table_name.details cannot be used as a grouping expression because its data type map<string,string> is not an orderable data type.;

df = spark.sql("""
            WITH cte_row_num AS (
                SELECT name
                       ,details
                       ,ROW_NUMBER() OVER (
                              PARTITION BY name
                                          ,details 
                              ORDER BY name) as row_num 
                FROM table_name) 
            SELECT name
                  ,details 
            FROM cte_row_num 
            WHERE row_num = 1
           """)

java.lang.IllegalStateException: grouping/join/window partition keys cannot be map type.

Upvotes: 2

Views: 2217

Answers (2)

samkart

Reputation: 6644

It seems you can convert the map column to an array of structs with map_entries, sort it with sort_array so that equal maps produce identical arrays, take a distinct, and then convert it back to a map column with map_from_entries.
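This works because an array of structs, unlike a map, is an orderable type in Spark, so DISTINCT (and GROUP BY) are allowed on it. A minimal illustration of what the conversion produces (assuming Spark 3.x for the output formatting; map_entries and map_from_entries exist since Spark 2.4):

spark.sql("select sort_array(map_entries(map('b', '2', 'a', '1'))) as entries"). \
    show(truncate=False)

# +----------------+
# |entries         |
# +----------------+
# |[{a, 1}, {b, 2}]|
# +----------------+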

Here's a working example:

data_sdf = spark.createDataFrame([('n', {'m': '1'}), ('n', {'m': '1'})], ['name', 'details'])
data_sdf.show()

# +----+--------+
# |name| details|
# +----+--------+
# |   n|{m -> 1}|
# |   n|{m -> 1}|
# +----+--------+

data_sdf.createOrReplaceTempView('data_tbl')

spark.sql('''
    select name, map_from_entries(details_entries) as details
    from (
        select distinct name, sort_array(map_entries(details)) as details_entries
        from data_tbl)
    '''). \
    show()

# +----+--------+
# |name| details|
# +----+--------+
# |   n|{m -> 1}|
# +----+--------+
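The same thing can be written with the DataFrame API instead of SQL (a sketch under the same assumptions; dropDuplicates on the sorted array-of-structs column plays the role of select distinct):

from pyspark.sql import functions as F

# dedupe on the orderable array-of-structs column, then rebuild the map
data_sdf. \
    withColumn('details_entries', F.sort_array(F.map_entries('details'))). \
    dropDuplicates(['name', 'details_entries']). \
    select('name', F.map_from_entries('details_entries').alias('details')). \
    show()

# +----+--------+
# |name| details|
# +----+--------+
# |   n|{m -> 1}|
# +----+--------+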

Upvotes: 3

Jay Shankar Gupta

Reputation: 6088

You can first use the ROW_NUMBER() window function to enumerate the rows within each partition, and then keep only the rows where ROW_NUMBER() returns 1. Since map columns cannot be used as window partition keys, partition by sort_array(map_keys(details)) and sort_array(map_values(details)) instead; arrays are orderable types, so the restriction does not apply to them.

Example input:

df = spark.createDataFrame([('n', {'m': '1'}), ('n', {'m': '1'})], ['name', 'details'])
df.createOrReplaceTempView("table_name")
df.show()
# +----+--------+
# |name| details|
# +----+--------+
# |   n|{m -> 1}|
# |   n|{m -> 1}|
# +----+--------+

Extracting only distinct records:

df_row_num = spark.sql("""
    WITH cte_row_num AS (
        SELECT name
              ,details 
              ,ROW_NUMBER() OVER (
                  PARTITION BY name
                              ,sort_array(map_keys(details))
                              ,sort_array(map_values(details))
                  ORDER BY name) as row_num
        FROM table_name)
    SELECT name
          ,details 
    FROM cte_row_num
    WHERE row_num = 1
""")

df_row_num.show()
# +----+--------+
# |name| details|
# +----+--------+
# |   n|{m -> 1}|
# +----+--------+
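For reference, a DataFrame API version of the same idea (a sketch; Window.partitionBy accepts the same sort_array(map_keys(...)) and sort_array(map_values(...)) expressions). One caveat that also applies to the SQL above: two maps count as duplicates whenever their sorted keys and sorted values match independently, which is slightly weaker than true map equality:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# partition by sorted keys and values, since map columns are rejected as partition keys
w = Window.partitionBy(
        'name',
        F.sort_array(F.map_keys('details')),
        F.sort_array(F.map_values('details'))
    ).orderBy('name')

df.withColumn('row_num', F.row_number().over(w)). \
    filter('row_num = 1'). \
    drop('row_num'). \
    show()

# +----+--------+
# |name| details|
# +----+--------+
# |   n|{m -> 1}|
# +----+--------+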

Upvotes: 2
