Satya

Reputation: 5907

Show distinct column values in pyspark dataframe

With a pyspark dataframe, how do you do the equivalent of Pandas df['col'].unique()?

I want to list out all the unique values in a pyspark dataframe column.

Not the SQL-type way (registerTempTable, then a SQL query for the distinct values).

Also, I don't need groupby then countDistinct; instead I want to check the distinct VALUES in that column.

Upvotes: 230

Views: 634541

Answers (15)

Lisa

Reputation: 107

This one-liner will give you a list of unique values:

unique_list = [i['col_name'] for i in df.select('col_name').distinct().collect()]

Upvotes: 4

Aymen Azoui

Reputation: 392

I did this:

unique_values = df.select("col").distinct().rdd.flatMap(lambda x: x).collect()
print(unique_values)

Upvotes: 0

eddies

Reputation: 7483

Let's assume we're working with the following representation of data (two columns, k and v, where k contains three entries, two of them unique):

+---+---+
|  k|  v|
+---+---+
|foo|  1|
|bar|  2|
|foo|  3|
+---+---+

With a Pandas dataframe:

import pandas as pd
p_df = pd.DataFrame([("foo", 1), ("bar", 2), ("foo", 3)], columns=("k", "v"))
p_df['k'].unique()

This returns an ndarray, i.e. array(['foo', 'bar'], dtype=object)

You asked for a "pyspark dataframe alternative for pandas df['col'].unique()". Now, given the following Spark dataframe:

s_df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("foo", 3)], ('k', 'v'))

If you want the same result from Spark, i.e. an ndarray, use toPandas():

s_df.toPandas()['k'].unique()

Alternatively, if you don't need an ndarray specifically and just want a list of the unique values of column k:

s_df.select('k').distinct().rdd.map(lambda r: r[0]).collect()

Finally, you can also use a list comprehension as follows:

[i.k for i in s_df.select('k').distinct().collect()]

Upvotes: 129

Afonso O. Lenzi

Reputation: 99

This can be helpful:

from pyspark.sql.functions import collect_set

df.select(collect_set("your column")).show()

Upvotes: 1

travelingbones

Reputation: 8448

I find this to be the most intuitive (i.e., similar to Pandas) syntax:

new_df = df[['col']].distinct() # returns a one-column pyspark dataframe

or

new_col = df[['col']].distinct()['col'] # returns a pyspark column object. 

So to get those to a list you need:

new_list = df[['col']].distinct().rdd.map(lambda row: row.col).collect()

Upvotes: 3

travelingbones

Reputation: 8448

Similar to the other answers, but the question doesn't seem to want Row objects returned; it wants the actual values.

The ideal one-liner is df.select('column').distinct().toPandas()['column'].to_list(), assuming that the collected result isn't going to be too big for memory.

I recommend running df.select('column').distinct().count() first to estimate the size and make sure it's not too huge beforehand.
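
A minimal sketch of that pattern, assuming the column is literally named 'column' (adjust to your own column name):

# Hypothetical sketch: gauge the cardinality first, then pull the distinct
# values to the driver as a plain Python list.
distinct_df = df.select('column').distinct()

n = distinct_df.count()        # how many values would be collected
if n < 100000:                 # arbitrary threshold; tune to your driver memory
    values = distinct_df.toPandas()['column'].to_list()
    print(values)
else:
    print(f"{n} distinct values -- probably too many to collect comfortably")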

Upvotes: 4

Hari Baskar

Reputation: 426

collect_set can help to get unique values from a given column of a pyspark.sql.DataFrame:

from pyspark.sql import functions as F

df.select(F.collect_set("column").alias("column")).first()["column"]

Upvotes: 14

Nidhi

Reputation: 631

If you want to see the distinct values of a specific column in your dataframe, you just need to write the following code. It shows up to 100 distinct values (if that many are available) for the colname column in the df dataframe.

df.select('colname').distinct().show(100, False)

If you want to do something fancy on the distinct values, you can save them in another DataFrame:

a = df.select('colname').distinct()
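
If you then need the values themselves on the driver, a minimal sketch (reusing the same colname placeholder):

# a is still a lazy DataFrame; collect() materialises it as Row objects,
# which we unpack into a plain Python list
values = [row['colname'] for row in a.collect()]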

Upvotes: 24

Marioanzas

Reputation: 1945

Let us suppose that your original DataFrame is called df. Then, you can use:

from pyspark.sql import functions as F

df1 = df.groupBy('column_1').agg(F.count('column_1').alias('trip_count'))
df1.sort(df1.trip_count.desc()).show()

Upvotes: 0

Pabbati

Reputation: 4189

This should help to get distinct values of a column:

df.select('column1').distinct().collect()

Note that .collect() doesn't have any built-in limit on how many values it can return, so this might be slow; use .show() instead, or add .limit(20) before .collect() to manage this.
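
For example, a small sketch of capping the result before collecting (the 20 is arbitrary):

# Peek at a bounded sample of the distinct values instead of pulling them all
rows = df.select('column1').distinct().limit(20).collect()
sample_values = [r['column1'] for r in rows]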

Upvotes: 419

Joseph Jacob

Reputation: 183

Run this first

df.createOrReplaceTempView('df')

Then run

spark.sql("""
    SELECT distinct
        column name
    FROM
        df
    """).show()

Upvotes: 3

muon

Reputation: 14067

you could do

distinct_column = 'somecol' 

distinct_column_vals = df.select(distinct_column).distinct().collect()
distinct_column_vals = [v[distinct_column] for v in distinct_column_vals]

Upvotes: 14

Kapil Sharma

Reputation: 41

If you want to select the distinct rows across ALL columns of a DataFrame (df), then

df.select('*').distinct().show(10,truncate=False)

Upvotes: 3

ansev

Reputation: 30930

In addition to the dropDuplicates option, there is the method we know from pandas: drop_duplicates.

drop_duplicates() is an alias for dropDuplicates().

Example

s_df = sqlContext.createDataFrame([("foo", 1),
                                   ("foo", 1),
                                   ("bar", 2),
                                   ("foo", 3)], ('k', 'v'))
s_df.show()

+---+---+
|  k|  v|
+---+---+
|foo|  1|
|foo|  1|
|bar|  2|
|foo|  3|
+---+---+

Drop by subset

s_df.drop_duplicates(subset = ['k']).show()

+---+---+
|  k|  v|
+---+---+
|bar|  2|
|foo|  1|
+---+---+

Drop considering all columns

s_df.drop_duplicates().show()

+---+---+
|  k|  v|
+---+---+
|bar|  2|
|foo|  3|
|foo|  1|
+---+---+

Upvotes: 4

seufagner

Reputation: 1370

You can use df.dropDuplicates(['col1', 'col2']) to get only the rows that are distinct with respect to the columns in the list.
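
A quick sketch of how that reduces to the single-column case from the question (the column name col1 is just a placeholder):

# Distinct values of one column via dropDuplicates, collected as a Python list
unique_vals = [r['col1'] for r in df.select('col1').dropDuplicates().collect()]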

Upvotes: 27
