Phong Nguyen

Reputation: 181

How to make an integer index row?

I have a DataFrame:

+-----+--------+---------+
|  usn|log_type|item_code|
+-----+--------+---------+
|    0|      11|    I0938|
|  916|      19|    I0009|
|  916|      51|    I1097|
|  916|      19|    C0723|
|  916|      19|    I0010|
|  916|      19|    I0010|
|12331|      19|    C0117|
|12331|      19|    C0117|
|12331|      19|    I0009|
|12331|      19|    I0009|
|12331|      19|    I0010|
|12838|      19|    I1067|
|12838|      19|    I1067|
|12838|      19|    C1083|
|12838|      11|    B0250|
|12838|      19|    C1346|
+-----+--------+---------+

I want the distinct item_code values, with an index for each one, like this:

+---------+-----+
|item_code|numId|
+---------+-----+
|    I0938|    0|
|    I0009|    1|
|    I1097|    2|
|    C0723|    3|
|    I0010|    4|
|    C0117|    5|
|    I1067|    6|
|    C1083|    7|
|    B0250|    8|
|    C1346|    9|
+---------+-----+

I don't want to use monotonically_increasing_id because it returns a bigint.

Upvotes: 0

Views: 424

Answers (1)

Shaido

Reputation: 28332

monotonically_increasing_id only guarantees that the numbers are increasing and unique; neither the starting number nor consecutive numbering is guaranteed. If you want to be sure to get 0, 1, 2, 3, ..., you can use the RDD function zipWithIndex().
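To see why those IDs are not consecutive: the Spark documentation describes the current implementation as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. A minimal sketch of that layout in plain Scala, where sketchId is a hypothetical helper for illustration (not a Spark function) and the two-partition split is an assumption:

```scala
// Illustrative only: mimics the documented ID layout, not Spark's actual source.
// Upper bits hold the partition index, lower 33 bits the row's position in it.
def sketchId(partitionIndex: Int, rowInPartition: Long): Long =
  (partitionIndex.toLong << 33) + rowInPartition

// Hypothetical layout: two partitions with two rows each.
val ids = for {
  p <- 0 to 1
  r <- 0L to 1L
} yield sketchId(p, r)

println(ids.mkString(", "))
```

This prints 0, 1, 8589934592, 8589934593, so the second partition starts at 2^33 instead of continuing from 2.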

Since I'm not very familiar with Spark in Python, the example below uses Scala, but it should be easy to convert.

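import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._  // enables .toDF on Seq and RDD

// Sample data: the item_code column from the question
val df = Seq("I0938","I0009","I1097","C0723","I0010","I0010",
    "C0117","C0117","I0009","I0009","I0010","I1067",
    "I1067","C1083","B0250","C1346")
  .toDF("item_code")

// Drop duplicates, then pair each remaining code with a consecutive index
val df2 = df.distinct.rdd
  .map{case Row(item: String) => item}
  .zipWithIndex()
  .toDF("item_code", "numId")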

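This gives each distinct item_code a consecutive index starting at 0. Note that distinct does not preserve the original row order, so the codes are numbered in a different order than in the question: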

+---------+-----+
|item_code|numId|
+---------+-----+
|    I0010|    0|
|    I1067|    1|
|    C0117|    2|
|    I0009|    3|
|    I1097|    4|
|    C1083|    5|
|    I0938|    6|
|    C0723|    7|
|    B0250|    8|
|    C1346|    9|
+---------+-----+
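
One caveat: zipWithIndex() produces a Long, so numId still shows up as bigint in the schema, which the question wanted to avoid. A sketch of casting it to an integer afterwards, assuming the number of distinct codes fits in an Int (master("local[*]") and the shortened sample data are assumptions so the snippet runs on its own):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col

// local[*] is an assumption for running outside a cluster
val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Shortened sample with one duplicate
val df = Seq("I0938", "I0009", "I0009", "C0723").toDF("item_code")

val indexed = df.distinct.rdd
  .map { case Row(item: String) => item }
  .zipWithIndex()                                  // (String, Long) pairs
  .toDF("item_code", "numId")
  .withColumn("numId", col("numId").cast("int"))   // bigint -> int

indexed.printSchema()
```

printSchema() should now report numId as integer instead of long.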

Upvotes: 3
