Reputation: 1103
I have a PySpark DataFrame and I want to convert one of its columns from string to int. Example:
Table 1:
+------------+-----+
|  categories|value|
+------------+-----+
|         red| 0.23|
|       green| 0.34|
|      yellow| 0.56|
|       black| 0.11|
|         red| 0.67|
|         red| 0.34|
|       green| 0.45|
+------------+-----+
Table 2:
+------------+-----+
|   categ_num|value|
+------------+-----+
|           1| 0.23|
|           2| 0.34|
|           3| 0.56|
|           4| 0.11|
|           1| 0.67|
|           1| 0.34|
|           2| 0.45|
+------------+-----+
So, in this case: [red=1, green=2, yellow=3, black=4].
However, I don't know all the colors in advance, so I can't assign the numbers manually. I need a way to do the assignment automatically.
Could anyone help me, please?
Upvotes: 2
Views: 8322
Reputation: 143
If you want a solution with less code and your categories do not need to be numbered in any particular order, you can use dense_rank from pyspark.sql.functions.
import pyspark.sql.functions as F
from pyspark.sql.window import Window
# Number the distinct categories 1, 2, 3, ... in ascending order of their names
df = df.withColumn("categ_num", F.dense_rank().over(Window.orderBy("categories")))
Keep in mind that a window without partitionBy moves all rows to a single partition, so this can be slow on large DataFrames.
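To see which number each color would get, here is a plain-Python sketch (no Spark involved) that mimics the numbering dense_rank assigns when ordering by the category name, using the colors from the question. The variable names are only for this illustration. Note that the numbers follow alphabetical order, so they differ from the order of first appearance shown in the question's Table 2.

```python
# Plain-Python illustration of the numbering that dense_rank() produces
# when ordering by the category name. This is not Spark code.
categories = ["red", "green", "yellow", "black", "red", "red", "green"]

# Distinct values in ascending order get consecutive ranks starting at 1.
mapping = {name: rank for rank, name in enumerate(sorted(set(categories)), start=1)}

# Look up the rank for every row, keeping the original row order.
categ_num = [mapping[name] for name in categories]

print(mapping)    # {'black': 1, 'green': 2, 'red': 3, 'yellow': 4}
print(categ_num)  # [3, 2, 4, 1, 3, 3, 2]
```

If the exact numbers matter (for example red=1), an explicit mapping would be needed instead of a rank.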
Upvotes: 1
Reputation: 1103
This code worked for me:
from pyspark.ml.feature import StringIndexer
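# Assumes an active SparkSession named `spark`
df = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])
# Map each distinct string to a numeric index (a double, starting at 0.0)
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexed = indexer.fit(df).transform(df)
indexed.show()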
https://spark.apache.org/docs/latest/ml-features.html#stringindexer
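For readers wondering what the resulting indices look like: with its default settings, StringIndexer orders labels by descending frequency and writes the index as a double starting at 0.0. The plain-Python sketch below (no Spark involved) only mimics that rule on the same sample values. The names `category_index` and `categ_num` are made up for this illustration, and the last step is a hypothetical post-processing to get the 1-based integers the question asks for.

```python
from collections import Counter

# Same sample values as the "category" column above.
categories = ["a", "b", "c", "a", "a", "c"]

# Most frequent label first: a (3 rows), c (2 rows), b (1 row).
labels = [label for label, _ in Counter(categories).most_common()]

# The position in that list becomes the index, stored as a double from 0.0.
category_index = {label: float(i) for i, label in enumerate(labels)}

# Hypothetical post-processing: 1-based integers as in the question.
categ_num = {label: int(idx) + 1 for label, idx in category_index.items()}

print(category_index)  # {'a': 0.0, 'c': 1.0, 'b': 2.0}
print(categ_num)       # {'a': 1, 'c': 2, 'b': 3}
```

In Spark itself, the same post-processing could probably be done by casting the output column and adding one, for example `indexed.withColumn("categ_num", F.col("categoryIndex").cast("int") + 1)` with `pyspark.sql.functions` imported as `F`.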
Upvotes: 5