Reputation: 171
How can I change the value of a column depending on some validation between some cells? What I need is to compare the kilometraje values of each customer's (id) record to compare whether the record that follows the kilometraje is higher.
fecha id estado id_cliente error_code kilometraje error_km
1/1/2019 1 A 1 10
2/1/2019 2 A ERROR 20
3/1/2019 1 D 1 ERROR 30
4/1/2019 2 O ERROR
The error in the error_km column is because for customer (id) 2 the kilometraje value is less than the same customer record for 2/1/2019 (If time passes the car is used so the kilometraje increases, so that there is no error the mileage has to be higher or the same)
I know that withColumn I can overwrite or create a column that doesn't exist and that using when I can set conditions. For example: This would be the code I use to validate the estado and id_cliente column and ERROR overwrite the error_code column where applicable, but I don't understand how to validate between different rows for the same client.
from pyspark.sql.functions import lit
from pyspark.sql import functions as F
from pyspark.sql.functions import col
file_path = 'archive.txt'
error = 'ERROR'
df = spark.read.parquet(file_path)
df = df.persist(StorageLevel.MEMORY_AND_DISK)
df = df.select('estado', 'id_cliente')
df = df.withColumn("error_code", lit(''))
df = df.withColumn('error_code',
F.when((F.col('status') == 'O') &
(F.col('client_id') != '') |
(F.col('status') == 'D') &
(F.col('client_id') != '') |
(F.col('status') == 'A') &
(F.col('client_id') == ''),
F.concat(F.col("error_code"), F.lit(":[{}]".format(error)))
)
.otherwise(F.col('error_code')))
Upvotes: 0
Views: 697
Reputation: 171
I use .rangeBetween(Window.unboundedPreceding,0).
This function searches from the current value for the added value for the back
import pyspark
from pyspark.sql.functions import lit
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql import Window
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.getOrCreate()
error = 'This is error'
l = [('1/1/2019' , 1 , 10),
('2/1/2019', 2 , 20 ),
('3/1/2019', 1 , 30 ),
('4/1/2019', 1 , 10 ),
('5/1/2019', 1 , 22 ),
('7/1/2019', 1 , 23 ),
('22/1/2019', 2 , 5),
('11/1/2019', 2 , 24),
('13/2/2019', 1 , 16),
('14/2/2019', 2 , 18),
('5/2/2019', 1 , 19),
('6/2/2019', 2 , 23),
('7/2/2019', 1 , 14),
('8/3/2019', 1 , 50),
('8/3/2019', 2 , 50)]
columns = ['date', 'vin', 'mileage']
df=spark.createDataFrame(l, columns)
df = df.withColumn('date',F.to_date(df.date, 'dd/MM/yyyy'))
df = df.withColumn("max", lit(0))
df = df.withColumn("error_code", lit(''))
w = Window.partitionBy('vin').orderBy('date').rangeBetween(Window.unboundedPreceding,0)
df = df.withColumn('max',F.max('mileage').over(w))
df = df.withColumn('error_code', F.when(F.col('mileage') < F.col('max'), F.lit('ERROR')).otherwise(F.lit('')))
df.show()
Finally, all that remains is to remove the column that has the maximum
df = df.drop('max')
df.show()
Upvotes: 0
Reputation: 19355
You achieve that with the lag window function. The lag function returns you the row before the current row. With that you can easily compare the kilometraje values. Have a look at the code below:
import pyspark.sql.functions as F
from pyspark.sql import Window
l = [('1/1/2019' , 1 , 10),
('2/1/2019', 2 , 20 ),
('3/1/2019', 1 , 30 ),
('4/1/2019', 1 , 10 ),
('5/1/2019', 1 , 30 ),
('7/1/2019', 3 , 30 ),
('4/1/2019', 2 , 5)]
columns = ['fecha', 'id', 'kilometraje']
df=spark.createDataFrame(l, columns)
df = df.withColumn('fecha',F.to_date(df.fecha, 'dd/MM/yyyy'))
w = Window.partitionBy('id').orderBy('fecha')
df = df.withColumn('error_km', F.when(F.lag('kilometraje').over(w) > df.kilometraje, F.lit('ERROR') ).otherwise(F.lit('')))
df.show()
Output:
+----------+---+-----------+--------+
| fecha| id|kilometraje|error_km|
+----------+---+-----------+--------+
|2019-01-01| 1| 10| |
|2019-01-03| 1| 30| |
|2019-01-04| 1| 10| ERROR|
|2019-01-05| 1| 30| |
|2019-01-07| 3| 30| |
|2019-01-02| 2| 20| |
|2019-01-04| 2| 5| ERROR|
+----------+---+-----------+--------+
The fourth row doesn't get labeled with 'ERROR' as the previous value had a smaller kilometraje value (10 < 30). When you want to label all the id's with 'ERROR' which contain at least one corrupted row, perform a left join.
df.drop('error_km').join(df.filter(df.error_km == 'ERROR').groupby('id').agg(F.first(df.error_km).alias('error_km')), 'id', 'left').show()
Upvotes: 2