Juan

Reputation: 171

Validate data from the same column in different rows with pyspark

How can I change the value of a column based on a validation between cells in different rows? I need to compare the kilometraje values of each customer's (id) records and check whether each following record has a higher kilometraje than the previous one.

fecha     id  estado  id_cliente  error_code  kilometraje  error_km
1/1/2019  1   A       1                       10
2/1/2019  2   A                   ERROR       20
3/1/2019  1   D       1           ERROR       30
4/1/2019  2   O                                            ERROR

The ERROR in the error_km column appears because, for customer (id) 2, the kilometraje value is less than that same customer's record from 2/1/2019. (As time passes the car is used, so the kilometraje should only increase; for a row to be error-free its mileage has to be higher than or equal to the previous one.)

I know that with withColumn I can overwrite a column or create one that doesn't exist, and that with when I can set conditions. For example, this is the code I use to validate the estado and id_cliente columns and write ERROR into the error_code column where applicable, but I don't understand how to validate across different rows for the same client.

from pyspark import StorageLevel
from pyspark.sql import functions as F
from pyspark.sql.functions import lit

file_path = 'archive.txt'

error = 'ERROR'

df = spark.read.parquet(file_path)
df = df.persist(StorageLevel.MEMORY_AND_DISK)
df = df.select('estado', 'id_cliente')
df = df.withColumn("error_code", lit(''))

# Flag rows with an invalid estado / id_cliente combination
df = df.withColumn('error_code',
                   F.when(((F.col('estado') == 'O') & (F.col('id_cliente') != '')) |
                          ((F.col('estado') == 'D') & (F.col('id_cliente') != '')) |
                          ((F.col('estado') == 'A') & (F.col('id_cliente') == '')),
                          F.concat(F.col('error_code'), F.lit(':[{}]'.format(error))))
                    .otherwise(F.col('error_code')))

Upvotes: 0

Views: 697

Answers (2)

Juan

Reputation: 171

I used .rangeBetween(Window.unboundedPreceding, 0).

With this frame the window covers every row from the start of the partition up to and including the current row, so F.max returns the running maximum of the mileage seen so far for each vehicle.

import pyspark
from pyspark.sql.functions import lit
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql import Window
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

error = 'This is error'

l = [('1/1/2019' , 1      , 10),
('2/1/2019', 2     , 20  ),
('3/1/2019', 1      , 30  ),
('4/1/2019', 1      , 10  ),
('5/1/2019', 1      , 22  ),
('7/1/2019', 1      , 23  ),
('22/1/2019', 2      , 5),
('11/1/2019', 2      , 24),
('13/2/2019', 1      , 16),
('14/2/2019', 2      , 18),
('5/2/2019', 1      , 19),
('6/2/2019', 2      , 23),
('7/2/2019', 1      , 14),
('8/3/2019', 1      , 50),
('8/3/2019', 2      , 50)]

columns = ['date', 'vin', 'mileage']

df=spark.createDataFrame(l, columns)
df = df.withColumn('date',F.to_date(df.date,  'dd/MM/yyyy'))
df = df.withColumn("max", lit(0))
df = df.withColumn("error_code", lit(''))

# Cumulative frame: all rows of the same vehicle up to and including the current date
w = Window.partitionBy('vin').orderBy('date').rangeBetween(Window.unboundedPreceding, 0)

# Running maximum of the mileage, then flag rows that fall below it
df = df.withColumn('max', F.max('mileage').over(w))
df = df.withColumn('error_code', F.when(F.col('mileage') < F.col('max'), F.lit('ERROR')).otherwise(F.lit('')))

df.show()

Finally, all that remains is to drop the helper column that holds the running maximum:

df = df.drop('max')
df.show()
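
To double-check which rows were flagged (a quick sketch that reuses the df from above), you can filter on the new column:

# Show only the rows whose mileage dropped below the running maximum
df.filter(F.col('error_code') == 'ERROR').show()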

Upvotes: 0

cronoik

Reputation: 19355

You can achieve that with the lag window function. lag returns the value from the previous row within the window, so you can easily compare the kilometraje values. Have a look at the code below:

import pyspark.sql.functions as F
from pyspark.sql import Window

l = [('1/1/2019' , 1      , 10),
('2/1/2019', 2     , 20  ),
('3/1/2019', 1      , 30  ),
('4/1/2019', 1      , 10  ),
('5/1/2019', 1      , 30  ),
('7/1/2019', 3      , 30  ),
('4/1/2019', 2      , 5)]

columns = ['fecha', 'id', 'kilometraje']

df=spark.createDataFrame(l, columns)
df = df.withColumn('fecha',F.to_date(df.fecha,  'dd/MM/yyyy'))

# Window per customer (id), ordered by date
w = Window.partitionBy('id').orderBy('fecha')

# Compare each row with the previous row of the same customer
df = df.withColumn('error_km', F.when(F.lag('kilometraje').over(w) > df.kilometraje, F.lit('ERROR')).otherwise(F.lit('')))

df.show()

Output:

+----------+---+-----------+--------+ 
|     fecha| id|kilometraje|error_km| 
+----------+---+-----------+--------+ 
|2019-01-01|  1|         10|        | 
|2019-01-03|  1|         30|        | 
|2019-01-04|  1|         10|   ERROR| 
|2019-01-05|  1|         30|        | 
|2019-01-07|  3|         30|        | 
|2019-01-02|  2|         20|        | 
|2019-01-04|  2|          5|   ERROR| 
+----------+---+-----------+--------+

The fourth row doesn't get labeled with 'ERROR' because the previous value has a smaller kilometraje value (10 < 30). If you want to label every row of an id that contains at least one corrupted row with 'ERROR', perform a left join:

df.drop('error_km').join(
    df.filter(df.error_km == 'ERROR').groupby('id').agg(F.first(df.error_km).alias('error_km')),
    'id', 'left').show()
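
Alternatively (a sketch that reuses the df and the Window import from above, and relies on max over strings being lexicographic, so 'ERROR' sorts after the empty string), you can propagate the flag to every row of an affected id with a second window instead of a join:

# Mark every row of an id as 'ERROR' if that id has at least one corrupted row
w_id = Window.partitionBy('id')
df.withColumn('error_km', F.max('error_km').over(w_id)).show()

Unlike the left join, this keeps an empty string instead of null for the ids without errors.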

Upvotes: 2
