Create and use column identified customers Pyspark

Question

When I run my code, I get the following error in the console:

AnalysisException: cannot resolve '1 as identified' given input columns: [LPay_user, spark_catalog.nn_TEAM_es.fact_table.customer_id, identified, HELLO_pay_date, spark_catalog.nn_TEAM_es.fact_table.ticket_id];;

Here is the code used:

start = '2020-10-20'
end = '2021-01-20'

identified = (spark.table(f'nn_TEAM_{country}.fact_table')
                  .filter(f.col('date_key').between(start,end))
                  .filter(f.col('is_HELLO_plus')==1)
                  .filter(f.col('source')=='tickets')
                  .filter(f.col('subtype')=='trx')
                  .filter(f.col('is_trx_ok')==1)
                  .withColumn('week', f.date_format(f.date_sub(f.col('date_key'), 1), 'YYYY-ww'))
                  .withColumn('month', f.date_format(f.date_sub(f.col('date_key'), 1), 'M'))
                  .selectExpr('customer_id','ticket_id','1 as identified')
                 )

output = (identified
          .join(dim_customers,'customer_id','left')
          .withColumn('segment_group',
                      f.when((f.col('HEL_user')==1),'HELLO_user')
                      .when((f.col('HEL_user')==0),'NO_HELLO_user')
                      .when((f.col('1 as identified').isNull()) & (f.col('HELLO_user')==1),'HELLO_user_no_identified')                      
                      )
         .groupby('segment_group')
         .agg(
           f.countDistinct('customer_id').alias('customers'),
           f.countDistinct('ticket_id').alias('tickets')
         ))

As you can see, the column "1 as identified" is selected, so I don't understand why I'm getting this error.

What I'm trying to do is create a customer segmentation based on the columns "1 as identified" and "HEL_user".

Can someone explain to me how to fix this error? Thanks!

mck · Accepted Answer

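When you select '1 as identified', you create a new column named identified in which every row holds the value 1. The 'as' keyword assigns an alias, so the text of the expression is not the column name. The error message confirms this: identified appears in the list of input columns, while '1 as identified' does not. To refer to the column later, use f.col('identified'), not f.col('1 as identified').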
