find new values in a column of a pyspark hive table by time point in a time series data

Question

This question is little similar as my previous one but have different requirments.

I am working on a table of pyspark hive.

It is a table of monthly activities of a club members.

Each row is the time of a member take the activity of the club. A member may take the activities on any month.

  year month member_id
  2010 01    m_453
  2010 01    m_972
  2010 02    m_453
  2010 02    m_109
  2010 03    m_453
  2010 03    m_721
  2010 03    m_109

I need to find the new members, who take the activity in each month. A new member means that the member took the activity at the first time.

e.g.

  year month member_id
  2010 01    m_453
  2010 01    m_972
  2010 02    m_109
  2010 03    m_721

All the year and month in the above table are the first time when the member take the activity.

My sql :

 with q as 
 (
    select a.member_id, min(a.year * 100 + a.month) as min_year_month
    from MY_TAB as a
    group by a.member_id
  )
 select q.min_year_month, count(distinct(q.member_id)) as dist_m_id
 from q
 group by q.min_year_month
 order by q.min_year_month

I would like to know if this is correct ? and there are other more efficient DDLs?

thanks

Gordon Linoff · Accepted Answer

That method is fine. I would use:

select year, month, member_id
from (select t.*, row_number() over (partition by member_id order by year, month) as seqnum
      from t
     ) t
where seqnum = 1;

find new values in a column of a pyspark hive table by time point in a time series data

Answers (1)

Related Questions