Reputation: 1599
This question is little similar as my previous one but have different requirments.
I am working on a table of pyspark hive.
It is a table of monthly activities of a club members.
Each row is the time of a member take the activity of the club. A member may take the activities on any month.
year month member_id
2010 01 m_453
2010 01 m_972
2010 02 m_453
2010 02 m_109
2010 03 m_453
2010 03 m_721
2010 03 m_109
I need to find the new members, who take the activity in each month. A new member means that the member took the activity at the first time.
e.g.
year month member_id
2010 01 m_453
2010 01 m_972
2010 02 m_109
2010 03 m_721
All the year and month in the above table are the first time when the member take the activity.
My sql :
with q as
(
select a.member_id, min(a.year * 100 + a.month) as min_year_month
from MY_TAB as a
group by a.member_id
)
select q.min_year_month, count(distinct(q.member_id)) as dist_m_id
from q
group by q.min_year_month
order by q.min_year_month
I would like to know if this is correct ? and there are other more efficient DDLs?
thanks
Upvotes: 0
Views: 24
Reputation: 1271191
That method is fine. I would use:
select year, month, member_id
from (select t.*, row_number() over (partition by member_id order by year, month) as seqnum
from t
) t
where seqnum = 1;
Upvotes: 1