Olivier GSCHWIND

Reputation: 43

How to obtain a dataset that contains only the first day of each month present in the data

I have a dataset like this:

+----------+
|        dt|
+----------+
|2019-08-08|
|2019-08-22|
|2019-08-23|
|2019-08-31|
|2019-08-29|
|2019-08-01|
|2019-08-04|
|2019-08-11|
|2019-08-15|
|2019-09-03|
|2019-08-27|
|2019-08-28|
|2019-08-06|
|2019-09-01|
|2019-08-07|
|2019-08-17|
|2019-09-04|
|2019-08-10|
|2019-08-12|
|2019-09-02|
+----------+

It is a Dataset, which I obtain like this:

val df_cra = getData(_spark, prefix + use_db, tb_cra)
  .where("to_date(dt, 'yyyy-MM-dd') >= to_date('" + recupDate + "', 'yyyy-MM-dd')")

val mois_usage = df_cra
  .groupBy("dt")

Now I want a dataset (or list) containing the first day of each month that appears in my starting dataset. For my example, I expect:

+----------+
|    result|
+----------+
|2019-08-01|
|2019-09-01|
+----------+

Do you have any idea? Thanks.

Upvotes: 0

Views: 82

Answers (2)

chlebek

Reputation: 2451

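If you are operating on a Dataset[Row] (a DataFrame), you can use Spark SQL. The example below assumes the date column is named date (it is dt in the question) and that org.apache.spark.sql.functions._ and spark.implicits._ are imported: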

df.groupBy(year('date),month('date)).agg(min('date).as("result")).show()

or

df.createTempView("df")
spark.sql("select min(date) as result from df group by year(date), month(date)").show()

Output:

+----------+
|    result|
+----------+
|2019-08-01|
|2019-09-01|
+----------+
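
Note that min returns the earliest date actually present in each month, so it only yields the first of the month when that day exists in the data. If the calendar first day is wanted regardless, one possible alternative is to truncate each date to its month with trunc and keep the distinct values. This is a minimal sketch, not tested against the asker's tables: the object name FirstOfMonth, the helper firstOfMonth, the local SparkSession, and the sample rows are all assumptions for illustration, and the column is assumed to be named dt as in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, to_date, trunc}

// Hypothetical helper object, for illustration only.
object FirstOfMonth {
  // Maps every date in column "dt" to the first day of its month,
  // then keeps one row per month, sorted chronologically.
  def firstOfMonth(df: DataFrame): DataFrame =
    df.select(trunc(to_date(col("dt")), "month").as("result"))
      .distinct()
      .orderBy("result")

  def main(args: Array[String]): Unit = {
    // Local session for demonstration; a real job would reuse its own session.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("first-of-month")
      .getOrCreate()
    import spark.implicits._

    // Sample rows that deliberately omit 2019-08-01.
    val df = Seq("2019-08-08", "2019-08-22", "2019-09-03", "2019-09-01").toDF("dt")

    firstOfMonth(df).show() // rows: 2019-08-01 and 2019-09-01
    spark.stop()
  }
}
```

With these sample rows, the min-based version would return 2019-08-08 for August, while the trunc-based version returns 2019-08-01.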


Upvotes: 0

marhel

Reputation: 11

Assuming dates is a list of strings in yyyy-MM-dd format and that you can use java.time, you can do:

import java.time._

dates.map(v => LocalDate.parse(v, format.DateTimeFormatter.ISO_DATE))
    .foldLeft(Set.empty[YearMonth])((yms, dt) => yms + YearMonth.from(dt))
    .map(_.atDay(1))

This gives a Set[java.time.LocalDate]: Set(2019-08-01, 2019-09-01).
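
Since a Set has no defined order, a variant that returns a chronologically sorted list may be more convenient. The following is a minimal, self-contained sketch of the same idea. The names MonthStarts and firstDaysOfMonths are hypothetical, the sample values are taken from the question, and the input is assumed to be already available as a local collection of ISO-formatted strings.

```scala
import java.time.{LocalDate, YearMonth}

// Hypothetical helper object, for illustration only.
object MonthStarts {
  // Returns the first day of every distinct month present in `dates`,
  // in chronological order. Each string is assumed to be an ISO-8601
  // date such as "2019-08-22"; LocalDate.parse throws otherwise.
  def firstDaysOfMonths(dates: Seq[String]): List[LocalDate] =
    dates
      .map(s => YearMonth.from(LocalDate.parse(s))) // keep only year and month
      .distinct                                     // one entry per month
      .sortBy(ym => (ym.getYear, ym.getMonthValue)) // chronological order
      .map(_.atDay(1))                              // first day of that month
      .toList

  def main(args: Array[String]): Unit = {
    val sample = Seq("2019-08-08", "2019-09-03", "2019-08-22", "2019-09-01")
    println(firstDaysOfMonths(sample)) // List(2019-08-01, 2019-09-01)
  }
}
```

This runs on the driver only, so it suits small lists; for large datasets a Spark-side transformation is likely the better fit.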

Upvotes: 1
