염경훈
염경훈

Reputation: 73

how to make spark numbering column by keyword count

I have dataframe with time-keyword base.

enter image description here

I want to numbering by keyword in below conditions.

How to make it??

enter image description here

here is my sample code.

class DataSetTest {
String date;
String keyword;

public DataSetTest(String date, String keyword) {
    this.date = date;
    this.keyword = keyword;
}}


DataSetTest ro1 = new DataSetTest("2020.11.26 11:30:00", "apple");
DataSetTest ro2 = new DataSetTest("2020.11.26 11:31:00", "apple");
DataSetTest ro3 = new DataSetTest("2020.11.26 11:32:00", "apple");
DataSetTest ro4 = new DataSetTest("2020.11.26 11:34:00", "banana");
DataSetTest ro5 = new DataSetTest("2020.11.26 11:35:00", "car");
DataSetTest ro6 = new DataSetTest("2020.11.26 11:37:00", "car");
DataSetTest ro7 = new DataSetTest("2020.11.26 11:39:00", "apple");
DataSetTest ro8 = new DataSetTest("2020.11.26 11:40:00", "banana");
DataSetTest ro9 = new DataSetTest("2020.11.26 11:42:00", "car");
DataSetTest ro10 = new DataSetTest("2020.11.26 11:43:00", "banana");
DataSetTest ro11= new DataSetTest("2020.11.26 11:45:00", "apple");
DataSetTest ro12 = new DataSetTest("2020.11.26 11:46:00", "apple");
DataSetTest ro13 = new DataSetTest("2020.11.26 11:50:00", "apple");
DataSetTest ro14= new DataSetTest("2020.11.26 11:51:00", "banana");
DataSetTest ro15= new DataSetTest("2020.11.26 11:52:00", "car");
DataSetTest ro16= new DataSetTest("2020.11.26 11:54:00", "car");
DataSetTest ro17= new DataSetTest("2020.11.26 11:58:00", "banana");
Dataset<Row> df = session.getSqlContext().createDataFrame(Arrays.asList(ro1, ro2, ro3, ro4, ro5, ro6, ro7, ro8, ro9, ro10, ro11, ro12, ro13, ro14, ro15, ro16, ro17), DataSetTest.class);

WindowSpec windowSpec = Window.orderBy("keyword");
dataset.withColumn("index", row_number().over(windowSpec));

Upvotes: 0

Views: 57

Answers (1)

falcon-le0
falcon-le0

Reputation: 609

For base intervals you can use date_trunc function:

SELECT time_truncated, time, keyword, ROW_NUMBER() OVER(PARTITION BY time_truncated, keyword ORDER BY time) AS RN
FROM
(
  SELECT *, date_trunc('HOUR', time) AS time_truncated
  FROM table
)t

For intervals like '10 minutes' you need a custom date truncation formula:

date_trunc('minute', from_unixtime(to_unix_timestamp(time) - minute(time) % 10 * 60)) as time_truncated

Upvotes: 1

Related Questions