bautrey

Reputation: 75

SQL: Match timestamps with time-only parameter to group and count unique times across multiple days

Using SQL or PySpark, I want to count the number of unique times in a timestamp column across a time frame of 2 months, to see the distribution of how often rows are logged to the table. I know a large proportion of timestamps have the time 00:00:00, but I want to know how large that proportion is compared to other times.

This query groups and counts the most common datetimes, but I need to exclude the date and keep only the time. Apparently, this is not such a common thing to do.

select timestamp,
    count(*) as count
from table_name
where timestamp between '2021-01-01' and '2021-02-28'
group by 1
order by 2 desc

The SQL/PySpark is run against a Spark database in a Zeppelin notebook.

Timestamps look like this: 2021-01-01 02:07:55

Upvotes: 0

Views: 124

Answers (2)

pltc

Reputation: 6082

Depending on the type of your timestamp column, you can extract the hour, minute, and second if it is TimestampType (with lpad to add leading zeros), or use regexp_extract if it is StringType.
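If you are not sure which type you have, a quick check (assuming the column is called ts, as in the snippets below):

df.printSchema()              # shows whether ts is timestamp or string
# or
print(dict(df.dtypes)['ts'])  # e.g. 'timestamp' or 'string'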

from pyspark.sql import functions as F

# if your ts column has TimestampType
(df
    .withColumn('ts', F.col('ts').cast('timestamp')) # assuming ts is (or can be cast to) a timestamp
    .withColumn('time_only', F.concat(
        F.lpad(F.hour('ts'), 2, '0'),
        F.lit(':'),
        F.lpad(F.minute('ts'), 2, '0'),
        F.lit(':'),
        F.lpad(F.second('ts'), 2, '0')
    ))
    .show()
)

# if your ts column is StringType
(df
    .withColumn('ts', F.col('ts').cast('string')) # assuming ts is (or can be cast to) a string
    .withColumn('time_only', F.regexp_extract('ts', r'\d{2}:\d{2}:\d{2}', 0))
    .show()
)

# +-------------------+---------+
# |                 ts|time_only|
# +-------------------+---------+
# |2019-01-15 03:00:00| 03:00:00|
# |2019-01-15 20:00:00| 20:00:00|
# |2019-01-15 19:00:00| 19:00:00|
# |2019-01-15 11:00:00| 11:00:00|
# +-------------------+---------+
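To get the distribution you describe, you can then group and count on the derived column (a minimal follow-up, using the time_only column from either snippet above):

(df
    .groupBy('time_only')
    .count()
    .orderBy(F.desc('count'))
    .show()
)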

Upvotes: 1

James

Reputation: 3015

Maybe something like this?

select 
  date_format(timestamp, "HH:mm:ss") as timeOnly,
  count(*) as count
from table_name
where timestamp between '2021-01-01' and '2021-02-28'
group by date_format(timestamp, "HH:mm:ss")
order by 2 desc

It is not a good idea to name fields with reserved words (timestamp); if the column really is called that, escape it with backticks in Spark SQL.

See the Spark documentation on date_format and its datetime patterns.
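The same idea also works in the DataFrame API if you want to stay in PySpark (a minimal sketch; table_name and the timestamp column are the hypothetical names from the question):

from pyspark.sql import functions as F

(spark.table('table_name')
    .where(F.col('timestamp').between('2021-01-01', '2021-02-28'))
    .groupBy(F.date_format('timestamp', 'HH:mm:ss').alias('timeOnly'))
    .count()
    .orderBy(F.desc('count'))
    .show()
)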

Upvotes: 1
