Reputation: 75
Using SQL or PySpark, I want to count the occurrences of each unique time-of-day in a timestamp column across a time frame of 2 months, to see the distribution of how often rows are logged to the table. I know a large proportion of timestamps have the time 00:00:00, but I want to know how large that proportion is and its ratio compared to other times.
This query groups and counts the most common datetimes, but I need to exclude the date and keep only the time. Apparently, this is not such a common thing to do.
select timestamp,
count(*) as count
from table_name
where timestamp between '2021-01-01' and '2021-02-28'
group by 1
order by 2 desc
The SQL/PySpark is run on a Spark DB in a Zeppelin Notebook.
Timestamps look like this: 2021-01-01 02:07:55
Upvotes: 0
Views: 124
Reputation: 6082
Depending on the type of your timestamp
column, you can extract the hour, minute, and second if it is TimestampType
(with lpad to add leading zeros), or use regexp_extract if it is StringType.
from pyspark.sql import functions as F

# if your ts column is TimestampType
(df
    .withColumn('ts', F.col('ts').cast('timestamp'))  # my assumption: ts is a timestamp
    .withColumn('time_only', F.concat(
        F.lpad(F.hour('ts'), 2, '0'),
        F.lit(':'),
        F.lpad(F.minute('ts'), 2, '0'),
        F.lit(':'),
        F.lpad(F.second('ts'), 2, '0')
    ))
    .show()
)

# if your ts column is StringType
(df
    .withColumn('ts', F.col('ts').cast('string'))  # my assumption: ts is a string
    .withColumn('time_only', F.regexp_extract('ts', r'\d{2}:\d{2}:\d{2}', 0))
    .show()
)
# +-------------------+---------+
# | ts|time_only|
# +-------------------+---------+
# |2019-01-15 03:00:00| 03:00:00|
# |2019-01-15 20:00:00| 20:00:00|
# |2019-01-15 19:00:00| 19:00:00|
# |2019-01-15 11:00:00| 11:00:00|
# +-------------------+---------+
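To turn that into the distribution asked about, you can filter to the two-month window from the question, group on the time-of-day string, and count. A minimal sketch, assuming the DataFrame is called df, the column ts is a TimestampType, and using date_format as a one-call alternative to the concat/lpad above:
from pyspark.sql import functions as F

# sketch only: df and ts are assumed names, not from the original code
(df
    .filter(F.col('ts').between('2021-01-01', '2021-02-28'))   # two-month window from the question
    .withColumn('time_only', F.date_format('ts', 'HH:mm:ss'))  # keep only the time of day
    .groupBy('time_only')
    .count()
    .orderBy(F.desc('count'))
    .show()
)
The row with time_only = '00:00:00' then shows how big that group is relative to the other times.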
Upvotes: 1
Reputation: 3015
Maybe something like this?
select
date_format(timestamp, "H m s") as dataTime,
count(*) as count
from table_name
where timestamp between '2021-01-01' and '2021-02-28'
group by date_format(timestamp, "H m s")
order by 2 desc
It is not a good idea to name fields with reserved words (timestamp).
From the Spark documentation.
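If the specific ratio of 00:00:00 rows to everything else is what you're after, a rough PySpark sketch of the same idea (again assuming a DataFrame df with a TimestampType column ts, which are not names from this answer) could be:
from pyspark.sql import functions as F

# sketch only: df and ts are assumed names
is_midnight = (F.date_format('ts', 'HH:mm:ss') == '00:00:00').cast('int')
(df
    .filter(F.col('ts').between('2021-01-01', '2021-02-28'))
    .agg(
        F.sum(is_midnight).alias('midnight_rows'),
        F.count('*').alias('total_rows'))
    .withColumn('midnight_ratio', F.col('midnight_rows') / F.col('total_rows'))
    .show()
)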
Upvotes: 1