Reputation: 2564
I have an RDD of (date-time, hostname) tuples
and I want to count the unique hostnames per date.
RDD:
X = [(datetime.datetime(1995, 8, 1, 0, 0, 1), u'in24.inetnebr.com'),
(datetime.datetime(1995, 8, 1, 0, 0, 7), u'uplherc.upl.com'),
(datetime.datetime(1995, 8, 1, 0, 0, 8), u'uplherc.upl.com'),
(datetime.datetime(1995, 8, 2, 0, 0, 8), u'uplherc.upl.com'),
(datetime.datetime(1995, 8, 2, 0, 0, 8), u'uplherc.upl.com'),
(datetime.datetime(1995, 8, 2, 0, 0, 9), u'ix-esc-ca2-07.ix.netcom.com'),
(datetime.datetime(1995, 8, 3, 0, 0, 10), u'uplherc.upl.com'),
(datetime.datetime(1995, 8, 3, 0, 0, 10), u'slppp6.intermind.net'),
(datetime.datetime(1995, 8, 4, 0, 0, 10), u'piweba4y.prodigy.com'),
(datetime.datetime(1995, 8, 5, 0, 0, 11), u'slppp6.intermind.net')]
DESIRED OUTPUT:
[(datetime.datetime(1995, 8, 1, 0, 0, 1), 2),
(datetime.datetime(1995, 8, 2, 0, 0, 8), 2),
(datetime.datetime(1995, 8, 3, 0, 0, 10), 2),
(datetime.datetime(1995, 8, 4, 0, 0, 10), 1),
(datetime.datetime(1995, 8, 5, 0, 0, 11), 1)]
MY ATTEMPT:
dayGroupedHosts = X.groupBy(lambda x: x[0]).distinct()
dayHostCount = dayGroupedHosts.count()
I get an error when performing the count
operation. I am new to Spark
and would like to know the correct and efficient transformation
for this kind of task.
Thanks a lot in advance.
Upvotes: 3
Views: 6511
Reputation: 215117
Or convert to a DataFrame and use the countDistinct
function:
import pyspark.sql.functions as f
df = spark.createDataFrame(X, ["dt", "hostname"])
df.show()
+-------------------+--------------------+
|                 dt|            hostname|
+-------------------+--------------------+
|1995-08-01 00:00:01|   in24.inetnebr.com|
|1995-08-01 00:00:07|     uplherc.upl.com|
|1995-08-01 00:00:08|     uplherc.upl.com|
|1995-08-02 00:00:08|     uplherc.upl.com|
|1995-08-02 00:00:08|     uplherc.upl.com|
|1995-08-02 00:00:09|ix-esc-ca2-07.ix....|
|1995-08-03 00:00:10|     uplherc.upl.com|
|1995-08-03 00:00:10|slppp6.intermind.net|
|1995-08-04 00:00:10|piweba4y.prodigy.com|
|1995-08-05 00:00:11|slppp6.intermind.net|
+-------------------+--------------------+
df.groupBy(f.to_date('dt').alias('date')).agg(
    f.countDistinct('hostname').alias('hostname')
).show()
+----------+--------+
|      date|hostname|
+----------+--------+
|1995-08-02|       2|
|1995-08-03|       2|
|1995-08-01|       2|
|1995-08-04|       1|
|1995-08-05|       1|
+----------+--------+
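The counts in the table can be cross-checked without a Spark session. The following is only a plain-Python sketch of the same grouping logic on the sample data from the question; the names `hosts_by_date` and `counts` are illustrative and not part of any Spark API.

```python
import datetime
from collections import defaultdict

# Sample records from the question: (timestamp, hostname)
X = [
    (datetime.datetime(1995, 8, 1, 0, 0, 1), 'in24.inetnebr.com'),
    (datetime.datetime(1995, 8, 1, 0, 0, 7), 'uplherc.upl.com'),
    (datetime.datetime(1995, 8, 1, 0, 0, 8), 'uplherc.upl.com'),
    (datetime.datetime(1995, 8, 2, 0, 0, 8), 'uplherc.upl.com'),
    (datetime.datetime(1995, 8, 2, 0, 0, 8), 'uplherc.upl.com'),
    (datetime.datetime(1995, 8, 2, 0, 0, 9), 'ix-esc-ca2-07.ix.netcom.com'),
    (datetime.datetime(1995, 8, 3, 0, 0, 10), 'uplherc.upl.com'),
    (datetime.datetime(1995, 8, 3, 0, 0, 10), 'slppp6.intermind.net'),
    (datetime.datetime(1995, 8, 4, 0, 0, 10), 'piweba4y.prodigy.com'),
    (datetime.datetime(1995, 8, 5, 0, 0, 11), 'slppp6.intermind.net'),
]

# Collect the set of hostnames seen on each calendar day.
hosts_by_date = defaultdict(set)
for ts, host in X:
    hosts_by_date[ts.date()].add(host)

# One (date, unique-host count) pair per day, sorted by date.
counts = sorted((day, len(hosts)) for day, hosts in hosts_by_date.items())
for day, n in counts:
    print(day, n)
```

Unlike the `show()` output, which comes back in no particular order, this list is sorted by date.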
Upvotes: 4
Reputation: 43544
You need to first convert the keys into dates. Then group by the key, and count the distinct values:
X.map(lambda x: (x[0].date(), x[1]))\
    .groupByKey()\
    .mapValues(lambda vals: len(set(vals)))\
    .sortByKey()\
    .collect()
#[(datetime.date(1995, 8, 1), 2),
# (datetime.date(1995, 8, 2), 2),
# (datetime.date(1995, 8, 3), 2),
# (datetime.date(1995, 8, 4), 1),
# (datetime.date(1995, 8, 5), 1)]
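Since the question also asks about efficiency: groupByKey gathers every hostname for a day into a single iterable before it is deduplicated. A possible alternative, sketched below, deduplicates the (date, hostname) pairs first and then sums per date with reduceByKey. The function name `count_unique_hosts_by_date` is hypothetical, and the input is assumed to be an RDD of (datetime, hostname) tuples as in the question.

```python
def count_unique_hosts_by_date(rdd):
    """Return an RDD of (date, number of distinct hostnames), sorted by date.

    Assumes `rdd` holds (datetime.datetime, hostname) tuples.
    """
    return (
        rdd.map(lambda pair: (pair[0].date(), pair[1]))  # truncate timestamp to a date
           .distinct()                                   # keep one record per (date, hostname)
           .map(lambda pair: (pair[0], 1))               # each remaining record is one unique host
           .reduceByKey(lambda a, b: a + b)              # add up unique hosts per date
           .sortByKey()
    )
```

With an existing SparkContext `sc`, calling `count_unique_hosts_by_date(sc.parallelize(X)).collect()` on the sample list should give the same per-date counts as the groupByKey version. Whether it is actually faster depends on the data, so it is worth measuring on the real logs.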
Upvotes: 4