Reputation: 2253
When we use pyspark to create a data frame, we usually do it this way:
myItems=spark.createDataFrame([(1,'a'),
(2,'c')],
['id','item'])
Every line is a row in the data frame. But here, I need to create a data frame, including 4 groups and dates. For example:
myItems=spark.createDataFrame([(1,'2019-01-01',0),
(1,'2019-01-02',0),
......
(1,'2019-01-31',0),
(2,'2019-01-01',0),
......
(2,'2019-01-31',0),
data about group 3,
data about group 4],
['group','date','value'])
I need to generate 4 groups: 1, 2, 3, 4. For every group, there is a date column from 2019-01-01 to 2019-01-31. The values are all 0.
So, I guess I need to use some iteration here. How can I do it?
Upvotes: 1
Views: 119
Reputation: 7585
You can create tuples of (group, date, value) iteratively using normal Python. The code below is fairly generic and will work for almost any date range and for as many groups as you want to create. Just change the variables num_of_groups, start_date and end_date so that they align with your requirements.
# Number of groups you want to create, starting with 1.
num_of_groups = 4
# Requisite packages to import for doing the iteration on the dates.
from datetime import date, timedelta
# Initialize start and end date.
start_date = date(2019, 1, 1)
end_date = date(2019, 1, 31)
delta = end_date - start_date
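For these dates, delta.days is 30, so the loop over range(delta.days + 1) below produces all 31 days of January 2019.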
You can use the datetime Python library to get the dates as date objects, iterate over them to get consecutive dates, and finally use strftime to convert each date to the required string format.
# This will store the list of consecutive dates from start to end date.
list_dates = []
for i in range(delta.days + 1):
    list_dates.append((start_date + timedelta(i)).strftime("%d-%m-%Y"))
print(list_dates)
['01-01-2019','02-01-2019','03-01-2019',.....'29-01-2019','30-01-2019','31-01-2019']
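As a side note, the same list of date strings can also be built in a single line with a list comprehension; a minimal equivalent sketch, reusing the start_date and delta defined above:
# Equivalent one-liner for building the list of consecutive date strings.
list_dates = [(start_date + timedelta(i)).strftime("%d-%m-%Y") for i in range(delta.days + 1)]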
Use a list comprehension with two for clauses to create the list of (group, date, value) tuples mentioned above.
myValues=[(group,date,0) for group in range(1,num_of_groups+1) for date in list_dates]
print(myValues)
[(1, '01-01-2019', 0), (1, '02-01-2019', 0), ... (4, '30-01-2019', 0), (4, '31-01-2019', 0)]
Finally, creating the DataFrame.
myItems=spark.createDataFrame(myValues,['group','date','value'])
myItems.show(200)
+-----+----------+-----+
|group|      date|value|
+-----+----------+-----+
|    1|01-01-2019|    0|
|    1|02-01-2019|    0|
|    1|03-01-2019|    0|
|    1|04-01-2019|    0|
|    1|05-01-2019|    0|
.
.
.
|    4|28-01-2019|    0|
|    4|29-01-2019|    0|
|    4|30-01-2019|    0|
|    4|31-01-2019|    0|
+-----+----------+-----+
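The date column above is a plain string in dd-MM-yyyy format. If you prefer the yyyy-MM-dd format from the question, or a proper date type, one option is to convert the column with to_date after creating the DataFrame; a minimal sketch, assuming the myItems DataFrame built above:
from pyspark.sql.functions import to_date
# Parse the dd-MM-yyyy strings into an actual DateType column.
myItems = myItems.withColumn('date', to_date('date', 'dd-MM-yyyy'))
myItems.show(5)
Spark displays DateType values as yyyy-MM-dd, which matches the format used in the question.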
Upvotes: 2