Feng Chen

Reputation: 2253

How to create a data frame using pyspark, which includes a lot of columns and date data?

When we use pyspark to create a data frame, we usually do it like this:

myItems=spark.createDataFrame([(1,'a'),
                               (2,'c')],
                              ['id','item']) 

Every line is a row in the data frame. But here I need to create a data frame that includes 4 groups and a date column. For example:

myItems=spark.createDataFrame([(1,'2019-01-01',0),
                               (1,'2019-01-02',0),
                               ......
                               (1,'2019-01-31',0),
                               (2,'2019-01-01',0),
                               ......
                               (2,'2019-01-31',0),
                               data about group 3,
                               data about group 4],
                              ['group','date','value']) 

I need to generate 4 groups: 1, 2, 3, 4. For every group, there is a date column from 2019-01-01 to 2019-01-31. The values are all 0.

So I guess I need to use some iteration here. How can I do it?

Upvotes: 1

Views: 119

Answers (1)

cph_sto

Reputation: 7585

You can create the (group, date, value) tuples iteratively using plain Python. The code below is fairly generic and will work for any date range and for as many groups as you want to create. Just change the variables num_of_groups, start_date and end_date to match your requirements.

# Number of groups you want to create, starting with 1.
num_of_groups=4    

# Requisite packages to import for doing the iteration on the dates.    
from datetime import date, timedelta

# Initialize start and end date.
start_date = date(2019, 1, 1)
end_date = date(2019, 1, 31)
delta = end_date - start_date

You can use the datetime library to work with proper date objects, iterate over them to get consecutive dates, and finally use strftime to convert each date to the required string format.

# This will store the list of consecutive dates from start to end date.
list_dates = []
for i in range(delta.days + 1):
    list_dates = list_dates + [(start_date + timedelta(i)).strftime("%d-%m-%Y")]
print(list_dates)
   ['01-01-2019','02-01-2019','03-01-2019',.....'29-01-2019','30-01-2019','31-01-2019']
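Note that the "%d-%m-%Y" pattern gives dd-mm-yyyy strings, while the example in the question uses yyyy-MM-dd; if you prefer that format (just an assumption about what you need), you can swap the format string, for example:

# Assumed variant: ISO-style strings like the question's example (yyyy-mm-dd).
list_dates = [(start_date + timedelta(i)).strftime("%Y-%m-%d")
              for i in range(delta.days + 1)]
print(list_dates)
   ['2019-01-01', '2019-01-02', ..., '2019-01-31']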

Use a list comprehension with two for clauses to build the list of tuples mentioned above.

myValues=[(group,date,0) for group in range(1,num_of_groups+1) for date in list_dates]
print(myValues)
   [(1, '01-01-2019', 0), (1, '02-01-2019', 0), ... (4, '30-01-2019', 0), (4, '31-01-2019', 0)]

Finally, create the DataFrame.

myItems=spark.createDataFrame(myValues,['group','date','value']) 
myItems.show(200)
+-----+-----------+-----+ 
|group|       date|value| 
+-----+-----------+-----+ 
|    1| 01-01-2019|    0| 
|    1| 02-01-2019|    0| 
|    1| 03-01-2019|    0| 
|    1| 04-01-2019|    0| 
|    1| 05-01-2019|    0|
.
.
.

|    4| 28-01-2019|    0| 
|    4| 29-01-2019|    0| 
|    4| 30-01-2019|    0| 
|    4| 31-01-2019|    0| 
+-----+-----------+-----+
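As an optional extra (not something the question asked for), if you want the date column stored as an actual date type instead of a string, you could parse it with to_date (available with a format argument on Spark 2.2+), using a pattern that matches the generated strings:

from pyspark.sql.functions import to_date

# Assumes the strings were built with "%d-%m-%Y" above, i.e. dd-MM-yyyy.
myItems = myItems.withColumn('date', to_date('date', 'dd-MM-yyyy'))
myItems.printSchema()
   root
    |-- group: long (nullable = true)
    |-- date: date (nullable = true)
    |-- value: long (nullable = true)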

Upvotes: 2
