Reputation: 977
I have a pyspark dataframe df with two existing columns, name and birthdate, whose values I want to overwrite with random values.
For the column name
I want a string of random letters with a fixed length (say 10). The string should be randomized per row so that all rows don't get the same string.
For the column birthdate
I want a string in the format YYYY-MM-DD
, with each row getting a random value between 1960-01-01
and 2019-01-01
.
How can I achieve this?
Upvotes: 0
Views: 789
Reputation: 56
You could create random strings with
''.join(random.choice(string.ascii_lowercase) for x in range(size))
where size is the desired length (e.g. 10), and random dates with
month = random.randint(1, 12)
days_in_month = 28 if month == 2 else 30 if month in (4, 6, 9, 11) else 31
'%d-%02d-%02d' % (random.randint(1960, 2018), month, random.randint(1, days_in_month))
(the zero-padding via %02d is needed to get the YYYY-MM-DD format). Don't forget to import random
and import string
.
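The two snippets above can be wrapped into small helper functions. This is a minimal sketch; the function names random_name and random_birthdate are my own, and it draws a uniformly random day between the two bounds via datetime rather than picking year, month, and day separately, which sidesteps the days-per-month bookkeeping entirely:

```python
import random
import string
from datetime import date, timedelta

def random_name(size=10):
    # Random lowercase string of the given fixed length.
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(size))

def random_birthdate(start=date(1960, 1, 1), end=date(2019, 1, 1)):
    # Uniformly random date between start and end (inclusive), as YYYY-MM-DD.
    span = (end - start).days
    return (start + timedelta(days=random.randint(0, span))).isoformat()
```

isoformat() already produces the zero-padded YYYY-MM-DD form the question asks for.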
To create an array with the shape of the dataframe, create a numpy array of the same size. It must have dtype=object, since a plain numeric ndarray can't hold strings:
import numpy as np
n_rows = df.count()
arr = np.empty((2, n_rows), dtype=object)
and then give it the right values through a loop:
for y in range(n_rows):
    arr[0, y] = ''.join(random.choice(string.ascii_lowercase) for x in range(size))
    month = random.randint(1, 12)
    days_in_month = 28 if month == 2 else 30 if month in (4, 6, 9, 11) else 31
    arr[1, y] = '%d-%02d-%02d' % (random.randint(1960, 2018), month, random.randint(1, days_in_month))
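Putting the pieces together, here is a runnable sketch that fills such an array and zips it into (name, birthdate) row tuples, which is the shape that e.g. spark.createDataFrame accepts. The row count of 4 is just for illustration (in practice you would use df.count()), and the Spark call itself is omitted so the example stays self-contained:

```python
import random
import string
import numpy as np
from datetime import date, timedelta

n_rows = 4  # illustrative; use df.count() on a real dataframe

# dtype=object so the array can hold Python strings.
arr = np.empty((2, n_rows), dtype=object)

start, end = date(1960, 1, 1), date(2019, 1, 1)
span = (end - start).days
for y in range(n_rows):
    # Row y gets a fresh random name and birthdate.
    arr[0, y] = ''.join(random.choice(string.ascii_lowercase) for _ in range(10))
    arr[1, y] = (start + timedelta(days=random.randint(0, span))).isoformat()

# (name, birthdate) tuples, one per row.
rows = list(zip(arr[0], arr[1]))
```

From here, rows could be fed to spark.createDataFrame(rows, ['name', 'birthdate']) to build the replacement dataframe.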
Upvotes: 1