Filip Eriksson

Reputation: 977

add columns with random values to pyspark dataframe

I have a pyspark dataframe df with two existing columns name and birthdate for which I want to overwrite the values with random values.

For column name I want to have a string with a random set of letters with a fixed length (say 10). The string should be randomized for each row so all rows don't get the same string.

for column birthdate I want a string on format YYYY-MM-DD. I want each row to have a random value between 1960-01-01 and 2019-01-01.

How can I achieve this?

Upvotes: 0

Views: 789

Answers (1)

CG poly

Reputation: 56

You could create random strings of a fixed length (here `size = 10`) with

''.join(random.choice(string.ascii_lowercase) for x in range(size))

and random dates in YYYY-MM-DD format (zero-padded, with the correct number of days per month) with

month = random.randint(1, 12)
day = random.randint(1, [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31][month - 1])
'%04d-%02d-%02d' % (random.randint(1960, 2018), month, day)

Don't forget to import random and import string.
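Putting those pieces together, here is a minimal self-contained sketch (the helper names random_name and random_birthdate are my own). Sampling a day offset between the two boundary dates avoids the per-field month/day logic entirely, and isoformat() zero-pads automatically:

```python
import random
import string
from datetime import date, timedelta

def random_name(size=10):
    # Random lowercase string of a fixed length.
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(size))

def random_birthdate(start=date(1960, 1, 1), end=date(2019, 1, 1)):
    # Pick a uniformly random day between start and end (inclusive)
    # and return it formatted as YYYY-MM-DD.
    span = (end - start).days
    return (start + timedelta(days=random.randint(0, span))).isoformat()
```

Because the output is ISO-formatted, plain string comparison also works for range checks (e.g. '1960-01-01' <= d <= '2019-01-01').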

To collect the values, create a numpy.ndarray with one row per column; it needs dtype=object to hold strings (np.ndarray(2, ...) would pass the second argument as a dtype, not a dimension):

import numpy as np
arr = np.empty((2, n_rows), dtype=object)  # n_rows: row count, e.g. df.count() for a pyspark dataframe

and then fill it with values in a loop

for y in range(n_rows):
    arr[0, y] = ''.join(random.choice(string.ascii_lowercase) for x in range(size))
    month = random.randint(1, 12)
    day = random.randint(1, [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31][month - 1])
    arr[1, y] = '%04d-%02d-%02d' % (random.randint(1960, 2018), month, day)

Upvotes: 1
