Reputation: 403
I am trying to generate a fake dataset for my research using the Faker library, but I am unable to link the person's gender to their first name. How can I do this? My function is given below.
def faker_categorical(num=1, seed=None):
    np.random.seed(seed)
    fake.seed_instance(seed)
    output = [
        {
            "gender": np.random.choice(["M", "F"], p=[0.5, 0.5]),
            "GivenName": fake.first_name_male() if "gender"=="M" else fake.first_name_female(),
            "Surname": fake.last_name(),
            "Zipcode": fake.zipcode(),
            "Date of Birth": fake.date_of_birth(),
            "country": np.random.choice(["United Kingdom", "France", "Belgium"]),
        }
        for x in range(num)
    ]
    return output

df = pd.DataFrame(faker_categorical(num=1000))
Upvotes: 4
Views: 6838
Reputation: 306
There is research on classifying names by gender; for example, John is 99.8% male and Maria is 99.8% female. You can read it here, and you can also download a .csv file that maps names to genders. When I needed fake data about people, I parsed that dataset: if the name was in it, I assigned the classified gender; if it wasn't (because of locale differences or something else), I just assigned np.random.choice(["MALE", "FEMALE"]). Hope this helps.
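The lookup-with-fallback idea could look roughly like the sketch below. The CSV layout (columns `name` and `gender`), the sample rows, and the helper names `load_name_genders` and `gender_for` are assumptions made for illustration; they are not the format of the dataset mentioned above. The standard-library `random` module stands in for `np.random.choice` so the sketch has no dependencies.

```python
import csv
import io
import random

# Hypothetical stand-in for the downloaded name-to-gender CSV file;
# the real file's column names and labels may differ.
SAMPLE_CSV = """name,gender
John,MALE
Maria,FEMALE
"""


def load_name_genders(handle):
    """Build a case-insensitive {name: gender} lookup from CSV rows."""
    return {
        row["name"].strip().lower(): row["gender"]
        for row in csv.DictReader(handle)
    }


def gender_for(name, lookup, rng=random):
    """Return the classified gender, or a random one if the name is unknown."""
    key = name.strip().lower()
    if key in lookup:
        return lookup[key]
    return rng.choice(["MALE", "FEMALE"])


lookup = load_name_genders(io.StringIO(SAMPLE_CSV))
print(gender_for("John", lookup))     # found in the lookup
print(gender_for("Zorblax", lookup))  # unknown name: falls back to a random choice
```

With a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file handle, and the column names adjusted to match.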
Upvotes: 1
Reputation: 189638
Your question is unclear, but I guess what you are looking for is a way to refer to the result from np.random.choice() from two different places in your code. As written, "gender"=="M" compares two string literals, which is always false, so every record gets a female name. Easy -- assign the result to a temporary variable, then refer to that variable from both places.
import numpy as np
from faker import Faker

fake = Faker()

def faker_categorical(num=1, seed=None):
    np.random.seed(seed)
    fake.seed_instance(seed)
    output = []
    for x in range(num):
        # Draw the gender once so the same value drives both fields.
        gender = np.random.choice(["M", "F"], p=[0.5, 0.5])
        output.append(
            {
                "gender": gender,
                "GivenName": fake.first_name_male() if gender == "M" else fake.first_name_female(),
                "Surname": fake.last_name(),
                "Zipcode": fake.zipcode(),
                "Date of Birth": fake.date_of_birth(),
                "country": np.random.choice(["United Kingdom", "France", "Belgium"]),
            })
    return output
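To see why the temporary variable matters without installing Faker, here is a dependency-free sketch of the same pattern. The name pools and the function name `make_people` are hypothetical stand-ins for `fake.first_name_male()` and `fake.first_name_female()`; only the draw-once, reuse-twice structure is the point.

```python
import random

# Hypothetical name pools standing in for Faker's name providers.
MALE_NAMES = ["John", "Peter", "Louis"]
FEMALE_NAMES = ["Maria", "Anna", "Claire"]


def make_people(num=1, seed=None):
    """Generate records whose given name is consistent with the gender."""
    rng = random.Random(seed)  # local generator, leaves global random state alone
    people = []
    for _ in range(num):
        gender = rng.choice(["M", "F"])  # drawn once per record
        pool = MALE_NAMES if gender == "M" else FEMALE_NAMES
        people.append({"gender": gender, "GivenName": rng.choice(pool)})
    return people


# Comparing the literal "gender" with "M" can never be true,
# which is why the original code always took the female branch.
print("gender" == "M")  # False

for person in make_people(num=5, seed=42):
    print(person)
```

Passing the same `seed` twice should give the same records, which makes the generated dataset reproducible.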
Upvotes: 5