Kevin

Reputation: 3239

Python - Create a data set with correlating numeric variables

I want to create a dataset where I have years of experience from 1 to 10 and have salary from 30k to 100k. I want these salaries to be random and to follow the years of experience. Sometimes a person with more experience may make less than a person with less experience.

For example:

years of experience | Salary
1                   | 30050
2                   | 28500
3                   | 36000
...
10                  | 100500

Here is what I have done so far:

import numpy as np
import random
import pandas as pd
years = np.linspace(1.0, 10.0, num=10)
salary = np.linspace(30000.0, 100000.0, num=10) + random.uniform(-1,1)*5000#plus/minus 5k
data = pd.DataFrame({'experience' : years, 'salary': salary})
print (data)

Which gives me:

   experience         salary
0         1.0   31060.903965
1         2.0   38838.681742
2         3.0   46616.459520
3         4.0   54394.237298
4         5.0   62172.015076
5         6.0   69949.792853
6         7.0   77727.570631
7         8.0   85505.348409
8         9.0   93283.126187
9        10.0  101060.903965

As we can see, there are no records where a person with more experience makes less than a person with less experience. How can I fix this? I also want to scale this up to 1000 rows.

Upvotes: 4

Views: 1795

Answers (5)

ksbg

Reputation: 3284

scikit-learn comes with some useful functions to generate correlated data, such as make_regression.

You could for example do:

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression

np.random.seed(0)
n_samples = 1000

X, y = make_regression(n_samples=n_samples, n_features=1, n_informative=1,
                       noise=80, random_state=0)

# Scale X (years of experience) to 0..10 range
X = np.interp(X, (X.min(), X.max()), (0, 10))

# Scale y (salary) to 30000..100000 range
y = np.interp(y, (y.min(), y.max()), (30000, 100000))

# To dataframe
df = pd.DataFrame({'experience': X.flatten(), 'salary': y})
print(df.head(10))

From what you describe, it seems as though you want to add some variance to the response. This can be done by adjusting the noise parameter. Let's plot it to make it more obvious:

from matplotlib import pyplot as plt

plt.scatter(X, y, color='blue', marker='.', label='Salary')
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.show()

For example, using noise=80: [image: scatter plot of salary against years of experience]

Or using noise=250: [image: scatter plot of salary against years of experience]

As a side note: this generates continuous values for "years of experience". If you want them rounded to integers instead, you can do that using X = np.rint(X).

Upvotes: 4

Franklinn Bincom

Reputation: 11

import numpy as np
import random
import pandas as pd
years = np.linspace(1.0, 10.0, num=10)
salary = np.random.randint(30000, 100000, 10)  # integer bounds; the upper bound is exclusive
data = pd.DataFrame({'experience' : years, 'salary': salary})
print (data)

Upvotes: 0

Tom Cornebize

Reputation: 1422

You can define the salary to be equal to the number of years times some coefficient, plus some constant value, plus some random value.

import numpy as np
import random
import pandas as pd
N = 1000
intercept = 30000
coeff = 7000
years = np.random.uniform(low=1, high=10, size=N)
salary = intercept + years*coeff + np.random.normal(loc=0, scale=10000, size=N)
data = pd.DataFrame({'experience' : years, 'salary': salary})
data.plot.scatter(x='experience', y='salary', alpha=0.3)

[image: scatter plot of salary against experience]
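
To get a feel for how the noise scale changes the data, here is a minimal sketch built on the same linear model. The helper names simulate and fraction_inverted are made up for this illustration, and the list of scales is just an assumption to show the trend. It prints the correlation and the share of neighbouring pairs (sorted by experience) where the more experienced person earns less.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
N = 1000
intercept = 30000
coeff = 7000

years = rng.uniform(low=1, high=10, size=N)

def simulate(noise_scale):
    """Return salaries as intercept + coeff*years + Gaussian noise."""
    noise = rng.normal(loc=0, scale=noise_scale, size=N)
    return intercept + coeff * years + noise

def fraction_inverted(years, salary):
    """Share of neighbouring pairs (sorted by experience) where the
    more experienced person earns less."""
    order = np.argsort(years)
    return float(np.mean(np.diff(salary[order]) < 0))

# Hypothetical noise scales, chosen only to show the trend
for scale in (0, 5000, 10000, 20000):
    salary = simulate(scale)
    corr = np.corrcoef(years, salary)[0, 1]
    print(scale, round(corr, 3), round(fraction_inverted(years, salary), 3))
```

With a scale of 0 the salaries follow experience exactly, so no pair is inverted. Larger scales lower the correlation and make inversions more common.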

Upvotes: 2

Domenico Vito Scalera

Reputation: 43

In this case I would change the line:

salary = np.linspace(30000.0, 100000.0, num=10) + random.uniform(-1,1)*5000#plus/minus 5k

I think it is better to keep the random part separate. That way you can change it easily and tune it to the values you want to reach.

Here is something I did:

import numpy as np
import random
import pandas as pd
years = np.linspace(1.0, 10.0, num=10)
random_list = [random.random()*1000*i*5 for i in range(10)]
print(random_list)
salary = np.linspace(30000.0, 100000.0, num=10)- random_list
data = pd.DataFrame({'experience' : years, 'salary': salary})
print (data)

The random component has more variance as the salary grows.
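
If you want the same idea for 1000 rows, a possible variant (not the code above, just a sketch) is to draw symmetric noise whose spread grows with experience. The spread of 1000 per year of experience is an arbitrary assumption; adjust it to taste.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible
n = 1000

years = rng.integers(1, 11, size=n)        # whole years, 1..10 inclusive
base = 30000 + (years - 1) * 70000 / 9     # 30k at 1 year, 100k at 10 years
spread = 1000 * years                      # assumed: std dev grows by 1k per year
salary = base + rng.normal(0, spread)      # independent draw for every row
```

Because the noise is drawn per row and centred on zero, salaries can land above or below the base line, and the scatter widens toward the experienced end.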

Upvotes: 1

FloMei

Reputation: 66

random.uniform(-1,1)*5000 means that your salary value will be shifted by somewhere between -5k and +5k, but since uniform is continuous in its output, the shift could well be very small. Seeing how the salary without the random element rises by 7777.77... per step up in experience, it is quite unlikely that a higher experience ends up with a lower salary. I would suggest you increase the factor on your random element; try random.uniform(-1,1) * 10000, for example. How high you crank that randomness is up to you and depends on how likely it should be to get an overpaid inexperienced person.
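
One more thing to watch, sketched below with only the standard library: in the question's code random.uniform is called once, so the same offset is added to every row and the order can never change, whatever the factor. Drawing one offset per row is what lets neighbours swap. The inversions helper is made up for this illustration.

```python
import random

random.seed(1)  # fixed seed only to make the sketch repeatable

n = 10
step = 70000 / (n - 1)  # about 7777.78 per year of experience
base = [30000 + i * step for i in range(n)]

# One shared offset (what the question's code does): every row shifts
# together, so the order of the salaries never changes.
shared = random.uniform(-1, 1) * 5000
salary_shared = [b + shared for b in base]

# One offset per row: neighbours can swap places if the spread is wide enough.
salary_per_row = [b + random.uniform(-1, 1) * 10000 for b in base]

def inversions(values):
    """Count neighbouring pairs where the later (more experienced) value is lower."""
    return sum(later < earlier for earlier, later in zip(values, values[1:]))

print(inversions(salary_shared))   # always 0
print(inversions(salary_per_row))  # varies with the seed
```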

Upvotes: 0
