I am working on hyperparameter tuning of neural networks and going through examples. I came across this code in one example:
train_X, train_Y = sklearn.datasets.make_circles(n_samples=300, noise=.05)
I understand that adding noise has a regularization effect on the data. The documentation says that it adds Gaussian noise. However, in the code above, I could not understand what it means to add noise=0.05 to the data. How does this affect the data mathematically?
I tried the code below. I can see the values changing, but I could not figure out how, for example, the first row of x was changed by adding noise=.05 to produce the corresponding row of x_1:
import numpy as np
import sklearn.datasets

np.random.seed(0)
x, y = sklearn.datasets.make_circles()
print(x[:5, :])
x_1, y_1 = sklearn.datasets.make_circles(noise=.05)
print(x_1[:5, :])
Output:
[[-9.92114701e-01 -1.25333234e-01]
[-1.49905052e-01 -7.85829801e-01]
[ 9.68583161e-01 2.48689887e-01]
[ 6.47213595e-01 4.70228202e-01]
[-8.00000000e-01 -2.57299624e-16]]
[[-0.66187208 0.75151712]
[-0.86331995 -0.56582111]
[-0.19574479 0.7798686 ]
[ 0.40634757 -0.78263011]
[-0.7433193 0.26658851]]
Upvotes: 0
Views: 1865
According to the documentation:
sklearn.datasets.make_circles(n_samples=100, *, shuffle=True, noise=None, random_state=None, factor=0.8)
Make a large circle containing a smaller circle in 2d. A simple toy dataset to visualize clustering and classification algorithms.

noise : double or None (default=None)
Standard deviation of Gaussian noise added to the data.
The statement make_circles(noise=0.05)
means that it is creating the circles with a little bit of random variation following a Gaussian distribution, also known as a normal distribution. A Gaussian distribution is characterized by a mean and a standard deviation. In this case, the call make_circles(noise=0.05)
means that the noise added to each coordinate is drawn from a Gaussian distribution with mean 0 and standard deviation 0.05.
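You can verify this directly. Assuming (as the current scikit-learn implementation does) that the base points are laid out deterministically when shuffle=False and the noise is added afterwards, the difference between a noiseless call and a noisy call is exactly the added Gaussian noise, and its standard deviation should be close to 0.05:

```python
import numpy as np
from sklearn.datasets import make_circles

# With shuffle=False the underlying points on the two circles are the same
# in both calls, so the only difference between the two arrays is the noise.
clean, _ = make_circles(n_samples=300, shuffle=False)
noisy, _ = make_circles(n_samples=300, shuffle=False, noise=0.05,
                        random_state=0)

diff = noisy - clean       # exactly the Gaussian noise that was added
print(diff.std())          # close to 0.05 (the noise parameter)
print(diff.mean())         # close to 0.0 (the noise has zero mean)
```

Each of the 600 x 2 coordinates gets its own independent draw from N(0, 0.05), which is why every row of x_1 in your experiment differs from the corresponding clean row by a small random amount.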
Let's invoke the function, check out its output, and see the effect of changing the parameter noise.
I'll borrow liberally from this nice tutorial on generating scikit-learn dummy data.
Let's first call make_circles()
with noise=0.0
and take a look at the data. I'll use a Pandas dataframe so we can see the data in a tabular way.
from sklearn.datasets import make_circles
import matplotlib.pyplot as plt
import pandas as pd
n_samples = 100
noise = 0.00
features, labels = make_circles(n_samples=n_samples, noise=noise)
df = pd.DataFrame(dict(x=features[:,0], y=features[:,1], label=labels))
print(df.head())
# x y label
# 0 -0.050232 0.798421 1
# 1 0.968583 0.248690 0
# 2 -0.809017 0.587785 0
# 3 -0.535827 0.844328 0
# 4 0.425779 -0.904827 0
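Because noise=0.0 here, every point sits exactly on one of the two circles: the outer circle with radius 1.0 (label 0) and the inner circle with radius factor=0.8, the default (label 1). A quick check of the distances from the origin confirms this:

```python
import numpy as np
from sklearn.datasets import make_circles

features, labels = make_circles(n_samples=100, noise=0.0)
radii = np.hypot(features[:, 0], features[:, 1])  # distance from the origin

# Only two distinct radii appear: 1.0 (outer, label 0) and 0.8 (inner, label 1)
assert np.allclose(radii[labels == 0], 1.0)
assert np.allclose(radii[labels == 1], 0.8)
```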
You can see that make_circles
returns data instances, where each instance is a point with two features, x and y, and a label. Let's plot them to see how they actually look.
# Collect the points together by label, either 0 or 1
grouped = df.groupby('label')
colors = {0:'red', 1:'blue'}
fig, ax = plt.subplots(figsize=(7,7))
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', marker='.', label=key, color=colors[key])
plt.title('Points')
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.grid()
plt.show()
So it looks like it's creating two concentric circles, each with a different label.
Let's increase the noise to noise=0.05
and see the result:
n_samples = 100
noise = 0.05 # <--- The only change
features, labels = make_circles(n_samples=n_samples, noise=noise)
df = pd.DataFrame(dict(x=features[:,0], y=features[:,1], label=labels))
grouped = df.groupby('label')
colors = {0:'red', 1:'blue'}
fig, ax = plt.subplots(figsize=(7,7))
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', marker='.', label=key, color=colors[key])
plt.title('Points')
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.grid()
plt.show()
It looks like the noise is added to each of the x, y coordinates to make each point shift around a little bit. When we inspect the code for make_circles()
we see that the implementation does exactly that:
def make_circles( ..., noise=None, ...):
    ...
    if noise is not None:
        X += generator.normal(scale=noise, size=X.shape)
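To make the whole mechanism concrete, here is a simplified sketch of what make_circles does internally (shuffling omitted; the real implementation is in sklearn/datasets/_samples_generator.py): place points at evenly spaced angles on two circles, then, if noise is given, add an independent N(0, noise) draw to every coordinate. The function name and seed parameter here are my own, not part of scikit-learn:

```python
import numpy as np

def make_circles_sketch(n_samples=100, noise=None, factor=0.8, seed=None):
    """Simplified re-implementation of make_circles (no shuffling)."""
    rng = np.random.default_rng(seed)
    n_out = n_samples // 2        # points on the outer circle (label 0)
    n_in = n_samples - n_out      # points on the inner circle (label 1)

    angles_out = np.linspace(0, 2 * np.pi, n_out, endpoint=False)
    angles_in = np.linspace(0, 2 * np.pi, n_in, endpoint=False)

    outer = np.column_stack([np.cos(angles_out), np.sin(angles_out)])
    inner = factor * np.column_stack([np.cos(angles_in), np.sin(angles_in)])

    X = np.vstack([outer, inner])
    y = np.hstack([np.zeros(n_out, dtype=int), np.ones(n_in, dtype=int)])

    if noise is not None:
        # The line the question is about: every coordinate gets an
        # independent draw from a Gaussian with standard deviation `noise`.
        X += rng.normal(scale=noise, size=X.shape)
    return X, y

X, y = make_circles_sketch(300, noise=0.05, seed=0)
```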
So now we've seen two visualizations of the dataset with two values of noise
. But two visualizations isn't cool. You know what's cool? Five visualizations with the noise increasing progressively by 10x. Here's a function that does it:
def make_circles_plot(n_samples, noise):
    assert n_samples > 0
    assert noise >= 0
    # Use make_circles() to generate random data points with noise.
    features, labels = make_circles(n_samples=n_samples, noise=noise)
    # Create a dataframe for later plotting.
    df = pd.DataFrame(dict(x=features[:,0], y=features[:,1], label=labels))
    grouped = df.groupby('label')
    colors = {0:'red', 1:'blue'}
    fig, ax = plt.subplots(figsize=(5, 5))
    for key, group in grouped:
        group.plot(ax=ax, kind='scatter', x='x', y='y', marker='.', label=key, color=colors[key])
    plt.title('Points with noise=%f' % noise)
    plt.xlim(-2, 2)
    plt.ylim(-2, 2)
    plt.grid()
    plt.tight_layout()
    plt.show()
Calling the above function with different values of noise,
you can clearly see that increasing this value makes the points scatter more around their circles, i.e. it makes them more "noisy", exactly as we should expect intuitively.
for noise in [0.0, 0.01, 0.1, 1.0, 10.0]:
    make_circles_plot(500, noise)
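Instead of eyeballing the plots, the same trend can be quantified (a rough check of my own, not part of the plots above): for each noise level, measure how far each point strays from its ideal radius; for small noise values, the spread of that radial error tracks the noise parameter closely.

```python
import numpy as np
from sklearn.datasets import make_circles

for noise in [0.0, 0.01, 0.1]:
    features, labels = make_circles(n_samples=500, noise=noise, random_state=0)
    radii = np.hypot(features[:, 0], features[:, 1])
    ideal = np.where(labels == 0, 1.0, 0.8)   # outer vs inner circle radius
    spread = (radii - ideal).std()            # grows with the noise level
    print(f"noise={noise:5.2f}  radial spread={spread:.3f}")
```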
Upvotes: 4