stackoverflowuser2010

Reputation: 40909

What does it mean to add Gaussian noise = 0.05 in scikit-learn make_circles()? How will it affect the data?

I am working on hyperparameter tuning of neural networks and going through examples. I came across this code in one example:

train_X, train_Y = sklearn.datasets.make_circles(n_samples=300, noise=.05)

I understand that adding noise has a regularization effect on the data. The documentation says that it adds Gaussian noise. However, I could not understand what it means to add 0.05 noise to the data in the code above. How would this affect the data mathematically?

I tried the code below. I can see the values changing, but I cannot figure out how. For example, how did the first row of x (array 1) turn into the corresponding row of x_1 (array 2) when noise=.05 was added?

import numpy as np
import sklearn.datasets

np.random.seed(0)
x,y = sklearn.datasets.make_circles()
print(x[:5,:])

x_1,y_1 = sklearn.datasets.make_circles(noise= .05)
print(x_1[:5,:])

Output:

[[-9.92114701e-01 -1.25333234e-01]
 [-1.49905052e-01 -7.85829801e-01]
 [ 9.68583161e-01  2.48689887e-01]
 [ 6.47213595e-01  4.70228202e-01]
 [-8.00000000e-01 -2.57299624e-16]]

[[-0.66187208  0.75151712]
 [-0.86331995 -0.56582111]
 [-0.19574479  0.7798686 ]
 [ 0.40634757 -0.78263011]
 [-0.7433193   0.26658851]]

Upvotes: 0

Views: 1865

Answers (1)

stackoverflowuser2010

Reputation: 40909

According to the documentation:

sklearn.datasets.make_circles(n_samples=100, *, shuffle=True, noise=None, random_state=None, factor=0.8)
Make a large circle containing a smaller circle in 2d. A simple toy dataset to visualize clustering and classification algorithms.

noise: double or None (default=None) Standard deviation of Gaussian noise added to the data.

The call make_circles(noise=0.05) generates points on two circles and then adds a small random shift, drawn from a Gaussian distribution (also known as a normal distribution), to every coordinate. A Gaussian distribution is described by a mean and a standard deviation. In this case the mean is 0 and the standard deviation is 0.05, so roughly 95% of the shifts in x and in y are smaller than 0.1 (two standard deviations).
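To make this concrete, here is a minimal sketch (not part of the original example) that measures the shift directly. It assumes shuffle=False so that row i of the clean data and row i of the noisy data refer to the same point. With the default shuffle=True, two separate calls return their rows in different orders, so they cannot be compared row by row. The names clean, noisy and shift, and the choice random_state=0, are mine and only for illustration.

```python
import numpy as np
from sklearn.datasets import make_circles

# shuffle=False keeps the rows in the same order in both calls,
# so row i of `noisy` is row i of `clean` plus noise.
clean, _ = make_circles(n_samples=300, shuffle=False, noise=None, random_state=0)
noisy, _ = make_circles(n_samples=300, shuffle=False, noise=0.05, random_state=0)

# The per-coordinate shift is exactly the noise that was added.
shift = noisy - clean

print("mean of shift:", shift.mean())  # should be close to 0
print("std of shift: ", shift.std())   # should be close to 0.05
print("share of shifts smaller than 0.1:", np.mean(np.abs(shift) < 0.1))
```

The printed standard deviation will not be exactly 0.05, because it is estimated from a finite sample of 600 coordinates.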

Let's invoke the function, check out its output, and see what effect changing the noise parameter has. I'll borrow liberally from this nice tutorial on generating scikit-learn dummy data.

Let's first call make_circles() with noise=0.0 and take a look at the data. I'll use a Pandas dataframe so we can see the data in a tabular way.

from sklearn.datasets import make_circles
import matplotlib.pyplot as plt
import pandas as pd

n_samples = 100
noise = 0.00

features, labels = make_circles(n_samples=n_samples, noise=noise)
df = pd.DataFrame(dict(x=features[:,0], y=features[:,1], label=labels))
print(df.head())
#           x         y  label
# 0 -0.050232  0.798421      1
# 1  0.968583  0.248690      0
# 2 -0.809017  0.587785      0
# 3 -0.535827  0.844328      0
# 4  0.425779 -0.904827      0

You can see that make_circles returns data instances where each instance is a point with two features, x and y, and a label. Let's plot them to see what they actually look like.

# Collect the points together by label, either 0 or 1
grouped = df.groupby('label')

colors = {0:'red', 1:'blue'}
fig, ax = plt.subplots(figsize=(7,7))
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', marker='.', label=key, color=colors[key])
plt.title('Points')
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.grid()
plt.show()

[Image: scatter plot of the generated points with noise=0.0]

So it looks like it's creating two concentric circles, each with a different label.
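A quick numerical check supports this. The sketch below assumes the default factor=0.8 from the signature quoted above and uses np.hypot to compute each point's distance from the origin. The variable names and random_state=0 are my own choices.

```python
import numpy as np
from sklearn.datasets import make_circles

features, labels = make_circles(n_samples=100, noise=None, random_state=0)

# Two features (x and y) per point, one label per point.
print(features.shape)
print(np.bincount(labels))  # number of points per label

# Distance of every point from the origin.
radius = np.hypot(features[:, 0], features[:, 1])
print("label 0 radius:", radius[labels == 0].min(), radius[labels == 0].max())
print("label 1 radius:", radius[labels == 1].min(), radius[labels == 1].max())
```

Without noise, the label 0 points should all sit at distance 1 from the origin and the label 1 points at distance 0.8 (the factor), up to floating-point rounding.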

Let's increase the noise to noise=0.05 and see the result:

n_samples = 100
noise = 0.05  # <--- The only change

features, labels = make_circles(n_samples=n_samples, noise=noise)
df = pd.DataFrame(dict(x=features[:,0], y=features[:,1], label=labels))

grouped = df.groupby('label')

colors = {0:'red', 1:'blue'}
fig, ax = plt.subplots(figsize=(7,7))
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', marker='.', label=key, color=colors[key])
plt.title('Points')
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.grid()
plt.show()

[Image: scatter plot of the generated points with noise=0.05]

It looks like the noise is added to each of the x, y coordinates to make each point shift around a little bit. When we inspect the code for make_circles() we see that the implementation does exactly that:

def make_circles( ..., noise=None, ...):

    ...
    if noise is not None:
        X += generator.normal(scale=noise, size=X.shape)
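If the implementation is as quoted, the noisy output can be reproduced by hand. The sketch below rests on two assumptions that are not stated above: that with shuffle=False the random generator is used only for the noise, and that an integer random_state is turned into np.random.RandomState(seed) internally. If either assumption does not hold in your scikit-learn version, the final check will print False.

```python
import numpy as np
from sklearn.datasets import make_circles

seed, noise = 0, 0.05

# Reference: let make_circles add the noise itself.
noisy, _ = make_circles(n_samples=100, shuffle=False, noise=noise, random_state=seed)

# By hand: start from the noise-free points and add Gaussian noise
# with mean 0 and standard deviation `noise` to every coordinate.
clean, _ = make_circles(n_samples=100, shuffle=False, noise=None, random_state=seed)
generator = np.random.RandomState(seed)
manual = clean + generator.normal(scale=noise, size=clean.shape)

matches = np.allclose(noisy, manual)
print(matches)
```

In other words, each output coordinate is the noise-free coordinate plus an independent draw from a normal distribution with mean 0 and standard deviation equal to noise.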

So now we've seen two visualizations of the dataset with two values of noise. But two visualizations isn't cool. You know what's cool? Five visualizations, starting with no noise and then increasing the noise by 10x at each step. Here's a function that draws the plot for a given noise value:

def make_circles_plot(n_samples, noise):

    assert n_samples > 0
    assert noise >= 0

    # Use make_circles() to generate random data points with noise.
    features, labels = make_circles(n_samples=n_samples, noise=noise)

    # Create a dataframe for later plotting.
    df = pd.DataFrame(dict(x=features[:,0], y=features[:,1], label=labels))
    grouped = df.groupby('label')
    colors = {0:'red', 1:'blue'}

    fig, ax = plt.subplots(figsize=(5, 5))

    for key, group in grouped:
        group.plot(ax=ax, kind='scatter', x='x', y='y', marker='.', label=key, color=colors[key])
    plt.title('Points with noise=%f' % noise)
    plt.xlim(-2, 2)
    plt.ylim(-2, 2)
    plt.grid()
    plt.tight_layout()
    plt.show()

Calling the function above with different values of noise clearly shows that increasing this value makes the points move around more, i.e. it makes them more "noisy", exactly as we would expect intuitively.

for noise in [0.0, 0.01, 0.1, 1.0, 10.0]:
    make_circles_plot(500, noise)

[Images: five scatter plots, one for each noise value: 0.0, 0.01, 0.1, 1.0, 10.0]
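As the noise grows, the two circles blur into each other. One way to put a number on that, as a rough sketch of my own rather than anything provided by scikit-learn, is to guess each label from the point's distance to the origin and see how often the guess is right. The threshold 0.9 (halfway between the radii 0.8 and 1.0) and the helper name radius_rule_accuracy are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_circles

def radius_rule_accuracy(noise, n_samples=500, seed=0):
    """Share of points whose label is guessed correctly from the radius alone."""
    features, labels = make_circles(n_samples=n_samples, noise=noise, random_state=seed)
    radius = np.hypot(features[:, 0], features[:, 1])
    # Inner circle (label 1) has radius 0.8, outer circle (label 0) has radius 1.0.
    guess = (radius < 0.9).astype(int)
    return float(np.mean(guess == labels))

accuracy = {noise: radius_rule_accuracy(noise) for noise in [0.0, 0.01, 0.1, 1.0, 10.0]}
for noise, acc in accuracy.items():
    print(f"noise={noise:<5} accuracy={acc:.2f}")
```

With no noise the rule should be right every time. At noise=10.0 it should be close to a coin flip, because the noise is far larger than the 0.2 gap between the two circles.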

Upvotes: 4
