Reputation: 19
I am trying to generate random points for an NxM dataset, using the lowest value of each of the M columns as the lower bound and the highest value of each column as the upper bound.
Here is the code:
import numpy as np

def generate_random_points(dataset, dimension_based=False):
    dimension = dataset.shape[1]
    if not dimension_based:
        # Use the smaller of floor(sqrt(n_features)) and floor(sqrt(n_samples)).
        row_size = int(np.floor(np.sqrt(min(dimension, dataset.shape[0]))))
    else:
        row_size = int(np.floor(np.sqrt(dimension)))
    # Sample uniformly inside the per-column [min, max] bounding box.
    generated_spikes = np.random.uniform(low=np.min(dataset, axis=0),
                                         high=np.max(dataset, axis=0),
                                         size=(row_size, dimension))
    return generated_spikes
But the problem is that most of the random points lie on the boundaries or edges of the dataset space rather than being uniformly and evenly distributed.
Here is a plot of one example; the random points are the black ones.
I have also tried running PCA, taking the low and high ranges in PCA space, and mapping them back with inverse_transform. As might be expected, the random points are still not uniformly and evenly distributed.
def generate_random_points(dataset, dimension_based=False):
    dimension = dataset.shape[1]
    # PCA cannot have more components than min(n_samples, n_features).
    dimension_pca = min(dataset.shape[0], dataset.shape[1])
    # perform_PCA and perform_PCA_inverse are helper functions not shown in this post.
    pca, dataset_pca = perform_PCA(dimension_pca, dataset)
    low_pca = np.min(dataset_pca, axis=0)
    high_pca = np.max(dataset_pca, axis=0)
    # Map the PCA-space bounds back to the original feature space.
    low = perform_PCA_inverse(pca, low_pca)
    high = perform_PCA_inverse(pca, high_pca)
    if not dimension_based:
        row_size = int(np.floor(np.sqrt(min(dimension, dataset.shape[0]))))
        generated_spikes = np.random.uniform(low=low,
                                             high=high,
                                             size=(row_size, dimension))
        return generated_spikes
    else:
        row_size = int(np.floor(np.sqrt(dimension)))
        generated_spikes = np.random.uniform(low=np.min(dataset, axis=0),
                                             high=np.max(dataset, axis=0),
                                             size=(row_size, dimension))
        return generated_spikes
How can I solve this so that the randomly generated points are more evenly distributed instead of piling up on two edges, and also do not overlap?
I need something like this:
the red one is the position required for the black points which are crossed
P.S:
Both images are PCA representations of a dataset with shape (46, 2730), i.e. 46 rows and 2730 dimensions.
I was thinking of using the second answer to this question: "algorithm for generating uniformly distributed random points on the N-sphere". However, I am not sure how to calculate the radius (R) of an N-dimensional dataset, or whether it even makes sense to apply that answer here.
Please help!
Upvotes: 1
Views: 821
Reputation: 2744
To better understand the question and give some hints on possible causes of your problem, I am posting this as an answer because it does not fit into a comment.
Let me restate your problem in my own words; please correct me or edit your question to make your case clearer.
You are given two sets of points, with N_1 and N_2 points respectively, in an M-dimensional space. The points in each set may be normally distributed in that space, e.g. if you create them with make_blobs. You then identify the minimum value x_{i,min,1} and maximum value x_{i,max,1} of each coordinate x_i over the points in set N_1, and generate random points inside the M-dimensional rectangle
[x_{1,min,1},x_{1,max,1}] x [x_{2,min,1},x_{2,max,1}] x ... x [x_{M,min,1},x_{M,max,1}]
Then you apply PCA and plot the 2 principal components. Your observation is that your random points are not uniformly distributed within the range where your data lies.
If your data follows an M-dimensional normal distribution (in the example below, M=2), the minimum and maximum values can lie several standard deviations away from the mean. When you generate random points uniformly between the minimum and maximum values, they evenly cover ranges where you have barely any data points. Take the following as an example. It generates 10,000 data points with a normal distribution in 2D, and then generates 5 further points with a uniform distribution in the rectangle drawn around the data points.
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(3)
x_data = np.random.normal(size=10000)
x_min = x_data.min()
x_max = x_data.max()
y_data = np.random.normal(size=10000)
y_min = y_data.min()
y_max = y_data.max()
random_x = np.random.uniform(x_min, x_max, size=5)
random_y = np.random.uniform(y_min, y_max, size=5)
fig, ax = plt.subplots()
ax.plot(x_data, y_data, "o",
        label="data points with normal distribution")
ax.plot(random_x, random_y, "o", label="random points with uniform distribution")
ax.legend()
plt.show()
The output of the code is shown below:
Although the random points are uniformly distributed, they can look as if they sit only at the edges of the data distribution. In higher dimensions the effect gets worse. Consider the M-dimensional unit ball and the cube that encloses it. The ratio of the ball's volume to the cube's volume tends to 0 as M grows, so if you generate random points uniformly in the cube while your data is (mainly) located inside the ball, the fraction of random points that fall outside the region of your data tends to 1. However, once you drop the extra dimensions with PCA, you cannot fully see this in the 2D plot.
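To put numbers on the volume argument, here is a minimal sketch that evaluates the ratio between the volume of the ball of radius 1 and the volume of the enclosing cube [-1, 1]^M. The function name `ball_to_cube_volume_ratio` is made up for this illustration; the formula is the standard one, V_ball = pi^(M/2) / Gamma(M/2 + 1) and V_cube = 2^M.

```python
import math

def ball_to_cube_volume_ratio(m: int) -> float:
    """Volume of the radius-1 ball in m dimensions divided by the volume of [-1, 1]^m."""
    # V_ball = pi^(m/2) / Gamma(m/2 + 1) and V_cube = 2^m.
    # Work with logarithms so that large m does not overflow.
    log_ratio = (m / 2) * math.log(math.pi) - math.lgamma(m / 2 + 1) - m * math.log(2)
    return math.exp(log_ratio)

for m in (2, 3, 10, 50):
    print(m, ball_to_cube_volume_ratio(m))
```

For M=2 the ratio is pi/4 (about 0.785) and for M=3 it is pi/6 (about 0.524). By M=10 it has already dropped below 1%, so almost every uniform point drawn in the cube lands outside the ball.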
If I understood your problem correctly and the problem is just an illusion, please rephrase your question accordingly so that others can address your specific request.
If you want your random points to better reflect the distribution of your data, you need to fit a model to your data, e.g. assume it is normally distributed. Estimate the mean and standard deviation, and generate random points from a distribution with those properties.
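A minimal sketch of that idea, assuming each column can be modelled as an independent normal distribution. This is a simplification: with only 46 rows and 2730 columns a full covariance matrix cannot be estimated reliably, so correlations between columns are ignored here. The function name `sample_like_data` and the synthetic stand-in data are hypothetical and not taken from your code.

```python
import numpy as np

def sample_like_data(dataset, n_points, seed=None):
    """Draw points from independent normals fitted to each column of `dataset`."""
    rng = np.random.default_rng(seed)
    mean = dataset.mean(axis=0)          # per-column mean
    std = dataset.std(axis=0, ddof=1)    # per-column sample standard deviation
    # loc and scale broadcast across the rows of the requested output shape.
    return rng.normal(loc=mean, scale=std, size=(n_points, dataset.shape[1]))

# Synthetic stand-in for the real dataset (assumption: 46 rows, 20 columns).
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=(46, 20))
points = sample_like_data(data, n_points=6, seed=1)
print(points.shape)  # (6, 20)
```

Points generated this way concentrate where the data is dense instead of spreading evenly over the bounding box. If your data is clearly not normal, a different model would be needed.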
"the red one is the position required for the black points which are crossed": Could you please replot your figure, provide more examples, and rephrase the legend?
Upvotes: 0