Reputation: 5059
I have a tab separated dataset that looks like this
Labels t1 t2 t3
gene1 0.000000E+00 0.000000E+00 1.138501E-01
gene2 0.000000E+00 0.000000E+00 9.550272E-02
gene3 0.000000E+00 1.851936E-02 1.019907E-01
gene4 8.212816E-02 0.000000E+00 6.570984E+00
gene5 1.282434E-01 0.000000E+00 6.240799E+00
gene6 2.918929E-01 8.453281E-01 3.387610E+00
gene7 0.000000E+00 1.923038E-01 0.000000E+00
gene8 1.135057E+00 0.000000E+00 2.491100E+00
gene9 7.935625E-01 1.070320E-01 2.439292E+00
gene10 5.046790E+00 0.000000E+00 2.459273E+00
gene11 3.293614E-01 0.000000E+00 2.380152E+00
gene12 0.000000E+00 0.000000E+00 1.474757E-01
gene13 0.000000E+00 0.000000E+00 1.521591E-01
gene14 0.000000E+00 9.968809E-02 8.387166E-01
gene15 0.000000E+00 1.065761E-01 0.000000E+00
What I want: a 3D scatterplot in which the outliers are labelled.
What I have done so far, in R:
I read the file and plot the three columns like this:
library("scatterplot3d")
temp1 <- read.table("tempdata.txt", header=TRUE)
scatterplot3d(temp1$t1, temp1$t2, temp1$t3)
I would like the labels of the outliers to be displayed, at least for the top 250. Alternatively, how can I get the labels of the top 250 outliers into a variable for further analysis?
Could anyone please guide me through this in R?
Solutions in Python are also welcome.
Upvotes: 0
Views: 1493
Reputation: 46530
Here it is in matplotlib:
import numpy as np
from matplotlib import pyplot, cm
from mpl_toolkits.mplot3d import Axes3D
data = np.genfromtxt('genes.txt', usecols=range(1,4), skip_header=1) # skip the "Labels t1 t2 t3" header row
N = len(data)
nout = N // 4 # top 25% in magnitude (integer division, so it can be used in a slice)
outliers = np.argsort(np.sqrt(np.sum(data**2, 1)))[-nout:]
outlies = np.zeros(N)
outlies[outliers] = 1 # now an array of 0 or 1, depending on whether an outlier
fig = pyplot.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(*data.T, c=cm.jet(outlies)) # color by whether outlies.
pyplot.show()
In the resulting plot, red points are far from the origin and blue points are nearby.
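The code above colours the outliers but does not keep their labels. A minimal sketch of one way to get them, assuming the same definition of "outlier" (largest distance from the origin): the helper names `load_genes`, `top_outliers` and `plot_labeled` and the in-memory `SAMPLE` are made up for illustration, and the sample only reuses a few rows from the question.

```python
import numpy as np

# Hypothetical stand-in for the tab-separated file (a few rows from the question).
SAMPLE = (
    "Labels\tt1\tt2\tt3\n"
    "gene4\t8.212816E-02\t0.000000E+00\t6.570984E+00\n"
    "gene5\t1.282434E-01\t0.000000E+00\t6.240799E+00\n"
    "gene7\t0.000000E+00\t1.923038E-01\t0.000000E+00\n"
    "gene10\t5.046790E+00\t0.000000E+00\t2.459273E+00\n"
    "gene15\t0.000000E+00\t1.065761E-01\t0.000000E+00\n"
)

def load_genes(lines):
    """Split tab-separated lines (first one is the header) into labels and data."""
    rows = [line.split("\t") for line in lines if line.strip()]
    labels = np.array([r[0] for r in rows[1:]])
    data = np.array([[float(v) for v in r[1:4]] for r in rows[1:]])
    return labels, data

def top_outliers(data, k):
    """Indices of the k rows farthest from the origin, farthest first."""
    dist = np.linalg.norm(data, axis=1)
    return np.argsort(dist)[::-1][:k]

def plot_labeled(data, labels, idx):
    """Scatter all points, colour the selected ones red and write their labels."""
    from matplotlib import pyplot
    from mpl_toolkits.mplot3d import Axes3D  # only needed on older matplotlib
    fig = pyplot.figure()
    ax = fig.add_subplot(111, projection="3d")
    is_out = np.zeros(len(data), dtype=bool)
    is_out[idx] = True
    ax.scatter(*data[~is_out].T, c="blue")
    ax.scatter(*data[is_out].T, c="red")
    for i in idx:
        ax.text(data[i, 0], data[i, 1], data[i, 2], labels[i])
    pyplot.show()

labels, data = load_genes(SAMPLE.splitlines())
idx = top_outliers(data, 2)
outlier_labels = labels[idx].tolist()  # labels kept in a variable for further analysis
print(outlier_labels)
# plot_labeled(data, labels, idx)  # uncomment to draw the labelled plot
```

For the real file, `load_genes(open('genes.txt'))` should work the same way, and `k` can be raised to 250, although that many text labels will probably overlap.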
Upvotes: 1
Reputation: 1492
Plotting 250 labels in one plot is not a good choice, since it will make the plot impossible to read. If you want to label outliers in your plot, they should be far enough from the rest of your data points to be identified uniquely. You can, however, save the largest 250 z values and their corresponding labels in a matrix for further analysis. I would do something like this:
# Create some random data
library("scatterplot3d")
temp1 <- as.data.frame(matrix(rnorm(900), ncol=3))
temp1$labels <- paste0("gene", seq_len(nrow(temp1)))  # one unique label per row
colnames(temp1) <- c("t1", "t2", "t3", "labels")
# get the outliers
zz.outlier <- sort(temp1$t3, TRUE)[1:5]
ix <- which(temp1$t3 %in% zz.outlier)
outlier.matrix <- temp1[ix, ]
# create the plot and mark the points
sd3 <- scatterplot3d(temp1$t1, temp1$t2, temp1$t3)
sd3$points3d(temp1$t1[ix],temp1$t2[ix],temp1$t3[ix], col="red")
text(sd3$xyz.convert(temp1$t1[ix],temp1$t2[ix],temp1$t3[ix]),
labels=temp1$labels[ix])
Here I also marked the points in red. This lets you mark a somewhat larger number of outliers than text labels alone while still keeping the plot fairly readable. It will, however, also fail if there are multiple points in close proximity.
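For anyone following the Python route instead, the same selection step (keep the rows with the largest values in one column, together with their labels) might look roughly like this. It is only a sketch on random stand-in data: `top_k_by_column` is a made-up helper name, and unlike the `%in%` lookup above it always returns exactly `k` rows, even when values are tied.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed so the sketch is repeatable
data = rng.normal(size=(300, 3))    # random stand-in for the t1, t2, t3 columns
labels = np.array([f"gene{i + 1}" for i in range(len(data))])

def top_k_by_column(data, labels, col, k):
    """Return (labels, rows) for the k rows with the largest values in column col."""
    order = np.argsort(data[:, col])[::-1][:k]  # indices, largest value first
    return labels[order], data[order]

# Column index 2 corresponds to t3; use k=250 for the case in the question.
top_labels, top_rows = top_k_by_column(data, labels, col=2, k=5)
print(top_labels.tolist())
```

`top_labels` and `top_rows` stay aligned row by row, so they can be passed on to further analysis or used to mark and label the points in a plot.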
Upvotes: 1