Reputation: 1299
My R experience is pretty limited. I'm working on some textual analysis of ~11,000 survey comments. I'm guided primarily by the Silge & Robinson "Text Mining with R" book. Anyway....
There are several different locations in the dataset and I have split the data into a number of frames representing "Location_X" and "Not_X", "Location_Y" and "Not_Y" etc. I've then calculated the relative frequency of words (starting with individual words) and wind up with a dataframe named scatter_frequency that looks like
+---------------+--------------+--------------+
| word | location_x | not_x |
+---------------+--------------+--------------+
| acceptance | 1.538130e-04 | 8.972231e-05 |
| accepted | 1.076691e-04 | 1.794446e-04 |
| accepting | 1.768850e-04 | 1.794446e-04 |
| access | 8.305903e-04 | 8.075008e-04 |
| accessible | 1.461224e-04 | 4.486115e-05 |
| accident | 7.690651e-06 | 4.486115e-05 |
| accolades | 7.690651e-06 | 4.486115e-05 |
| accommodate | 2.307195e-05 | 4.486115e-05 |
| accommodating | 1.538130e-05 | 4.486115e-05 |
| accomplish | 4.460578e-04 | 7.626396e-04 |
| accomplished | 3.614606e-04 | 3.140281e-04 |
+---------------+--------------+--------------+
and so on for ~4,000 rows
I then plot
ggplot(scatter_frequency, aes(x=location_x, y=not_x)) +
geom_abline(color="gray40", lty=2) +
geom_jitter(alpha=0.1, size=2.5, width=0.3, height=0.3) +
geom_text(aes(label=word), check_overlap = TRUE, vjust=1.5) +
scale_x_log10(labels=percent_format()) +
scale_y_log10(labels=percent_format()) +
scale_color_gradient(limits=c(0, 0.001),
low="darkslategray4", high="gray75") +
theme(legend.position = "none") +
labs(x="Location X", y="Not X")
and produce this plot
you can see where I blurred out some identifying terms, but this is pretty representative.
So far so good...we can now see which terms appeared frequently (further to the right) and more frequently in one data set than the other (further away from the line). What's interesting are the terms that appear furthest from the line, as they are either conspicuously common or uncommon at location x. The terms near the line aren't all that interesting. This was a survey on management, so it's no surprise "leadership" and "management" appear. But the fact that "abusive" is much more common at location x than the other locations IS interesting. And I'd like to know what word corresponds to the dot that is well off the line, below and to the left of "shop".
So my question is: is there a programmatic way to restrict labeling to those "interesting" points? As in, choose which points are labeled based on their distance from the line?
This may not be the best formed question...thanks in advance for your patience.
Upvotes: 1
Views: 1532
Reputation: 1383
As we discussed yesterday, with the values for slope and intercept you can add a column with the abline values (the line's y value at each point's x, which here is location_x):
scatter_frequency$reg = slope * scatter_frequency$location_x + intercept
Then choose the distance from the line value you would find interesting and make a subset of your data that has that distance or more like :
minDist = 0.2
labeledPoints = subset(scatter_frequency, abs(scatter_frequency$not_x - scatter_frequency$reg)>minDist)
Then use that subset with geom_text for your labels:
geom_text(data = labeledPoints, aes(label=word), check_overlap = TRUE, vjust=1.5)
You could also directly make a column that is the distance from the line and make the subset in geom_text with it:
scatter_frequency$dist = abs(scatter_frequency$not_x - (slope * scatter_frequency$location_x + intercept))
geom_text(data = subset(scatter_frequency, scatter_frequency$dist > minDist), aes(label=word), check_overlap = TRUE, vjust=1.5)
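As a minimal sketch of the steps above, here is the subsetting part on its own. It assumes the line is the identity line (slope 1, intercept 0, which is what geom_abline() draws by default) and uses five rows copied from the question's table. The threshold of 5e-5 is a hypothetical value chosen only because the relative frequencies are around 1e-4; the threshold has to be on the same scale as the data, otherwise no point passes the filter.

```r
# Assumed sample: five rows copied from the question's table
scatter_frequency <- data.frame(
  word = c("acceptance", "accepting", "access", "accessible", "accomplish"),
  location_x = c(1.538130e-04, 1.768850e-04, 8.305903e-04, 1.461224e-04, 4.460578e-04),
  not_x = c(8.972231e-05, 1.794446e-04, 8.075008e-04, 4.486115e-05, 7.626396e-04),
  stringsAsFactors = FALSE
)

# Assumption: identity line, matching geom_abline() with no arguments
slope <- 1
intercept <- 0

# y value of the line at each point's x (x is location_x in the plot)
scatter_frequency$reg <- slope * scatter_frequency$location_x + intercept

# Vertical distance from each point to the line
scatter_frequency$dist <- abs(scatter_frequency$not_x - scatter_frequency$reg)

# Hypothetical threshold, on the same scale as the relative frequencies
minDist <- 5e-5
labeledPoints <- subset(scatter_frequency, dist > minDist)

print(labeledPoints$word)
```

labeledPoints can then be passed as the data argument of geom_text, with aes(label=word).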
Upvotes: 1
Reputation: 24139
Here is a solution which calculates the shortest distance from each point to the line and then keeps only those points farther from the line than a chosen threshold.
library(ggplot2)
library(scales)
# Define the distance from each point to the line.
# The line has a slope of 1 and an intercept of 0.
dist <- abs(scatter_frequency$location_x - scatter_frequency$not_x)/sqrt(2)
# Choose the distance threshold for labeling
toplot <- which(dist > 3e-5)
# Edit the geom_text call to use the reduced dataset of labels.
ggplot(scatter_frequency, aes(x=location_x, y=not_x)) +
geom_abline(color="gray40", lty=2) +
geom_point(alpha=0.1, size=2.5) +
geom_text(data=scatter_frequency[toplot,], aes(x=location_x, y=not_x, label=word), check_overlap = TRUE, vjust=1.5) +
scale_x_log10(labels=percent_format()) +
scale_y_log10(labels=percent_format()) +
scale_color_gradient(limits=c(0, 0.001),
low="darkslategray4", high="gray75") +
theme(legend.position = "none") +
labs(x="Location X", y="Not X")
The choice of which points get labeled may not look correct on the plot, but that is because the distance is computed on the linear scale while the plot uses log-log axes.
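If the labels should match what the eye sees on the log-log plot, one option is to measure the distance in log10 units instead. The sketch below is only an illustration: it uses five rows copied from the question's table, and the names log_dist, log_threshold and toplot_log as well as the threshold of 0.3 are assumptions, not part of the original code.

```r
# Assumed sample: five rows copied from the question's table
scatter_frequency <- data.frame(
  word = c("accepting", "access", "accessible", "accident", "accomplish"),
  location_x = c(1.768850e-04, 8.305903e-04, 1.461224e-04, 7.690651e-06, 4.460578e-04),
  not_x = c(1.794446e-04, 8.075008e-04, 4.486115e-05, 4.486115e-05, 7.626396e-04),
  stringsAsFactors = FALSE
)

# Perpendicular distance to the identity line, measured in log10 units,
# so it corresponds to the visual distance on log-log axes
log_dist <- abs(log10(scatter_frequency$location_x) -
                log10(scatter_frequency$not_x)) / sqrt(2)

# Hypothetical threshold in log10 units; tune it for the full dataset
log_threshold <- 0.3
toplot_log <- which(log_dist > log_threshold)

# Rows to hand to geom_text(data = ...)
labeled <- scatter_frequency[toplot_log, ]
print(labeled$word)
```

Using scatter_frequency[toplot_log,] in place of scatter_frequency[toplot,] in the geom_text call should then label the points that look far from the line on the plot.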
Upvotes: 1
Reputation: 316
Nice problem.
You should have included the packages you are using, to make the example complete.
Your abline is the identity line, so the points you consider interesting are those where the absolute value of the difference between the x and the y coordinates is above a certain threshold.
You are using geom_jitter, but that interferes with the labeling done by geom_text_repel, which I decided to use to avoid overlapping and to produce line segments connecting labels to points. So I use geom_point instead.
When you apply this to your entire dataset, you might have to experiment with the arguments nudge_x, nudge_y, force, max.iter and others of geom_text_repel. Check the docs.
Here is the code:
library(tidyverse)
library(ggrepel)
library(scales)
#>
#> Attaching package: 'scales'
#> The following object is masked from 'package:purrr':
#>
#> discard
#> The following object is masked from 'package:readr':
#>
#> col_factor
scatter_frequency <- tibble(
word = c(
'acceptance',
'accepted',
'accepting',
'access',
'accessible',
'accident',
'accolades',
'accommodate',
'accommodating',
'accomplish',
'accomplished'
),
location_x = c(
1.538130e-04,
1.076691e-04,
1.768850e-04,
8.305903e-04,
1.461224e-04,
7.690651e-06,
7.690651e-06,
2.307195e-05,
1.538130e-05,
4.460578e-04,
3.614606e-04
),
not_x = c(
8.972231e-05,
1.794446e-04,
1.794446e-04,
8.075008e-04,
4.486115e-05,
4.486115e-05,
4.486115e-05,
4.486115e-05,
4.486115e-05,
7.626396e-04,
3.140281e-04
)
)
# Select n points most distant from the line
n <- 5
important <- scatter_frequency %>%
mutate(lsqd = (abs(log10(location_x) - log10(not_x)))) %>%
top_n(n, wt = lsqd)
ggplot(scatter_frequency, aes(x=location_x, y=not_x)) +
geom_abline(color="gray40", lty=2) +
geom_point(alpha=0.1, size=2.5) +
geom_text_repel(
data = important,
aes(label = word),
min.segment.length = 0,
# nudge_x = -.5,
# nudge_y = .5,
force = 50,
max.iter = 5000
) +
scale_x_log10(limits = c(.000001, .01), labels=percent_format()) +
scale_y_log10(limits = c(.000001, .01), labels=percent_format()) +
scale_color_gradient(limits=c(0, 0.001),
low="darkslategray4", high="gray75") +
theme(legend.position = "none") +
labs(x="Location X", y="Not X")
Created on 2019-12-13 by the reprex package (v0.3.0)
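A possible variation, sketched in base R: since the question cares about terms that are either conspicuously common or conspicuously uncommon at location X, the sign of the log ratio can be kept so that the n most extreme words are picked on each side of the line. The six rows are copied from the question's table; the names log_ratio, under_represented and over_represented, and the choice n = 2, are assumptions made for this illustration.

```r
# Assumed sample: six rows copied from the question's table
scatter_frequency <- data.frame(
  word = c("acceptance", "accepted", "accessible", "accident", "accomplish", "access"),
  location_x = c(1.538130e-04, 1.076691e-04, 1.461224e-04, 7.690651e-06, 4.460578e-04, 8.305903e-04),
  not_x = c(8.972231e-05, 1.794446e-04, 4.486115e-05, 4.486115e-05, 7.626396e-04, 8.075008e-04),
  stringsAsFactors = FALSE
)

# Signed log10 ratio: positive means relatively more common at location X,
# negative means relatively less common there
scatter_frequency$log_ratio <- log10(scatter_frequency$location_x /
                                     scatter_frequency$not_x)

# Hypothetical number of labels per side of the line
n <- 2

# Sort from most under-represented to most over-represented at location X
ranked <- scatter_frequency[order(scatter_frequency$log_ratio), ]
under_represented <- head(ranked, n)
over_represented <- tail(ranked, n)

# Combined label set, usable as data = important in geom_text_repel
important <- rbind(under_represented, over_represented)
print(important[, c("word", "log_ratio")])
```

Keeping the two sides separate also makes it possible to color or nudge the labels differently for each side.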
Upvotes: 1