Reputation: 2800
I would like to extract numbers from this vector composed of 15 observations:
rs <- c("\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 4.0\n (1 rating)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 9 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 4.7\n (4 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 34 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 3.1\n (5 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 22 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 2.4\n (14 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 2,106 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 4.3\n (67 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 1,287 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 4.6\n (3 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 30 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n \n\n \n New\n \n\n\n \n\n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 0.0\n (0 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 8 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n \n\n \n Highest Rated\n \n\n\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 4.6\n (12 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 42 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 4.4\n (6 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 41 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 4.2\n (12 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 115 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 4.8\n (6 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 25 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 4.6\n (19 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 151 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 4.5\n (10 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 385 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 4.8\n (166 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 754 students enrolled\n \n \n\n\n \n\n ",
"\n \n \n \n\n \n \n \n\n \n \n \n \n \n \n \n \n \n \n 3.6\n (34 ratings)\n \n \n \n Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n \n \n \n \n \n \n\n\n \n \n \n\n \n \n 3,396 students enrolled\n \n \n\n\n \n\n "
)
As you can see, the 15 elements are very long and messy. However, a pattern is easy to identify within them. Every element contains 3 numbers (taking the first observation as an example):
4.0
(1 rating)
9 students enrolled
I would like to extract all these numerical values and create a data frame with 3 columns, one for each variable.
I've been checking several questions here on Stack Overflow, mainly about gsub() and the stringr package. However, I haven't been able to find a solution to my problem.
UPDATE
This is the code I tried:
as.numeric(str_extract(rs, "[0-9]+"))
as.numeric(str_extract(rs, "[0-9]+")[[1]])
as.numeric(str_extract(rs, "(?<=\\()[0-9]+(?=\\))"))
as.numeric(sapply(strsplit(rs, " "), "[[", 1))
Upvotes: 1
Views: 1964
Reputation: 18701
With extract() from tidyr, we can do:
library(dplyr)
library(tidyr)
data.frame(rs, stringsAsFactors = FALSE) %>%
  extract(rs, c("Rating", "Number_of_ratings", "Students_enrolled"),
          "(?s)(\\d\\.\\d).*?(\\d+)\\s*ratings?.*?(\\d+(?:,\\d+)?)\\s*students enrolled",
          convert = TRUE) %>%
  mutate(Students_enrolled = as.numeric(sub(",", "", Students_enrolled)))
Output:
   Rating Number_of_ratings Students_enrolled
1     4.0                 1                 9
2     4.7                 4                34
3     3.1                 5                22
4     2.4                14              2106
5     4.3                67              1287
6     4.6                 3                30
7     0.0                 0                 8
8     4.6                12                42
9     4.4                 6                41
10    4.2                12               115
11    4.8                 6                25
12    4.6                19               151
13    4.5                10               385
14    4.8               166               754
15    3.6                34              3396
Notes:
The regular expression looks complicated, but it really isn't. extract() takes the match from each capture group (the parts surrounded by parentheses) and turns each one into its own column.
(?s) is a modifier that turns on "DOTALL" mode, which lets the dot . also match newline characters.
(\\d\\.\\d) matches the Rating pattern.
(\\d+)\\s*ratings? matches the Number_of_ratings pattern but only extracts the digits (\\d+).
(\\d+(?:,\\d+)?)\\s*students enrolled matches the Students_enrolled pattern but only extracts the "digits with or without a comma" part.
convert = TRUE attempts to convert the resulting columns to their best data type, but because Students_enrolled contains commas, an extra mutate() is needed to strip them and convert the column to numeric.
Normally, extract() throws an error if the number of capture groups does not equal the number of output columns. Modifiers such as (?s) and non-capturing groups (?:...) do not count as capture groups, so the capture group count here matches the column count.
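To see the capture groups at work without tidyr, here is a minimal base R sketch that applies the same pattern with regexec() and regmatches(). The vector sample_rs is a made-up, abridged stand-in for the question's rs, and the names pattern, parts and result are only illustrative.

```r
# Abridged, made-up sample in the same shape as the question's `rs`
sample_rs <- c(
  "\n 4.0\n (1 rating)\n \n Course Ratings are calculated ...\n 9 students enrolled\n ",
  "\n 2.4\n (14 ratings)\n \n Course Ratings are calculated ...\n 2,106 students enrolled\n "
)

# Same pattern as in the answer: (?s) lets "." cross newlines,
# (?:...) groups without capturing
pattern <- "(?s)(\\d\\.\\d).*?(\\d+)\\s*ratings?.*?(\\d+(?:,\\d+)?)\\s*students enrolled"

# regexec() reports the full match followed by one entry per capture group
m <- regmatches(sample_rs, regexec(pattern, sample_rs, perl = TRUE))

# Keep only the three capture groups (drop the full match in position 1)
parts <- do.call(rbind, lapply(m, function(x) x[2:4]))

result <- data.frame(
  Rating            = as.numeric(parts[, 1]),
  Number_of_ratings = as.integer(parts[, 2]),
  Students_enrolled = as.numeric(gsub(",", "", parts[, 3])),
  stringsAsFactors  = FALSE
)
result
```

Each match has four elements (the full match plus three groups), which is why only positions 2 to 4 are kept.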
Upvotes: 3
Reputation: 78842
A solution using base R plus one dependency (stringi), with a commented, readable regex.
This also shows how to clean up the text for processing (in a way that you can re-use).
library(stringi)

do.call(
  rbind.data.frame,
  lapply(
    stri_match_all_regex(
      stri_replace_all_regex(
        stri_trim_both(rs),              # clean up outer spaces
        "[[:blank:][:space:]]+", " "     # clean up inner spaces
      ),
      "
([[:digit:]\\.]+)[[:space:]]+\\(([[:digit:],]+)[[:space:]]+rating[s]*\\) # pick up the rating and total number of ratings
[^[:digit:]]*([[:digit:],]+)[[:space:]]+student[s]*[[:space:]]+enrolled # pick up the number of students enrolled
",
      opts_regex = stri_opts_regex(comments = TRUE)  # allow whitespace and # comments in the pattern
    ),
    function(x) {
      # x is a 1 x 4 match matrix: the full match, then the three capture groups
      as.list(
        setNames(
          x[2:4], c("rating", "n_ratings", "enrolled")
        )
      )
    }
  )
)
Resulting in:
##    rating n_ratings enrolled
## 2     4.0         1        9
## 21    4.7         4       34
## 3     3.1         5       22
## 4     2.4        14    2,106
## 5     4.3        67    1,287
## 6     4.6         3       30
## 7     0.0         0        8
## 8     4.6        12       42
## 9     4.4         6       41
## 10    4.2        12      115
## 11    4.8         6       25
## 12    4.6        19      151
## 13    4.5        10      385
## 14    4.8       166      754
## 15    3.6        34    3,396
Turning the columns above from text into numbers is pretty basic after that.
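One possible way to do that conversion, sketched in base R. The data frame df below is a hypothetical two-row stand-in for the result printed above, not the actual object; the only real step is stripping the thousands separators before calling as.numeric().

```r
# Hypothetical stand-in for the extracted data frame (text columns)
df <- data.frame(
  rating    = c("4.0", "2.4"),
  n_ratings = c("1", "14"),
  enrolled  = c("9", "2,106"),
  stringsAsFactors = FALSE
)

# Remove commas, then convert each column to numeric;
# assigning into df[] keeps the data frame structure and column names
df[] <- lapply(df, function(col) as.numeric(gsub(",", "", col, fixed = TRUE)))

str(df)
```

Without the gsub() step, as.numeric("2,106") would return NA with a warning.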
Upvotes: 3
Reputation: 1230
The problem is that [0-9]+ only matches digits, so the "." is never treated as part of the number. In a string you have to match the digits and the decimal point explicitly.
library(stringr)
Rating <- as.numeric(str_extract(rs, "[0-9]\\.[0-9]"))
# "[0-9]+" keeps multi-digit counts such as "(14 ratings)" intact
NRatings <- str_extract(rs, "\\([0-9]+") %>% str_replace("\\(", "") %>% as.numeric()
I will let you figure out the last one based on these examples ;)
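For reference, one way the remaining column might be handled in the same style. This is only a sketch: sample_rs is a made-up, abridged stand-in for rs, the name Students is illustrative, and the lookahead approach is just one option among several.

```r
library(stringr)

# Abridged, made-up sample in the same shape as the question's `rs`
sample_rs <- c(
  "\n 4.0\n (1 rating)\n \n Course Ratings are calculated ...\n 9 students enrolled\n ",
  "\n 2.4\n (14 ratings)\n \n Course Ratings are calculated ...\n 2,106 students enrolled\n "
)

# Match digits and commas only when they are followed by "students enrolled",
# then drop the thousands separator before converting
Students <- str_extract(sample_rs, "[0-9,]+(?=\\s+students enrolled)") %>%
  str_remove_all(",") %>%
  as.numeric()

Students
```

The lookahead (?=...) checks for the trailing text without including it in the match, so no extra str_replace() step is needed to strip it.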
Upvotes: 2