antecessor

Reputation: 2800

Extracting numbers with decimals from large strings in R

I would like to extract numbers from this vector composed of 15 observations:

rs <- c("\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.0\n                    (1 rating)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            9 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.7\n                    (4 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            34 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    3.1\n                    (5 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            22 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    2.4\n                    (14 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            2,106 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.3\n                    (67 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            1,287 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.6\n                    (3 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            30 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n                \n                    \n\n    \n        New\n    \n\n\n                \n\n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    0.0\n                    (0 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            8 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n                \n                    \n\n    \n        Highest Rated\n    \n\n\n                \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.6\n                    (12 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            42 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.4\n                    (6 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            41 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.2\n                    (12 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            115 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.8\n                    (6 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            25 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.6\n                    (19 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            151 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.5\n                    (10 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            385 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    4.8\n                    (166 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            754 students enrolled\n        \n    \n\n\n    \n\n    ", 
"\n        \n            \n        \n\n        \n    \n        \n\n    \n        \n        \n            \n                \n            \n        \n        \n            \n                \n                    3.6\n                    (34 ratings)\n                \n                \n                    \n                        Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.\n                    \n                \n            \n        \n        \n    \n\n\n    \n    \n        \n\n    \n        \n            3,396 students enrolled\n        \n    \n\n\n    \n\n    "
)

As you can see, the 15 elements are very long and messy. However, a pattern is easy to identify: every element contains 3 numbers. In the first observation these are the rating (4.0), the number of ratings (1), and the number of students enrolled (9).

I would like to extract all these numerical values and create a data frame with 3 columns, one for each variable.

I've been checking several questions here on Stack Overflow, mainly focused on the use of gsub() and the stringr package. However, I haven't found a solution to my problem.

UPDATE

This is the code I tried:

library(stringr)

as.numeric(str_extract(rs, "[0-9]+"))
as.numeric(str_extract(rs, "[0-9]+")[[1]])
as.numeric(str_extract(rs, "(?<=\\()[0-9]+(?=\\))"))
as.numeric(sapply(strsplit(rs, " "), "[[", 1))
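For reference, a minimal sketch of what the first and third attempts return. The string `x` is a shortened, made-up stand-in for one element of `rs`, not the real data:

```r
library(stringr)

# Shortened, hypothetical stand-in for one element of rs
x <- "\n   4.7\n   (4 ratings)\n   Course Ratings are ...\n   2,106 students enrolled\n"

# Attempt 1: "[0-9]+" stops at the decimal point, so only the leading "4" of "4.7" is returned
first <- str_extract(x, "[0-9]+")

# Attempt 3: requires digits wrapped directly in parentheses, e.g. "(4)",
# but the text is "(4 ratings)", so nothing matches and the result is NA
third <- str_extract(x, "(?<=\\()[0-9]+(?=\\))")

as.numeric(first)
as.numeric(third)
```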

Upvotes: 1

Views: 1964

Answers (3)

acylam

Reputation: 18701

With extract from tidyr, we can do:

library(dplyr)
library(tidyr)

data.frame(rs, stringsAsFactors = FALSE) %>%
  extract(rs, c("Rating", "Number_of_ratings", "Students_enrolled"),
          "(?s)(\\d\\.\\d).*?(\\d+)\\s*ratings?.*?(\\d+(?:,\\d+)?)\\s*students enrolled", 
          convert = TRUE) %>%
  mutate(Students_enrolled = as.numeric(sub(",", "", Students_enrolled)))

Output:

   Rating Number_of_ratings Students_enrolled
1     4.0                 1                 9
2     4.7                 4                34
3     3.1                 5                22
4     2.4                14              2106
5     4.3                67              1287
6     4.6                 3                30
7     0.0                 0                 8
8     4.6                12                42
9     4.4                 6                41
10    4.2                12               115
11    4.8                 6                25
12    4.6                19               151
13    4.5                10               385
14    4.8               166               754
15    3.6                34              3396

Notes:

The regular expression looks complicated, but it's really not. extract takes the match from each capture group (the parts surrounded by parentheses) and turns each one into its own column.

  1. (?s) is a modifier which turns on the "DOTALL" mode. This allows the dot . to also match newline characters.

  2. (\\d\\.\\d) matches the Rating pattern

  3. (\\d+)\\s*ratings? matches the Number_of_ratings pattern ("rating" or "ratings") but only extracts the digits (\\d+)

  4. (\\d+(?:,\\d+)?)\\s*students enrolled matches the Students_enrolled pattern, but only extracts the "digits with or without comma" pattern

  5. convert = TRUE attempts to convert the resulting columns to their best data type, but since there are commas in Students_enrolled, an extra mutate is needed to convert it to numeric

Normally, extract throws an error if the number of capture groups is not equal to the number of output columns, but since modifiers (?s) and non-capturing groups (?:...) are not considered capture groups, the capture group count matches the column count.
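The same capture-group idea can be sketched in base R as a rough cross-check. This assumes an R version in which regexec() accepts perl = TRUE; the vector `x` holds two shortened, made-up stand-ins for elements of `rs`, and the names `pattern`, `groups` and `res` are only illustrative:

```r
# Two shortened, hypothetical stand-ins for elements of rs
x <- c("\n  4.0\n  (1 rating)\n  Course Ratings are ...\n  9 students enrolled\n",
       "\n  2.4\n  (14 ratings)\n  Course Ratings are ...\n  2,106 students enrolled\n")

# Same regular expression as in the extract() call above
pattern <- "(?s)(\\d\\.\\d).*?(\\d+)\\s*ratings?.*?(\\d+(?:,\\d+)?)\\s*students enrolled"

# regexec() + regmatches() return the full match followed by one entry per capture group
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Drop the full match and stack the three capture groups into a character matrix
groups <- do.call(rbind, lapply(m, function(g) g[-1]))

res <- data.frame(
  Rating            = as.numeric(groups[, 1]),
  Number_of_ratings = as.integer(groups[, 2]),
  Students_enrolled = as.numeric(gsub(",", "", groups[, 3]))  # strip thousands separator
)
res
```

The non-capturing group (?:,\\d+)? does not produce an entry of its own, which is why each match yields exactly three groups.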

Upvotes: 3

hrbrmstr

Reputation: 78842

A base R solution with a single dependency (stringi) and a commented, readable regex.

This also shows how to clean up the text for processing (in a way that you can re-use).

library(stringi)

do.call(
  rbind.data.frame,
  lapply(
    stri_match_all_regex(
      stri_replace_all_regex(
        stri_trim_both(rs),             # clean up outer spaces
        "[[:blank:][:space:]]+", " "    # clean up inner spaces
      ),
      "
([[:digit:]\\.]+)[[:space:]]+\\(([[:digit:],]+)[[:space:]]+rating[s]*\\)# pick up the rating and total number of ratings
[^[:digit:]]*([[:digit:],]+)[[:space:]]+student[s]*[[:space:]]+enrolled                          # pick up the number of students enrolled
",
      opts_regex = stri_opts_regex(comments = TRUE)
    ),
    function(x) {
      as.list(
        setNames(
          x[2:4], c("rating", "n_ratings", "enrolled")
        ),
        stringsAsFactors = FALSE
      )
    }
  )
)

Resulting in:

##    rating n_ratings enrolled
## 2     4.0         1        9
## 21    4.7         4       34
## 3     3.1         5       22
## 4     2.4        14    2,106
## 5     4.3        67    1,287
## 6     4.6         3       30
## 7     0.0         0        8
## 8     4.6        12       42
## 9     4.4         6       41
## 10    4.2        12      115
## 11    4.8         6       25
## 12    4.6        19      151
## 13    4.5        10      385
## 14    4.8       166      754
## 15    3.6        34    3,396

Converting the columns above into numbers is pretty basic after that.
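One possible way to do that conversion, as a minimal sketch: the data frame `df` below is a small hypothetical stand-in for the character result shown above, not the actual object.

```r
# Hypothetical stand-in for the character data frame shown above
df <- data.frame(
  rating    = c("4.0", "2.4"),
  n_ratings = c("1", "14"),
  enrolled  = c("9", "2,106"),
  stringsAsFactors = FALSE
)

# Strip thousands separators, then convert every column to numeric.
# Assigning into df[] keeps the data frame structure and column names.
df[] <- lapply(df, function(col) as.numeric(gsub(",", "", col, fixed = TRUE)))

str(df)
```

Removing the commas first matters because as.numeric("2,106") would return NA with a warning.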

Upvotes: 3

see24

Reputation: 1230

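The issue is that the pattern "[0-9]+" does not treat the "." as part of a number, since inside a string it is just another character, so the match stops at the decimal point. You need to match the digits and the decimal point explicitly.

library(stringr)

# one digit, a literal dot, one digit, e.g. "4.7"
Rating <- as.numeric(str_extract(rs, "[0-9]\\.[0-9]"))
# an opening parenthesis followed by one or more digits, e.g. "(14"; the parenthesis is then removed
NRatings <- str_extract(rs, "\\([0-9]+") %>% str_replace("\\(","") %>% as.numeric() 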

I will let you figure out the last one based on these examples ;)
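For completeness, one way the last value could be extracted along the same lines. This is only a sketch: `x` holds two shortened, made-up stand-ins for elements of `rs`, and `Enrolled` is an illustrative name.

```r
library(stringr)

# Shortened, hypothetical stand-ins for elements of rs
x <- c("\n  4.0\n  (1 rating)\n  9 students enrolled\n",
       "\n  2.4\n  (14 ratings)\n  2,106 students enrolled\n")

# A digit followed by any digits or commas, located directly before
# whitespace and the words "students enrolled" (checked with a lookahead)
enrolled_chr <- str_extract(x, "[0-9][0-9,]*(?=\\s+students enrolled)")

# Remove the thousands separators before converting to numeric
Enrolled <- as.numeric(str_replace_all(enrolled_chr, ",", ""))
Enrolled
```

Anchoring on the trailing words avoids picking up the rating or the rating count, which appear earlier in each string.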

Upvotes: 2
