Reputation: 1053

Regex: extract multiple numbers from single line of text

The problem

I have downloaded a series of tables from this website:

url <- "https://www.ato.gov.au/Rates/Individual-income-tax-for-prior-years/"
df <- url %>%
  read_html() %>%
  html_table() %>%
  setNames(., url %>%
             read_html() %>%
             html_nodes("caption") %>%
             html_text())

I need to extract numbers from the Tax on this income variable contained in the tables:

$`Resident tax rates for 2016-17`
      Taxable income                         Tax on this income
1        0 – $18,200                                        Nil
2  $18,201 – $37,000               19c for each $1 over $18,200
3  $37,001 – $87,000 $3,572 plus 32.5c for each $1 over $37,000
4 $87,001 – $180,000  $19,822 plus 37c for each $1 over $87,000
5  $180,001 and over $54,232 plus 45c for each $1 over $180,000

Ideally I would like to add three columns to each table with the following data:

new col 1: NA, 3572, 19822, 54232

new col 2: 19, 32.5, 37, 45

new col 3: 18200, 37000, 87000, 180000

Most tables follow the format of the table above but some have more rows, and some use " cents" - i.e. row 2, col 2 would then read:

19 cents for each $1 over $18,200

So the regex pattern needs to match 19c and 19 cents.

My (poor) attempt

str_extract_all(df$Resident tax rates for 2016-17[2], pattern = "(?<=\\$)\\d*,\\d{3}")

This pattern only matches the dollar amounts and returns a character vector (both undesirable).

Upvotes: 1

Answers (3)

Ronak Shah

Reputation: 389355

Here's using 3 different expressions for 3 columns

library(dplyr)
library(stringr)

df[[1]] %>%
   mutate(`Tax on this income` = gsub(",", "", `Tax on this income`), 
          col1 = str_extract(`Tax on this income`, "(?<=^\\$)\\d+"), 
          col2 = str_extract(`Tax on this income`, "\\d+.(\\d+)?(?=(\\s+)?c)"),
          col3 = str_extract(`Tax on this income`, "(?<=\\$)\\d+$"))

#      Taxable income                       Tax on this income  col1 col2   col3
#1        0 – $18,200                                      Nil  <NA> <NA>   <NA>
#2  $18,201 – $37,000              19c for each $1 over $18200  <NA>   19  18200
#3  $37,001 – $87,000 $3572 plus 32.5c for each $1 over $37000  3572 32.5  37000
#4 $87,001 – $180,000  $19822 plus 37c for each $1 over $87000 19822   37  87000
#5  $180,001 and over $54232 plus 45c for each $1 over $180000 54232   45 180000

Since "cents" starts with "c" as well, this will also works when you have "cents" instead of "c".

df[[19]] %>%
  mutate(`Tax on this income` = gsub(",", "", `Tax on this income`), 
          col1 = str_extract(`Tax on this income`, "(?<=^\\$)\\d+"), 
          col2 = str_extract(`Tax on this income`, "\\d+.(\\d+)?(?=(\\s+)?c)"),
          col3 = str_extract(`Tax on this income`, "(?<=\\$)\\d+$"))


#     Taxable income                           Tax on this income  col1 col2  col3
#1       $1 – $5,400                                          Nil  <NA> <NA>  <NA>
#2  $5,401 – $20,700              20 cents for each $1 over $5400  <NA>  20   5400
#3 $20,701 – $38,000  $3060 plus 34 cents for each $1 over $20700  3060  34  20700
#4 $38,001 – $50,000  $8942 plus 43 cents for each $1 over $38000  8942  43  38000
#5  $50,001 and over $14102 plus 47 cents for each $1 over $50000 14102  47  50000

As you have list of dataframes, you may use map to apply this to each one of them

purrr::map(df,.%>%
             mutate(`Tax on this income` = gsub(",", "", `Tax on this income`), 
             col1 = str_extract(`Tax on this income`, "(?<=^\\$)\\d+"), 
             col2 = str_extract(`Tax on this income`, "\\d+.(\\d+)?(?=(\\s+)?c)"),
             col3 = str_extract(`Tax on this income`, "(?<=\\$)\\d+$")))

Upvotes: 1

Onyambu

Reputation: 79388

pattern = "(?:\\$(\\S+)\\s*plus\\s*)?(\\d++[.]?\\d*)\\s*c.*\\$(\\d++,.*)|.*Nil.*"

clean = function(x){
  nw = gsub(',','',trimws(gsub(pattern,'\\1:\\2:\\3',x[,2],perl=T)))
  cbind(x,read.table(text = nw,fill=T,sep = ':',col.names = paste0('col',1:3)))
}

lapply(df,clean)

`Resident tax rates for 1983-84`
     Taxable income                                Tax on this income     col1 col2  col3
1       $1 – $4,594                                               Nil       NA   NA    NA
2  $4,595 – $19,499                  30 cents for each $1 over $4,595       NA   30  4595
3 $19,500 – $35,787  $4,471.50 plus 46 cents for each $1 over $19,500  4471.50   46 19500
4  $35,788 and over $11,963.98 plus 60 cents for each $1 over $35,788 11963.98   60 35788

Upvotes: 0

Emma

Reputation: 27763

This is pretty complicated to design an expression. Maybe, let's design two expressions for each table, then we would script the rest of our problem.

For table taxable income, for instance, we can start with an expression similar to:

(\d+)(\s+)?(\$?([0-9,]+)[\s–]+\$?([0-9,]+|and over)?)

Demo 1

and for the other table:

\s+Nil|\$?([0-9,]+)?\s+?(plus\s+)?([0-9,.]+)c?\s+for each\s+(\$1 over)\s+\$?([0-9,]+)

Demo 2

RegEx Circuit

jex.im visualizes regular expressions: