Reputation: 1053
I have downloaded a series of tables from this website:
url <- "https://www.ato.gov.au/Rates/Individual-income-tax-for-prior-years/"
df <- url %>%
read_html() %>%
html_table() %>%
setNames(., url %>%
read_html() %>%
html_nodes("caption") %>%
html_text())
I need to extract numbers from the Tax on this income
variable contained in the tables:
$`Resident tax rates for 2016-17`
Taxable income Tax on this income
1 0 – $18,200 Nil
2 $18,201 – $37,000 19c for each $1 over $18,200
3 $37,001 – $87,000 $3,572 plus 32.5c for each $1 over $37,000
4 $87,001 – $180,000 $19,822 plus 37c for each $1 over $87,000
5 $180,001 and over $54,232 plus 45c for each $1 over $180,000
Ideally I would like to add three columns to each table with the following data:
new col 1: NA, 3572, 19822, 54232
new col 2: 19, 32.5, 37, 45
new col 3: 18200, 37000, 87000, 180000
Most tables follow the format of the table above but some have more rows, and some use " cents" - i.e. row 2, col 2 would then read:
19 cents for each $1 over $18,200
So the regex pattern needs to match 19c and 19 cents.
str_extract_all(df$
Resident tax rates for 2016-17[2], pattern = "(?<=\\$)\\d*,\\d{3}")
This pattern only matches the dollar amounts and returns a character vector (both undesirable).
Upvotes: 1
Views: 376
Reputation: 388982
Here's using 3 different expressions for 3 columns
library(dplyr)
library(stringr)
df[[1]] %>%
mutate(`Tax on this income` = gsub(",", "", `Tax on this income`),
col1 = str_extract(`Tax on this income`, "(?<=^\\$)\\d+"),
col2 = str_extract(`Tax on this income`, "\\d+.(\\d+)?(?=(\\s+)?c)"),
col3 = str_extract(`Tax on this income`, "(?<=\\$)\\d+$"))
# Taxable income Tax on this income col1 col2 col3
#1 0 – $18,200 Nil <NA> <NA> <NA>
#2 $18,201 – $37,000 19c for each $1 over $18200 <NA> 19 18200
#3 $37,001 – $87,000 $3572 plus 32.5c for each $1 over $37000 3572 32.5 37000
#4 $87,001 – $180,000 $19822 plus 37c for each $1 over $87000 19822 37 87000
#5 $180,001 and over $54232 plus 45c for each $1 over $180000 54232 45 180000
Since "cents"
starts with "c"
as well, this will also works when you have "cents" instead of "c".
df[[19]] %>%
mutate(`Tax on this income` = gsub(",", "", `Tax on this income`),
col1 = str_extract(`Tax on this income`, "(?<=^\\$)\\d+"),
col2 = str_extract(`Tax on this income`, "\\d+.(\\d+)?(?=(\\s+)?c)"),
col3 = str_extract(`Tax on this income`, "(?<=\\$)\\d+$"))
# Taxable income Tax on this income col1 col2 col3
#1 $1 – $5,400 Nil <NA> <NA> <NA>
#2 $5,401 – $20,700 20 cents for each $1 over $5400 <NA> 20 5400
#3 $20,701 – $38,000 $3060 plus 34 cents for each $1 over $20700 3060 34 20700
#4 $38,001 – $50,000 $8942 plus 43 cents for each $1 over $38000 8942 43 38000
#5 $50,001 and over $14102 plus 47 cents for each $1 over $50000 14102 47 50000
As you have list of dataframes, you may use map
to apply this to each one of them
purrr::map(df,.%>%
mutate(`Tax on this income` = gsub(",", "", `Tax on this income`),
col1 = str_extract(`Tax on this income`, "(?<=^\\$)\\d+"),
col2 = str_extract(`Tax on this income`, "\\d+.(\\d+)?(?=(\\s+)?c)"),
col3 = str_extract(`Tax on this income`, "(?<=\\$)\\d+$")))
Upvotes: 1
Reputation: 79228
pattern = "(?:\\$(\\S+)\\s*plus\\s*)?(\\d++[.]?\\d*)\\s*c.*\\$(\\d++,.*)|.*Nil.*"
clean = function(x){
nw = gsub(',','',trimws(gsub(pattern,'\\1:\\2:\\3',x[,2],perl=T)))
cbind(x,read.table(text = nw,fill=T,sep = ':',col.names = paste0('col',1:3)))
}
lapply(df,clean)
`Resident tax rates for 1983-84`
Taxable income Tax on this income col1 col2 col3
1 $1 – $4,594 Nil NA NA NA
2 $4,595 – $19,499 30 cents for each $1 over $4,595 NA 30 4595
3 $19,500 – $35,787 $4,471.50 plus 46 cents for each $1 over $19,500 4471.50 46 19500
4 $35,788 and over $11,963.98 plus 60 cents for each $1 over $35,788 11963.98 60 35788
Upvotes: 0
Reputation: 27723
This is pretty complicated to design an expression. Maybe, let's design two expressions for each table, then we would script the rest of our problem.
For table taxable income, for instance, we can start with an expression similar to:
(\d+)(\s+)?(\$?([0-9,]+)[\s–]+\$?([0-9,]+|and over)?)
and for the other table:
\s+Nil|\$?([0-9,]+)?\s+?(plus\s+)?([0-9,.]+)c?\s+for each\s+(\$1 over)\s+\$?([0-9,]+)
jex.im visualizes regular expressions:
Upvotes: 0