KintensT

Reputation: 9

Places after decimal points discarded when extracting numbers from strings

I'd like to extract weight values from strings that also contain the unit and the measurement date, using the tidyverse.

My dataset looks like this:

df <- tibble(ID = c("A","B","C"), 
             Measurement = c("45kg^20221120", "51.5kg^20221015", "66.05kg^20221020"))

------
A tibble: 3 × 2
  ID    Measurement     
  <chr> <chr>           
1 A     45kg^20221120   
2 B     51.5kg^20221015 
3 C     66.05kg^20221020

I'm using stringr (part of the tidyverse) with a regular expression.

library(tidyverse)
df %>%
  mutate(Weight = as.numeric(str_extract(Measurement, "(\\d+\\.\\d+)|(\\d+)(?=kg)")))

----------
A tibble: 3 × 3
  ID    Measurement      Weight
  <chr> <chr>             <dbl>
1 A     45kg^20221120      45  
2 B     51.5kg^20221015    51.5
3 C     66.05kg^20221020   66.0

The second decimal place of C (.05) isn't extracted. What's wrong with my code? Any answers or comments are welcome.

Thanks.

Upvotes: 0

Views: 72

Answers (2)

Ruam Pimentel

Reputation: 1329

Yes, it was extracted. The tibble print method just rounds it to 66.0 for display (it shows three significant digits by default).

You can see the full value if you convert the result to a data.frame or open it with View().

Solution

Convert the result to a data.frame:

df %>%
  mutate(Weight = as.numeric(str_extract(Measurement, "(\\d+\\.\\d+)|(\\d+)(?=kg)"))) %>% 
  as.data.frame()

Output

#>   ID      Measurement Weight
#> 1  A    45kg^20221120  45.00
#> 2  B  51.5kg^20221015  51.50
#> 3  C 66.05kg^20221020  66.05

Or open it in the data viewer:

df %>%
  mutate(Weight = as.numeric(str_extract(Measurement, "(\\d+\\.\\d+)|(\\d+)(?=kg)"))) %>% 
  View()

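Two more ways to confirm that the value is stored in full, as a minimal sketch. It assumes the input column is named Measurement, as in the question's code. pull() returns the plain numeric vector, and the pillar.sigfig option controls how many significant digits tibble prints (the default is 3).

```r
library(tibble)
library(dplyr)
library(stringr)

# Same data as in the question; the column name Measurement is an assumption
df <- tibble(ID = c("A", "B", "C"),
             Measurement = c("45kg^20221120", "51.5kg^20221015", "66.05kg^20221020"))

res <- df %>%
  mutate(Weight = as.numeric(str_extract(Measurement, "(\\d+\\.\\d+)|(\\d+)(?=kg)")))

# pull() drops the tibble wrapper, so base R's default printing applies
weights <- pull(res, Weight)
print(weights)

# Ask tibble to show more significant digits for the rest of the session
options(pillar.sigfig = 4)
print(res)
```

Setting pillar.sigfig changes only how values are displayed. The stored numbers are the same either way.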

Upvotes: 1

AndS.

Reputation: 8120

You could pull all the pieces out of the string at once with tidyr's extract():

library(tidyverse)

df <- tibble(ID = c("A","B","C"), 
             Weight = c("45kg^20221120", "51.5kg^20221015", "66.05kg^20221020"))

df |>
  extract(col = Weight, 
          into = c("weight", "unit", "date"),
          regex = "(.*)(kg)\\^(.*$)", 
          remove = TRUE, 
          convert = TRUE) |>
  mutate(date = lubridate::ymd(date))
#> # A tibble: 3 x 4
#>   ID    weight unit  date      
#>   <chr>  <dbl> <chr> <date>    
#> 1 A       45   kg    2022-11-20
#> 2 B       51.5 kg    2022-10-15
#> 3 C       66.0 kg    2022-10-20

Note that, as stated in the comments, the .05 is present in the data; it just isn't printed.
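If you want a column to always print with two decimal places, one option is tibble's num() helper. This is only a sketch: it assumes tibble 3.1.0 or later, and the df_fixed name is made up for illustration.

```r
library(tibble)

# Weights as extracted from the example strings
weights <- c(45, 51.5, 66.05)

# num() attaches display instructions to the column;
# digits = 2 requests two digits after the decimal point when printing
df_fixed <- tibble(ID = c("A", "B", "C"),
                   weight = num(weights, digits = 2))

print(df_fixed)
```

The formatting lives on the column itself, so it applies wherever the tibble is printed without changing any global options.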

Upvotes: 1
