how to extract some data from cells?

I would like to extract number information in cells. My data looks like this.

      item       stock
PRE 24DX4SX15G    200      
PLS 12RX10SX15G   200
ADU 24PX200ML     200
NIS 18PX40SX11G   200
REF 500GX12D      200

i want to extract the numbers which located besides alphabet D,R and P. I want to use this number to do multiplication with the stock. For example like this.

          item       stock    pcs    total
    PRE 24DX4SX15G    200      24    4800
    PLS 12RX10SX15G   200      12    2400 
    ADU 24PX200ML     200      24    4800
    NIS 18PX40SX11G   200      18    3600
    REF 500GX12D      200      12    2400

anyone know how to extract the numbers? thanks very much in advance

Upvotes: 0

Views: 65

Answers (2)

yarnabrina
yarnabrina

Reputation: 1666

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stringr)

dataset <- read.table(text = '      item       stock
PRE 24DX4SX15G    200      
PLS 12RX10SX15G   200
ADU 24PX200ML     200
NIS 18PX40SX11G   200
REF 500GX12D      200',
                      header = TRUE)

dataset %>%
  mutate(pcs = as.numeric(x = str_sub(string = str_extract(string = item,
                                                           pattern = "\\d+[DRP]"),
                                      start = 1,
                                      end = -2)),
         total = (stock * pcs))
#>          item stock pcs total
#> 1  24DX4SX15G   200  24  4800
#> 2 12RX10SX15G   200  12  2400
#> 3   24PX200ML   200  24  4800
#> 4 18PX40SX11G   200  18  3600
#> 5    500GX12D   200  12  2400

Created on 2019-07-12 by the reprex package (v0.3.0)

Hope this helps.

Upvotes: 2

January
January

Reputation: 17110

I would use a suitable regular expression:

df$pcs <- gsub("^(.*[^0-9]|)([0-9]+)[DPR].*", "\\2", df$item)
df$pcs <- as.numeric(df$pcs)

Explanation:

  • [DPR] means we are matching one character which is either D, P or R
  • [^0-9] means we are matching anything but one of the digits 0-9
  • The parentheses create two regular expression groups, first being a prefix and the second being the actual number we want to match.
  • we can use the second group in substitute by preceding 2 with a backslash; only to get a backslash in R, we need to enter two backslashes: \\2
  • The number of interest (preceding a D, P or R) can be found at different locations. Either it is at the beginning, in which case we should match ^([0-9]+)[DPR], or it is somewhere inside the text. Therefore we have to first match the beginning of string (^) and then alternatively match either nothing or anything that does not end in a number (.*[^0-9]).

EDIT: with the stringr library (as @yamabrina showed in the other answer) the regular expression is much simpler:

library(stringr)
df$pcs <- as.numeric(str_sub(str_extract(df$item, "[0-9]+[DPR]"), 1, -2))

Upvotes: 1

Related Questions