Franky
Franky

Reputation: 761

function in R (with dplyr)

I made a R script that works for me, but I know I could make it better (prettier) with the use of functions. Unfortunately my varied attempts weren't successful. Could anyone guide me on the right track? Below is my original script.

library(dplyr)

apples <- read.csv("JoburgApples.csv")

grs <- apples %>% filter(grepl("GRANNY", ProductName), tvaluesold >10000) %>% mutate(Variety = "Granny Smith")
cpp <- apples %>% filter(grepl("PINK", ProductName), tvaluesold >10000) %>% mutate(Variety = "Cripps Pink")
top <- apples %>% filter(grepl("TOP", ProductName), tvaluesold >10000) %>% mutate(Variety = "Top Red")
gld <- apples %>% filter(grepl("GOLDEN", ProductName), tvaluesold >10000) %>% mutate(Variety = "Golden Delicious")
ski <- apples %>% filter(grepl("STARKING", ProductName), tvaluesold >10000) %>% mutate(Variety = "Starking")
bra <- apples %>% filter(grepl("BRAEBURN", ProductName), tvaluesold >10000) %>% mutate(Variety = "Braeburn")

apples <- rbind(grs, cpp, top, gld, ski, bra)

s70 <- apples %>% filter(grepl("70$", ProductName)) %>% mutate(Count = 70)
s80 <- apples %>% filter(grepl("80$", ProductName)) %>% mutate(Count = 80)
s90 <- apples %>% filter(grepl("90$", ProductName)) %>% mutate(Count = 90)
s100 <- apples %>% filter(grepl("100$", ProductName)) %>% mutate(Count = 100)
s110 <- apples %>% filter(grepl("110$", ProductName)) %>% mutate(Count = 110)
s120 <- apples %>% filter(grepl("120$", ProductName)) %>% mutate(Count = 120)
s135 <- apples %>% filter(grepl("135$", ProductName)) %>% mutate(Count = 135)
s150 <- apples %>% filter(grepl("150$", ProductName)) %>% mutate(Count = 150)
s165 <- apples %>% filter(grepl("165$", ProductName)) %>% mutate(Count = 165)

apples <- rbind(s70, s80, s90, s100, s110, s120, s135, s150, s165)

EDIT. Link to the .csv file (https://github.com/fderyckel/showcases/blob/master/JoburgMarket/JoburgApples.csv)

> UnitMass  ProductName tvaluesold  tquantitysold   tkgsold avgprice    highestprice    date
> 18.50KG CARTON    CRIPPS PINK,CL 1,100    200 1   18.5    200 200 06/11/14
> 18.50KG CARTON    CRIPPS RED,CL 1,70  200 1   18.5    200 200 06/11/14
> 18.50KG CARTON    TOPRED,CL 1,180 1300    10  185 130 130 06/11/14
> 18.50KG CARTON    GOLDEN DELICIOUS,CL 1,90    22700   108 1998    210.19  240 06/11/14
> 18.50KG CARTON    STARKING,CL 1,80    17920   115 2127.5  155.83  230 06/11/14
> 18.50KG CARTON    GRANNY SMITH,CL 1,135   1800    12  222 150 150 06/11/14
> 18.50KG CARTON    TOPRED,CL 1,90  1730    12  222 144.17  190 06/11/14
> 18.50KG CARTON    CRIPPS PINK,CL 1,90 2600    13  240.5   200 200 06/11/14
> 18.50KG CARTON    GOLDEN DELICIOUS,CL 1,120   22800   136 2516    167.65  180 06/11/14
> 18.50KG CARTON    GOLDEN DELICIOUS,CL 1,135   21810   136 2516    160.37  180 06/11/14
> 18.50KG CARTON    GRANNY SMITH,CL 1,70    2380    14  259 170 220 06/11/14
> 18.50KG CARTON    GRANNY SMITH,CL 1,165   1200    15  277.5   80  80  06/11/14

Thanks in advance for your help.

François

Upvotes: 0

Views: 210

Answers (2)

akrun
akrun

Reputation: 887891

Assuming that you want to replace the "prefix" part of "ProductName" with a custom name, you could use mgsub from qdap. This will replace the elements in "ProductName" with the modified elements. A logical indx is created based on the tvaluesold, then Variety column is created with NA values, Change the rows in Variety that are TRUE for the indx with the modified ProductName. If you want a new dataset, it is easier to subset by !is.na(apples$Variety)

library(qdap)
indx <- apples$tvaluesold>10000
v1 <- c('GRANNY SMITH', 'CRIPPS PINK', 'TOPRED',
                    'GOLDEN DELICIOUS','STARKING','BRAEBURN')

 v2 <- c('Granny Smith', 'Cripps Pink', 'Top Red','Golden Delicious',
       'Starking', 'Braeburn')
 apples$Variety <- sub(',.*', '', apples$ProductName)

 apples[indx, 'Variety'] <- mgsub(v1,v2, apples[indx,'Variety']  )
 apples1 <- apples[indx,]

 head(apples1,3)
 #       UnitMass               ProductName tvaluesold tquantitysold tkgsold
 #4 18.50KG CARTON  GOLDEN DELICIOUS,CL 1,90      22700           108  1998.0
 #5 18.50KG CARTON          STARKING,CL 1,80      17920           115  2127.5
 #9 18.50KG CARTON GOLDEN DELICIOUS,CL 1,120      22800           136  2516.0
 #  avgprice highestprice       date          Variety
 #4   210.19          240 2014-11-06 Golden Delicious
 #5   155.83          230 2014-11-06         Starking
 #9   167.65          180 2014-11-06 Golden Delicious

Or using only base R

 apples$Variety <-  unname(setNames(v2,v1)[sub(',.*', '', apples$ProductName)])
 apples1 <- apples[indx,]

For the second case, you can extract the numbers after the last , using sub and then use %in% to create a logical indx2.

 val1 <-  as.numeric(sub(".*,", "", apples$ProductName))
 indx2 <-  val1 %in% c(70,80,90,100,110,120,135,150,165)

 apples$Count <- NA
 apples[indx2,'Count'] <-  val1[indx2]
 apples2 <- apples[!is.na(apples$Count),]
 head(apples2,3)
 #       UnitMass              ProductName tvaluesold tquantitysold tkgsold
 #1 18.50KG CARTON     CRIPPS PINK,CL 1,100        200             1    18.5
 #2 18.50KG CARTON       CRIPPS RED,CL 1,70        200             1    18.5
 #4 18.50KG CARTON GOLDEN DELICIOUS,CL 1,90      22700           108  1998.0
 #  avgprice highestprice       date          Variety Count
 #1   200.00          200 2014-11-06      CRIPPS PINK   100
 #2   200.00          200 2014-11-06       CRIPPS RED    70
 #4   210.19          240 2014-11-06 Golden Delicious    90

Update

You could also use dplyr to create the columns

library(dplyr)
apples %>%
       filter(tvaluesold >10000) %>% 
       mutate(Variety= setNames(v2,v1)[sub(',.*', '', ProductName)])

To create Count column

 apples %>%
        filter(indx2) %>%
        mutate(Count=val1[indx2]) 

Update2

If you want to extract the "first" and "last" of the "ProductName", another option is

 library(tidyr)
 res1 <-  extract(apples, ProductName, c("Variety", "Count"),
                   '([^,]+),[^,]+,([^,]+)') %>%
                        filter(tvaluesold >10000L & !is.na(as.numeric(Count))

data

 url <- 'https://raw.githubusercontent.com/fderyckel/showcases/master/JoburgMarket/JoburgApples.csv'
 library(RCurl)

 x <- getURL(url)
 apples <- read.csv(textConnection(x), stringsAsFactors=FALSE)

Upvotes: 5

talat
talat

Reputation: 70336

Maybe all you need is this:

apples %>%
  filter(tvaluesold > 10000L & grepl(".*\\d+$", ProductName)) %>%
  mutate(Variety = sub(",.*", "", ProductName),
         Count = as.numeric(sub(".*,", "", ProductName)))

Upvotes: 4

Related Questions