Reputation: 761
I made a R script that works for me, but I know I could make it better (prettier) with the use of functions. Unfortunately my varied attempts weren't successful. Could anyone guide me on the right track? Below is my original script.
library(dplyr)
apples <- read.csv("JoburgApples.csv")
grs <- apples %>% filter(grepl("GRANNY", ProductName), tvaluesold >10000) %>% mutate(Variety = "Granny Smith")
cpp <- apples %>% filter(grepl("PINK", ProductName), tvaluesold >10000) %>% mutate(Variety = "Cripps Pink")
top <- apples %>% filter(grepl("TOP", ProductName), tvaluesold >10000) %>% mutate(Variety = "Top Red")
gld <- apples %>% filter(grepl("GOLDEN", ProductName), tvaluesold >10000) %>% mutate(Variety = "Golden Delicious")
ski <- apples %>% filter(grepl("STARKING", ProductName), tvaluesold >10000) %>% mutate(Variety = "Starking")
bra <- apples %>% filter(grepl("BRAEBURN", ProductName), tvaluesold >10000) %>% mutate(Variety = "Braeburn")
apples <- rbind(grs, cpp, top, gld, ski, bra)
s70 <- apples %>% filter(grepl("70$", ProductName)) %>% mutate(Count = 70)
s80 <- apples %>% filter(grepl("80$", ProductName)) %>% mutate(Count = 80)
s90 <- apples %>% filter(grepl("90$", ProductName)) %>% mutate(Count = 90)
s100 <- apples %>% filter(grepl("100$", ProductName)) %>% mutate(Count = 100)
s110 <- apples %>% filter(grepl("110$", ProductName)) %>% mutate(Count = 110)
s120 <- apples %>% filter(grepl("120$", ProductName)) %>% mutate(Count = 120)
s135 <- apples %>% filter(grepl("135$", ProductName)) %>% mutate(Count = 135)
s150 <- apples %>% filter(grepl("150$", ProductName)) %>% mutate(Count = 150)
s165 <- apples %>% filter(grepl("165$", ProductName)) %>% mutate(Count = 165)
apples <- rbind(s70, s80, s90, s100, s110, s120, s135, s150, s165)
EDIT. Link to the .csv file (https://github.com/fderyckel/showcases/blob/master/JoburgMarket/JoburgApples.csv)
> UnitMass ProductName tvaluesold tquantitysold tkgsold avgprice highestprice date
> 18.50KG CARTON CRIPPS PINK,CL 1,100 200 1 18.5 200 200 06/11/14
> 18.50KG CARTON CRIPPS RED,CL 1,70 200 1 18.5 200 200 06/11/14
> 18.50KG CARTON TOPRED,CL 1,180 1300 10 185 130 130 06/11/14
> 18.50KG CARTON GOLDEN DELICIOUS,CL 1,90 22700 108 1998 210.19 240 06/11/14
> 18.50KG CARTON STARKING,CL 1,80 17920 115 2127.5 155.83 230 06/11/14
> 18.50KG CARTON GRANNY SMITH,CL 1,135 1800 12 222 150 150 06/11/14
> 18.50KG CARTON TOPRED,CL 1,90 1730 12 222 144.17 190 06/11/14
> 18.50KG CARTON CRIPPS PINK,CL 1,90 2600 13 240.5 200 200 06/11/14
> 18.50KG CARTON GOLDEN DELICIOUS,CL 1,120 22800 136 2516 167.65 180 06/11/14
> 18.50KG CARTON GOLDEN DELICIOUS,CL 1,135 21810 136 2516 160.37 180 06/11/14
> 18.50KG CARTON GRANNY SMITH,CL 1,70 2380 14 259 170 220 06/11/14
> 18.50KG CARTON GRANNY SMITH,CL 1,165 1200 15 277.5 80 80 06/11/14
Thanks in advance for your help.
François
Upvotes: 0
Views: 210
Reputation: 887891
Assuming that you want to replace the "prefix" part of "ProductName" with a custom name, you could use mgsub
from qdap
. This will replace the elements in "ProductName" with the modified elements. A logical indx
is created based on the tvaluesold
, then Variety
column is created with NA
values, Change the rows in Variety
that are TRUE
for the indx
with the modified ProductName
. If you want a new dataset, it is easier to subset by !is.na(apples$Variety)
library(qdap)
indx <- apples$tvaluesold>10000
v1 <- c('GRANNY SMITH', 'CRIPPS PINK', 'TOPRED',
'GOLDEN DELICIOUS','STARKING','BRAEBURN')
v2 <- c('Granny Smith', 'Cripps Pink', 'Top Red','Golden Delicious',
'Starking', 'Braeburn')
apples$Variety <- sub(',.*', '', apples$ProductName)
apples[indx, 'Variety'] <- mgsub(v1,v2, apples[indx,'Variety'] )
apples1 <- apples[indx,]
head(apples1,3)
# UnitMass ProductName tvaluesold tquantitysold tkgsold
#4 18.50KG CARTON GOLDEN DELICIOUS,CL 1,90 22700 108 1998.0
#5 18.50KG CARTON STARKING,CL 1,80 17920 115 2127.5
#9 18.50KG CARTON GOLDEN DELICIOUS,CL 1,120 22800 136 2516.0
# avgprice highestprice date Variety
#4 210.19 240 2014-11-06 Golden Delicious
#5 155.83 230 2014-11-06 Starking
#9 167.65 180 2014-11-06 Golden Delicious
Or using only base R
apples$Variety <- unname(setNames(v2,v1)[sub(',.*', '', apples$ProductName)])
apples1 <- apples[indx,]
For the second case, you can extract the numbers after the last ,
using sub
and then use %in%
to create a logical indx2
.
val1 <- as.numeric(sub(".*,", "", apples$ProductName))
indx2 <- val1 %in% c(70,80,90,100,110,120,135,150,165)
apples$Count <- NA
apples[indx2,'Count'] <- val1[indx2]
apples2 <- apples[!is.na(apples$Count),]
head(apples2,3)
# UnitMass ProductName tvaluesold tquantitysold tkgsold
#1 18.50KG CARTON CRIPPS PINK,CL 1,100 200 1 18.5
#2 18.50KG CARTON CRIPPS RED,CL 1,70 200 1 18.5
#4 18.50KG CARTON GOLDEN DELICIOUS,CL 1,90 22700 108 1998.0
# avgprice highestprice date Variety Count
#1 200.00 200 2014-11-06 CRIPPS PINK 100
#2 200.00 200 2014-11-06 CRIPPS RED 70
#4 210.19 240 2014-11-06 Golden Delicious 90
You could also use dplyr
to create the columns
library(dplyr)
apples %>%
filter(tvaluesold >10000) %>%
mutate(Variety= setNames(v2,v1)[sub(',.*', '', ProductName)])
To create Count
column
apples %>%
filter(indx2) %>%
mutate(Count=val1[indx2])
If you want to extract the "first" and "last" of the "ProductName", another option is
library(tidyr)
res1 <- extract(apples, ProductName, c("Variety", "Count"),
'([^,]+),[^,]+,([^,]+)') %>%
filter(tvaluesold >10000L & !is.na(as.numeric(Count))
url <- 'https://raw.githubusercontent.com/fderyckel/showcases/master/JoburgMarket/JoburgApples.csv'
library(RCurl)
x <- getURL(url)
apples <- read.csv(textConnection(x), stringsAsFactors=FALSE)
Upvotes: 5
Reputation: 70336
Maybe all you need is this:
apples %>%
filter(tvaluesold > 10000L & grepl(".*\\d+$", ProductName)) %>%
mutate(Variety = sub(",.*", "", ProductName),
Count = as.numeric(sub(".*,", "", ProductName)))
Upvotes: 4