Reputation: 587
I have a data frame called "prices" obtained by web scraping. The goal is to track the daily prices of the stocks on the Zimbabwe stock exchange.
Web scraping from the site:
library(rvest)
library(stringr)
library(reshape2)
# Data from African Financials
url <- "https://africanfinancials.com/zimbabwe-stock-exchange-share-prices/"
prices <- url %>%
read_html() %>%
html_table(fill = T)
prices <- prices[[1]]
The prices dataframe:
> prices
Counter PriceRTGS cents Volume ChangeRTGS cents ChangePercent YTDPercent
1 AFDS.zw Afdis 169.75 4 Apr 19 0 0.00 0.00% 10.95%
2 ARIS.zw Ariston 2.90 4 Apr 19 572 -0.03 -1.02% 20.83%
3 ARTD.zw ART Holdings 9.20 4 Apr 19 0 0.00 0.00% 4.55%
I'd like to split the "PriceRTGS cents" column into two columns "Price RTGS Cents" and "Date".
I tried using the below code but it captures the day of the month, 4, in the price column.
str_split_fixed(prices$`PriceRTGS cents`," ", 2)
colsplit(prices$`PriceRTGS cents`," ",c("Price RTGS Cents", "Date"))
I would like for the output to look like this:
Counter Price RTGS Cents Date Volume ChangeRTGS cents ChangePercent YTDPercent
1 AFDS.zw Afdis 169.75 4/04/2019 0 0.00 0.00% 10.95%
2 ARIS.zw Ariston 2.90 4/04/2019 572 -0.03 -1.02% 20.83%
3 ARTD.zw ART Holdings 9.20 4/04/2019 0 0.00 0.00% 4.55%
dput data:
structure(list(Counter = c("AFDS.zw Afdis", "ARIS.zw Ariston",
"ARTD.zw ART Holdings", "ASUN.zw Africansun", "AXIA.zw Axia",
"BAT.zw BAT"), `PriceRTGS cents` = c("169.75 4 Apr 19", "2.90 4 Apr 19",
"9.20 4 Apr 19", "15.00 4 Apr 19", "35.05 4 Apr 19", "3,000.00 4 Apr 19"
), Volume = c("0", "572", "0", "0", "8,557", "0"), `ChangeRTGS cents` = c(0,
-0.03, 0, 0, 0, 0), ChangePercent = c("0.00%", "-1.02%", "0.00%",
"0.00%", "0.00%", "0.00%"), YTDPercent = c("10.95%", "20.83%",
"4.55%", "50.00%", "-22.11%", "-9.09%")), row.names = c(NA, 6L
), class = "data.frame")
Upvotes: 3
Views: 277
Reputation: 5138
Here you go: similar to your str_split_fixed
solution. It also removes the commas from your price variable so it can be coerced into numeric
and formats the date column.
split_string <- str_split(prices$`PriceRTGS cents`, regex("\\s"), 2, simplify = T)
prices$Price <- as.numeric(gsub(",", "", split_string[,1], fixed = T))
prices$Date <- as.Date(split_string[,2], format = "%d %b %y")
head(prices[-2])
Counter Volume ChangeRTGS cents ChangePercent YTDPercent Price Date
1 AFDS.zw Afdis 0 0.00 0.00% 10.95% 169.75 2019-04-04
2 ARIS.zw Ariston 572 -0.03 -1.02% 20.83% 2.90 2019-04-04
3 ARTD.zw ART Holdings 0 0.00 0.00% 4.55% 9.20 2019-04-04
4 ASUN.zw Africansun 0 0.00 0.00% 50.00% 15.00 2019-04-04
5 AXIA.zw Axia 8,557 0.00 0.00% -22.11% 35.05 2019-04-04
6 BAT.zw BAT 0 0.00 0.00% -9.09% 3000.00 2019-04-04
The issue with the fixed solution is that it was not recognizing the space after the price, i.e.:
table(str_count(prices$`PriceRTGS cents`, fixed(" ")))
2
55
But it did using regex for whitespace, i.e.:
table(str_count(prices$`PriceRTGS cents`, regex("\\s")))
3
55
Upvotes: 0
Reputation: 13319
An alternative. Separator(-) and date formats,column names can be changed:
prices$Prices<-stringr::str_extract_all(prices$`PriceRTGS cents`,"\\d{1,}.*\\.\\d{1,}",simplify=T)
prices$Dates<-stringr::str_remove_all(prices$`PriceRTGS cents`,"\\d{1,}.*\\.\\d{1,} ")
prices %>%
select(-`PriceRTGS cents`) %>%
mutate(Dates=lubridate::dmy(Dates))
Result:
Counter Volume ChangeRTGS cents ChangePercent YTDPercent Prices Dates
1 AFDS.zw Afdis 0 0.00 0.00% 10.95% 169.75 2019-04-04
2 ARIS.zw Ariston 572 -0.03 -1.02% 20.83% 2.90 2019-04-04
3 ARTD.zw ART Holdings 0 0.00 0.00% 4.55% 9.20 2019-04-04
4 ASUN.zw Africansun 0 0.00 0.00% 50.00% 15.00 2019-04-04
5 AXIA.zw Axia 8,557 0.00 0.00% -22.11% 35.05 2019-04-04
6 BAT.zw BAT 0 0.00 0.00% -9.09% 3,000.00 2019-04-04
Upvotes: 1
Reputation: 388982
A base R option is to split on empty space and create two parts of the string , first the price part and the remaining part together as date.
t(sapply(strsplit(prices$`PriceRTGS cents`,"\\s+"), function(x)
c(x[1], format(as.Date(paste0(x[-1], collapse = "-"), "%d-%b-%y"), "%d/%m/%Y"))))
# [,1] [,2]
#[1,] "169.75" "04/04/2019"
#[2,] "2.90" "04/04/2019"
#[3,] "9.20" "04/04/2019"
#[4,] "15.00" "04/04/2019"
#[5,] "35.05" "04/04/2019"
#[6,] "3,000.00" "04/04/2019"
You can then cbind
these two columns to the original dataframe.
If you are ok to keep date columns as it is without any formatting, we can let go off and as.Date
and format
and shorten the code.
t(sapply(strsplit(prices$`PriceRTGS cents`,"\\s+"), function(x)
c(x[1], paste0(x[-1], collapse = "-"))))
Upvotes: 0
Reputation: 301
I've just copy and paste your first prices datas into text editor and change the spaces with ";" (I had not seen your data edition yet).
prices <- read.table("dat.txt", sep=";", header=T)
Kind of "quick and dirty" code, but it's working:
str_split_fixed(prices$PriceRTGS.cents," ", 2)
new_prices <- data.frame(prices$Counter, str_split_fixed(prices$PriceRTGS.cents," ", 2), prices$Volume, prices$ChangeRTGS.cents, prices$ChangePercent, prices$YTDPercent)
colnames(new_prices) <- c("Counter", "PriceRTGS_cents", "Date", "Volume", "ChangeRTGS cents", "ChangePercent", "YTDPercent")
new_prices$Date <- gsub("Apr", "04", new_prices$Date)
new_prices$Date <- gsub(" ", "/", new_prices$Date)
new_prices <- data.frame(prices$Counter, new_prices$PriceRTGS_cents, new_prices$Date, prices$Volume, prices$ChangeRTGS.cents, prices$ChangePercent, prices$YTDPercent)
colnames(new_prices) <- c("Counter", "PriceRTGS_cents", "Date", "Volume", "ChangeRTGS cents", "ChangePercent", "YTDPercent")
new_prices
If you have other months than 'Apr', juste add others lines (eg: if "Nov")
new_prices$Date <- gsub("Nov", "10", new_prices$Date)
new_prices$Date <- gsub(" ", "/", new_prices$Date)
Upvotes: 1
Reputation: 2764
you can do somthing this-
library(data.table)
setDT(dt)
dt[,Date:=sub("^\\S+\\s+", "\\1", `PriceRTGS cents`)]
dt[,cents:=sub("^\\s*(\\S+\\S+).*", "\\1", `PriceRTGS cents`)]
Note- Remove original column later from dt
> dt <- subset(dt,select = -c(`PriceRTGS cents`))
> dt
Counter Volume ChangeRTGS cents ChangePercent YTDPercent cents Date
1: AFDS.zw Afdis 0 0.00 0.00% 10.95% 169.75 4 Apr 19
2: ARIS.zw Ariston 572 -0.03 -1.02% 20.83% 2.90 4 Apr 19
3: ARTD.zw ART Holdings 0 0.00 0.00% 4.55% 9.20 4 Apr 19
4: ASUN.zw Africansun 0 0.00 0.00% 50.00% 15.00 4 Apr 19
5: AXIA.zw Axia 8,557 0.00 0.00% -22.11% 35.05 4 Apr 19
6: BAT.zw BAT 0 0.00 0.00% -9.09% 3,000.00 4 Apr 19
EDIT- If you want Date
as you mentioned then do this-
dt[,Date:=format(as.Date(sub("^\\S+\\s+", "\\1", `PriceRTGS cents`),format='%d %b %Y'),"%d/%m/%Y")]
Upvotes: 0