gm007
gm007

Reputation: 587

Split string column into 2 columns, one numeric, and the other a date

I have a data frame called "prices" obtained by web scraping. The goal is to track the daily prices of the stocks on the Zimbabwe stock exchange.

Web scraping from the site:

library(rvest)
library(stringr)
library(reshape2)
# Data from African Financials
url <- "https://africanfinancials.com/zimbabwe-stock-exchange-share-prices/"
prices <- url %>%
  read_html() %>%
  html_table(fill = T)
prices <- prices[[1]]

The prices dataframe:

> prices

                   Counter   PriceRTGS cents  Volume ChangeRTGS cents ChangePercent YTDPercent
1            AFDS.zw Afdis   169.75 4 Apr 19       0             0.00         0.00%     10.95%
2          ARIS.zw Ariston     2.90 4 Apr 19     572            -0.03        -1.02%     20.83%
3     ARTD.zw ART Holdings     9.20 4 Apr 19       0             0.00         0.00%      4.55%

I'd like to split the "PriceRTGS cents" column into two columns "Price RTGS Cents" and "Date".

I tried using the below code but it captures the day of the month, 4, in the price column.

str_split_fixed(prices$`PriceRTGS cents`," ", 2)
colsplit(prices$`PriceRTGS cents`," ",c("Price RTGS Cents", "Date"))

I would like for the output to look like this:

                   Counter   Price RTGS Cents              Date         Volume ChangeRTGS cents ChangePercent YTDPercent
1            AFDS.zw Afdis             169.75         4/04/2019              0             0.00         0.00%     10.95%
2          ARIS.zw Ariston               2.90         4/04/2019            572            -0.03        -1.02%     20.83%
3     ARTD.zw ART Holdings               9.20         4/04/2019              0             0.00         0.00%      4.55%

dput data:

structure(list(Counter = c("AFDS.zw Afdis", "ARIS.zw Ariston", 
"ARTD.zw ART Holdings", "ASUN.zw Africansun", "AXIA.zw Axia", 
"BAT.zw BAT"), `PriceRTGS cents` = c("169.75 4 Apr 19", "2.90 4 Apr 19", 
"9.20 4 Apr 19", "15.00 4 Apr 19", "35.05 4 Apr 19", "3,000.00 4 Apr 19"
), Volume = c("0", "572", "0", "0", "8,557", "0"), `ChangeRTGS cents` = c(0, 
-0.03, 0, 0, 0, 0), ChangePercent = c("0.00%", "-1.02%", "0.00%", 
"0.00%", "0.00%", "0.00%"), YTDPercent = c("10.95%", "20.83%", 
"4.55%", "50.00%", "-22.11%", "-9.09%")), row.names = c(NA, 6L
), class = "data.frame")

Upvotes: 3

Views: 277

Answers (5)

Andrew
Andrew

Reputation: 5138

Here you go: similar to your str_split_fixed solution. It also removes the commas from your price variable so it can be coerced into numeric and formats the date column.

split_string <- str_split(prices$`PriceRTGS cents`, regex("\\s"), 2, simplify = T)

prices$Price <- as.numeric(gsub(",", "", split_string[,1], fixed = T))
prices$Date <- as.Date(split_string[,2], format = "%d %b %y")

head(prices[-2])
               Counter Volume ChangeRTGS cents ChangePercent YTDPercent   Price       Date
1        AFDS.zw Afdis      0             0.00         0.00%     10.95%  169.75 2019-04-04
2      ARIS.zw Ariston    572            -0.03        -1.02%     20.83%    2.90 2019-04-04
3 ARTD.zw ART Holdings      0             0.00         0.00%      4.55%    9.20 2019-04-04
4   ASUN.zw Africansun      0             0.00         0.00%     50.00%   15.00 2019-04-04
5         AXIA.zw Axia  8,557             0.00         0.00%    -22.11%   35.05 2019-04-04
6           BAT.zw BAT      0             0.00         0.00%     -9.09% 3000.00 2019-04-04

The issue with the fixed solution is that it was not recognizing the space after the price, i.e.:

table(str_count(prices$`PriceRTGS cents`, fixed(" ")))

 2 
55 

But it did using regex for whitespace, i.e.:

table(str_count(prices$`PriceRTGS cents`, regex("\\s")))

 3 
55 

Upvotes: 0

NelsonGon
NelsonGon

Reputation: 13319

An alternative. Separator(-) and date formats,column names can be changed:

prices$Prices<-stringr::str_extract_all(prices$`PriceRTGS cents`,"\\d{1,}.*\\.\\d{1,}",simplify=T)

prices$Dates<-stringr::str_remove_all(prices$`PriceRTGS cents`,"\\d{1,}.*\\.\\d{1,} ")
prices %>% 
  select(-`PriceRTGS cents`) %>% 
  mutate(Dates=lubridate::dmy(Dates))

Result:

               Counter Volume ChangeRTGS cents ChangePercent YTDPercent   Prices      Dates
1        AFDS.zw Afdis      0             0.00         0.00%     10.95%   169.75 2019-04-04
2      ARIS.zw Ariston    572            -0.03        -1.02%     20.83%     2.90 2019-04-04
3 ARTD.zw ART Holdings      0             0.00         0.00%      4.55%     9.20 2019-04-04
4   ASUN.zw Africansun      0             0.00         0.00%     50.00%    15.00 2019-04-04
5         AXIA.zw Axia  8,557             0.00         0.00%    -22.11%    35.05 2019-04-04
6           BAT.zw BAT      0             0.00         0.00%     -9.09% 3,000.00 2019-04-04

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388982

A base R option is to split on empty space and create two parts of the string , first the price part and the remaining part together as date.

t(sapply(strsplit(prices$`PriceRTGS cents`,"\\s+"), function(x) 
  c(x[1], format(as.Date(paste0(x[-1], collapse = "-"), "%d-%b-%y"), "%d/%m/%Y"))))

#         [,1]           [,2]        
#[1,] "169.75"   "04/04/2019"
#[2,] "2.90"     "04/04/2019"
#[3,] "9.20"     "04/04/2019"
#[4,] "15.00"    "04/04/2019"
#[5,] "35.05"    "04/04/2019"
#[6,] "3,000.00" "04/04/2019"

You can then cbind these two columns to the original dataframe.

If you are ok to keep date columns as it is without any formatting, we can let go off and as.Date and format and shorten the code.

t(sapply(strsplit(prices$`PriceRTGS cents`,"\\s+"), function(x) 
             c(x[1], paste0(x[-1], collapse = "-"))))

Upvotes: 0

Pierre-louis Stenger
Pierre-louis Stenger

Reputation: 301

I've just copy and paste your first prices datas into text editor and change the spaces with ";" (I had not seen your data edition yet).

prices <- read.table("dat.txt", sep=";", header=T)

Kind of "quick and dirty" code, but it's working:

str_split_fixed(prices$PriceRTGS.cents," ", 2)
new_prices <- data.frame(prices$Counter, str_split_fixed(prices$PriceRTGS.cents," ", 2), prices$Volume, prices$ChangeRTGS.cents, prices$ChangePercent, prices$YTDPercent)
colnames(new_prices) <- c("Counter", "PriceRTGS_cents", "Date",  "Volume", "ChangeRTGS cents", "ChangePercent",  "YTDPercent")
new_prices$Date <- gsub("Apr", "04", new_prices$Date)
new_prices$Date <- gsub(" ", "/", new_prices$Date)
new_prices <- data.frame(prices$Counter, new_prices$PriceRTGS_cents, new_prices$Date, prices$Volume, prices$ChangeRTGS.cents, prices$ChangePercent, prices$YTDPercent)
colnames(new_prices) <- c("Counter", "PriceRTGS_cents", "Date",  "Volume", "ChangeRTGS cents", "ChangePercent",  "YTDPercent")
new_prices

If you have other months than 'Apr', juste add others lines (eg: if "Nov")

new_prices$Date <- gsub("Nov", "10", new_prices$Date)
new_prices$Date <- gsub(" ", "/", new_prices$Date)

Upvotes: 1

Rushabh Patel
Rushabh Patel

Reputation: 2764

you can do somthing this-

library(data.table)
setDT(dt)
dt[,Date:=sub("^\\S+\\s+", "\\1", `PriceRTGS cents`)]
dt[,cents:=sub("^\\s*(\\S+\\S+).*", "\\1", `PriceRTGS cents`)]

Note- Remove original column later from dt

> dt <- subset(dt,select = -c(`PriceRTGS cents`))
> dt
                Counter Volume ChangeRTGS cents ChangePercent YTDPercent    cents     Date
1:        AFDS.zw Afdis      0             0.00         0.00%     10.95%   169.75 4 Apr 19
2:      ARIS.zw Ariston    572            -0.03        -1.02%     20.83%     2.90 4 Apr 19
3: ARTD.zw ART Holdings      0             0.00         0.00%      4.55%     9.20 4 Apr 19
4:   ASUN.zw Africansun      0             0.00         0.00%     50.00%    15.00 4 Apr 19
5:         AXIA.zw Axia  8,557             0.00         0.00%    -22.11%    35.05 4 Apr 19
6:           BAT.zw BAT      0             0.00         0.00%     -9.09% 3,000.00 4 Apr 19

EDIT- If you want Date as you mentioned then do this-

dt[,Date:=format(as.Date(sub("^\\S+\\s+", "\\1", `PriceRTGS cents`),format='%d %b %Y'),"%d/%m/%Y")]

Upvotes: 0

Related Questions