Reputation: 23
I have a data frame and the first column, Raw Data, reads like this:
Raw Data
USGS 162 1994-10-15 14 A
USGS 162 1994-10-16 49 A
USGS 162 1994-10-17 39 A
......
I'm trying to make a new data frame that now has two columns instead of 1. The first column would have date and the second would have the integer values so it would look like this:
Date Integer
1994-10-15 14
I know you can use strptime() and format() to extract year, month, day but I'm not sure how this works when you have extra numbers and characters in the cell. Thanks.
Upvotes: 2
Views: 937
Reputation: 887611
You could use read.table
.
res <- read.table(text=df$RawData, header=FALSE, sep='',
colClasses=c(NA, NA, 'Date', 'integer'), col.names=c('', '',
'Date', 'Integer', ''))[3:4]
res
# Date Integer
#1 1994-10-15 14
#2 1994-10-16 49
#3 1994-10-17 39
Or using cSplit
from splitstackshape
. The 'Date' column class can be changed to 'Date' afterwards with as.Date
library(splitstackshape)
setnames(cSplit(df, 'RawData', sep=' ', type.convert=TRUE)[,3:4,
with=FALSE], c('Date', 'Integer'))[]
Or
library(tidyr)
extract(df, 'RawData', into= c('Date', 'Integer'),
'\\S*\\s*\\S*\\s*(\\S*)\\s*(\\S*).*', convert=TRUE)
Or
library(data.table)#v1.9.5+
setnames(setDT(df)[, tstrsplit(RawData, ' +',
type.convert=TRUE)[3:4]], c('Date', 'Integer'))[]
NOTE: 'df' from @bgoldst's post
Upvotes: 3
Reputation: 35324
Here's one approach:
df <- data.frame(RawData=c('USGS 162 1994-10-15 14 A','USGS 162 1994-10-16 49 A','USGS 162 1994-10-17 39 A'), stringsAsFactors=F );
df;
## RawData
## 1 USGS 162 1994-10-15 14 A
## 2 USGS 162 1994-10-16 49 A
## 3 USGS 162 1994-10-17 39 A
df2 <- do.call(rbind, lapply(strsplit(df$RawData,'\\s+'), function(x) data.frame(Date=as.Date(x[3]), Integer=as.integer(x[4]) ) ) );
df2;
## Date Integer
## 1 1994-10-15 14
## 2 1994-10-16 49
## 3 1994-10-17 39
Since your dates are already in YYYY-mm-dd format, there's actually no need for strptime()
or format()
or anything; you can coerce the string directly to Date type with as.Date()
.
The only challenge is to extract the relevant pieces of text from the input strings that contain the date and integer values. I made the assumption that the input data is always whitespace-delimited, because it appears that way in your question. My solution splits the input strings into fields on whitespace usting strsplit(df$RawData,'\\s+')
. This results in a list of character vectors. The lapply()
call then iterates over each vector and extracts and coerces the two fields to Date and integer type, combined into a one-row data.frame. Finally, the do.call(rbind, ... )
combines all such one-row data.frames into a single data.frame.
Upvotes: 1
Reputation: 31181
A solution for fun with gsub
(convert the column to integer with as.integer
if needed):
x = unlist(df)
data.frame(
Date = gsub(".*(\\d{4}-\\d{2}-\\d{2}).*","\\1",x),
Integer = gsub(".*(\\d{4}-\\d{2}-\\d{2})[ ]+(\\d+).*","\\2",x)
)
# Date Integer
#RawData1 1994-10-15 14
#RawData2 1994-10-16 49
#RawData3 1994-10-17 39
Data:
df = structure(list(RawData = c("USGS 162 1994-10-15 14 A",
"USGS 162 1994-10-16 49 A", "USGS 162 1994-10-17 39 A"
)), .Names = "RawData", row.names = c(NA, -3L), class = "data.frame")
Upvotes: 1