Sara
Sara

Reputation: 23

How to extract information from a data frame column to make a new data frame

I have a data frame and the first column, Raw Data, reads like this:

Raw Data
USGS    162        1994-10-15      14      A
USGS    162        1994-10-16      49      A
USGS    162        1994-10-17      39      A
......

I'm trying to make a new data frame that now has two columns instead of 1. The first column would have date and the second would have the integer values so it would look like this:

Date        Integer
1994-10-15  14

I know you can use strptime() and format() to extract year, month, day but I'm not sure how this works when you have extra numbers and characters in the cell. Thanks.

Upvotes: 2

Views: 937

Answers (3)

akrun
akrun

Reputation: 887611

You could use read.table.

 res <- read.table(text=df$RawData, header=FALSE, sep='', 
   colClasses=c(NA, NA, 'Date', 'integer'), col.names=c('', '', 
            'Date', 'Integer', ''))[3:4]
 res
 #        Date Integer
 #1 1994-10-15      14
 #2 1994-10-16      49
 #3 1994-10-17      39

Or using cSplit from splitstackshape. The 'Date' column class can be changed to 'Date' afterwards with as.Date

 library(splitstackshape)
 setnames(cSplit(df, 'RawData', sep=' ', type.convert=TRUE)[,3:4,
              with=FALSE], c('Date', 'Integer'))[]

Or

 library(tidyr)
 extract(df, 'RawData', into= c('Date', 'Integer'), 
         '\\S*\\s*\\S*\\s*(\\S*)\\s*(\\S*).*', convert=TRUE)

Or

 library(data.table)#v1.9.5+
 setnames(setDT(df)[, tstrsplit(RawData, ' +', 
            type.convert=TRUE)[3:4]], c('Date', 'Integer'))[]

NOTE: 'df' from @bgoldst's post

Upvotes: 3

bgoldst
bgoldst

Reputation: 35324

Here's one approach:

df <- data.frame(RawData=c('USGS    162        1994-10-15      14      A','USGS    162        1994-10-16      49      A','USGS    162        1994-10-17      39      A'), stringsAsFactors=F );
df;
##                                        RawData
## 1 USGS    162        1994-10-15      14      A
## 2 USGS    162        1994-10-16      49      A
## 3 USGS    162        1994-10-17      39      A
df2 <- do.call(rbind, lapply(strsplit(df$RawData,'\\s+'), function(x) data.frame(Date=as.Date(x[3]), Integer=as.integer(x[4]) ) ) );
df2;
##         Date Integer
## 1 1994-10-15      14
## 2 1994-10-16      49
## 3 1994-10-17      39

Since your dates are already in YYYY-mm-dd format, there's actually no need for strptime() or format() or anything; you can coerce the string directly to Date type with as.Date().

The only challenge is to extract the relevant pieces of text from the input strings that contain the date and integer values. I made the assumption that the input data is always whitespace-delimited, because it appears that way in your question. My solution splits the input strings into fields on whitespace usting strsplit(df$RawData,'\\s+'). This results in a list of character vectors. The lapply() call then iterates over each vector and extracts and coerces the two fields to Date and integer type, combined into a one-row data.frame. Finally, the do.call(rbind, ... ) combines all such one-row data.frames into a single data.frame.

Upvotes: 1

Colonel Beauvel
Colonel Beauvel

Reputation: 31181

A solution for fun with gsub (convert the column to integer with as.integer if needed):

x = unlist(df)

data.frame(
   Date = gsub(".*(\\d{4}-\\d{2}-\\d{2}).*","\\1",x),
   Integer = gsub(".*(\\d{4}-\\d{2}-\\d{2})[ ]+(\\d+).*","\\2",x)
)

#               Date Integer
#RawData1 1994-10-15      14
#RawData2 1994-10-16      49
#RawData3 1994-10-17      39

Data:

df = structure(list(RawData = c("USGS    162        1994-10-15      14      A", 
"USGS    162        1994-10-16      49      A", "USGS    162        1994-10-17      39      A"
)), .Names = "RawData", row.names = c(NA, -3L), class = "data.frame")

Upvotes: 1

Related Questions