Reputation: 2293
I have around 100 text files (a number that is expected to grow) that I need to extract data from.
At the moment the text files follow one of two specific formats (though these are expected to change in the future), like:
From: sender name
Sent: 16 May 2017 15:54
To: receiver date
Subject: Text
Task: task1
Date: 'APR-17'
Entity: '1234'
Account: '%'
Branch: '%'
CostCenter: '%'
Product: '%'
InterCo: '%'
or
From: sender name
Sent: 16 May 2017 15:54
To: receiver date
Subject: Text
Task: task2
Date: APR-17
Entity: ename
What is the best way to extract the data in R and convert it into a structured dataset for analysis?
Is there a specific library or function I could take advantage of? Are there any examples I could get started from?
Upvotes: 0
Views: 1860
Reputation: 18425
I would do something like this. You might need to modify it depending on your data.
library(stringr) #for splitting and trimming raw data
library(tidyr) #for converting to wide format
#read files into a list of vectors (assuming filenames is a vector of names of your text files)
datalist <- lapply(filenames,readLines)
#convert each element of the list into a data frame
datalist <- lapply(seq_along(datalist),function(i) data.frame(
caseno=i, #to identify source of each line
rawdata=datalist[[i]],
stringsAsFactors = FALSE))
#bind these into a single data frame
df <- do.call(rbind,datalist)
#split the rawdata at the first ':' into type and entry, and trim spaces
df[,c("type","entry")] <- str_trim(str_split_fixed(df$rawdata,":",2))
#convert from 'long' to 'wide' format - the types become column headings
df <- df[,c("caseno","type","entry")]
df <- spread(df,key=type,value=entry)
df should be a single data frame containing a case number and the values of each entry type as columns. It will probably need a little tidying up afterwards; stringr will be useful for that.
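As a rough illustration of the approach and the tidying step, here is a minimal sketch that runs on two in-memory messages instead of files. The vectors msg1 and msg2 are hypothetical stand-ins for what readLines would return, using a subset of the sample fields from the question. The blank-line filter, the quote stripping, and the date parsing are assumptions about what "tidying up" might involve, not a definitive recipe.

```r
library(stringr)
library(tidyr)

# Hypothetical stand-ins for readLines() output, based on the question's samples
msg1 <- c("From: sender name", "Sent: 16 May 2017 15:54", "Task: task1",
          "Date: 'APR-17'", "Entity: '1234'", "Account: '%'")
msg2 <- c("From: sender name", "Sent: 16 May 2017 15:54", "Task: task2",
          "Date: APR-17", "Entity: ename", "")
datalist <- list(msg1, msg2)

# Long format: one row per line, tagged with the message it came from
long <- do.call(rbind, lapply(seq_along(datalist), function(i) {
  data.frame(caseno = i, rawdata = datalist[[i]], stringsAsFactors = FALSE)
}))

# Keep only lines that look like "key: value" (drops blank lines)
long <- long[str_detect(long$rawdata, "^[^:]+:"), ]

# Split at the first ':' only, so the time in "Sent" stays intact
parts <- str_split_fixed(long$rawdata, ":", 2)
long$type  <- str_trim(parts[, 1])
long$entry <- str_trim(parts[, 2])

# Wide format: one row per message, one column per field
wide <- spread(long[, c("caseno", "type", "entry")], key = type, value = entry)

# Tidy-up: strip a leading and trailing single quote from every value column
value_cols <- setdiff(names(wide), "caseno")
wide[value_cols] <- lapply(wide[value_cols], str_remove_all, pattern = "^'|'$")

# Parse the timestamp; this assumes an English locale for the month name
# and yields NA where the text does not match the format
wide$Sent <- as.POSIXct(wide$Sent, format = "%d %B %Y %H:%M", tz = "UTC")

wide
```

Fields that appear in only one format (Account here) come out as NA for messages that lack them, so new fields in future formats should simply show up as extra columns.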
Upvotes: 3