Selrac

Reputation: 2293

R Text file data extraction

I have around 100 text files (a number that is expected to grow) from which I need to extract data.

At the moment the text files come in two specific formats (though these are expected to change in the future):

From:   sender name
Sent:   16 May 2017 15:54
To: receiver date
Subject:    Text

Task: task1
Date: 'APR-17'
Entity: '1234'
Account: '%'
Branch: '%'
CostCenter: '%'
Product: '%'
InterCo: '%'

or

From:   sender name
Sent:   16 May 2017 15:54
To: receiver date
Subject:    Text

Task: task2
Date: APR-17
Entity: ename

What is the best way to extract this data in R and convert it into a structured dataset for analysis?

Is there a specific library or function I could take advantage of? Are there any examples I could get started from?

Upvotes: 0

Views: 1860

Answers (1)

Andrew Gustar

Reputation: 18425

I would do something like this. You might need to modify it depending on your data.

library(stringr) #for splitting and trimming raw data
library(tidyr) #for converting to wide format

#read files into a list of vectors (assuming filenames is a vector of names of your text files)
datalist <- lapply(filenames,readLines)

#convert each element of the list into a data frame
#(seq_along is safer than 1:length if the list is ever empty)
datalist <- lapply(seq_along(datalist),function(i) data.frame(
                          caseno=i, #to identify source of each line
                          rawdata=datalist[[i]],
                          stringsAsFactors = FALSE))

#bind these into a single data frame
df <- do.call(rbind,datalist)

#split the rawdata at the first ':' into type and entry, and trim spaces
df[,c("type","entry")] <- str_trim(str_split_fixed(df$rawdata,":",2))

#drop blank lines (empty type), which would otherwise give spread() duplicate keys
df <- df[df$type != "",]

#convert from 'long' to 'wide' format - the types become column headings
df <- df[,c("caseno","type","entry")]
df <- spread(df,key=type,value=entry)

df should now be a single data frame with one row per file, containing the case number and the value of each entry type as columns. It will probably need a little tidying up afterwards; stringr will be useful for that.
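
As a rough illustration of that tidying step, here is a minimal sketch. The small data frame below is made up to resemble what spread() might return for the two sample files in the question; the variable names char_cols and the choice of columns are assumptions, not part of the original answer. It strips the single quotes that wrap some values and attempts to parse the Sent timestamp.

```r
library(stringr)

# Hypothetical wide data frame, shaped like the result of spread() above
df <- data.frame(
  caseno = 1:2,
  Date   = c("'APR-17'", "APR-17"),
  Entity = c("'1234'", "ename"),
  Sent   = c("16 May 2017 15:54", "16 May 2017 15:54"),
  Task   = c("task1", "task2"),
  stringsAsFactors = FALSE
)

# Remove the single quotes from every character column
char_cols <- vapply(df, is.character, logical(1))
df[char_cols] <- lapply(df[char_cols], str_remove_all, pattern = "'")

# Parse the Sent column into a date-time.
# Assumption: month names match the current locale; otherwise this yields NA.
df$Sent <- as.POSIXct(df$Sent, format = "%d %B %Y %H:%M", tz = "UTC")
```

After this, Date and Entity hold the bare values (for example APR-17 and 1234) in both formats, so rows from the quoted and unquoted file layouts can be compared directly. Whether Entity should then be converted to a number depends on your data, since the second format uses a name there.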

Upvotes: 3
