Reputation: 2853
I am new to R and am trying to learn basic data I/O and preprocessing. I have a text file in the format given below. It is a non-standard format (unlike CSV, JSON, etc.). I need to convert the following structure into a table-like format (more precisely, a data frame like the one obtained from reading a CSV file).
Input
product/productId: B000H13270
review/userId: A3J6I70Z9Q0HRX
review/profileName: Lindey H. Magee
review/helpfulness: 1/3
review/score: 5.0
review/time: 1261785600
review/summary: it's fabulous, but *not* from amazon!
review/text: the price on this product certainly raises my attention on compairing amazon price with the local stores. i can get a can of this rotel at my local kroger for $1. dissapointing!
product/productId: B000H13270
review/userId: A1YLOZQKBX3J1S
review/profileName: R. Lee Dailey "Lee_Dailey"
review/helpfulness: 1/4
review/score: 3.0
review/time: 1221177600
review/summary: too expensive
review/text: howdy y'all,<br /><br />the actual product is VERY good - i'd rate the item a 4 on it's own. however, it's only ONE dollar at the local grocery and - @ twenty eight+ dollars per twelve pack - these are running almost two and a half dollars each.<br /><br />as i said, TOO EXPENSIVE. [*sigh ...*] i was really hoping to get them at something approaching the local cost.<br /><br />take care,<br />lee
Output
product/productId | review/userId  | ... | review/text
B000H13270        | A3J6I70Z9Q0HRX | ... | the price on this .... dissapointing!
B000H13270        | A1YLOZQKBX3J1S | ... | howdy y'all,<br /> ..... lee
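In Python, I could have done the same in the following manner:
from collections import defaultdict

revDict = defaultdict(list)  # field name -> list of values, one per review
with open('filename') as dataFile:
    for item in dataFile:
        # split on the first ': ' only, since the review text may contain colons
        key, sep, value = item.rstrip('\n').partition(': ')
        if sep:  # skip blank or malformed lines
            revDict[key].append(value)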
How can something similar be achieved in R? Are there any equivalents in R?
Upvotes: 3
Views: 357
Reputation: 335
Here is a 'poor man's' method.
I assume that all blocks of data have the same fields, that no fields are missing, and that : is used only as the separator.
You have 8 fields; in the example I use 3 and simplify their names.
fields <- 3
# you can use file="example.txt" instead of text=...
data <- read.table(text="
prod: foo 1
rev1: bar 11
rev2: bar 12
prod: foo 2
rev1: bar 21
rev2: bar 22
", sep=":", strip.white=TRUE, stringsAsFactors=FALSE)
rows <- dim(data)[1]/fields
mdata <- matrix(data$V2, nrow=rows, ncol=fields, byrow=TRUE)
colnames(mdata) <- data$V1[1:fields]
as.data.frame(mdata)
Result:
   prod   rev1   rev2
1 foo 1 bar 11 bar 12
2 foo 2 bar 21 bar 22
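If the assumption that : appears only as the separator does not hold (the review text in the question can contain colons), one option is to split each line at the first ": " yourself instead of relying on sep=":". The following is a minimal base R sketch of that idea, not a tested solution for the full file; the sample lines are made up for illustration, and it still assumes every block has the same fields in the same order.

```r
# Hypothetical sample lines; in practice use lines <- readLines("example.txt")
lines <- c(
  "prod: foo: 1",
  "rev1: bar 11",
  "rev2: bar 12",
  "prod: foo 2",
  "rev1: bar 21",
  "rev2: see http://example.com"
)
fields <- 3

# regexpr() reports only the first match per line,
# so any later colons stay inside the value
pos <- regexpr(": ", lines, fixed = TRUE)
key <- substr(lines, 1, pos - 1)
val <- substr(lines, pos + 2, nchar(lines))

# fill the matrix row by row: one row per block, one column per field
mdata <- matrix(val, ncol = fields, byrow = TRUE)
colnames(mdata) <- key[seq_len(fields)]
df <- as.data.frame(mdata, stringsAsFactors = FALSE)
df
```

Blank lines between blocks would need to be dropped first, for example with lines <- lines[nzchar(lines)].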
Upvotes: 1
Reputation: 34291
Here's a quick and dirty way that splits on colons (all colons except the first on each line are removed from the text after it is read in; the file itself is not changed), then reshapes the data from long to wide:
mytxt <- readLines("mytext.txt")
mytable <- read.table(text=gsub("^([^:]*:)|:", "\\1", mytxt), sep = ":", quote = "")
mytable$id <- rep(1:(nrow(mytable)/8), each = 8)
res <- reshape(mytable, direction = "wide", timevar = "V1", idvar = "id")
Which gives:
id V2.product/productId V2.review/userId V2.review/profileName V2.review/helpfulness V2.review/score V2.review/time V2.review/summary V2.review/text
1 1 B000H13270 A3J6I70Z9Q0HRX Lindey H. Magee 1/3 5.0 1261785600 it's fabulous, but *not* from amazon! the price on this product certainly raises my attention on compairing amazon price with the local stores. i can get a can of this rotel at my local kroger for $1. dissapointing!
9 2 B000H13270 A1YLOZQKBX3J1S R. Lee Dailey "Lee_Dailey" 1/4 3.0 1221177600 too expensive howdy y'all,<br /><br />the actual product is VERY good - i'd rate the item a 4 on it's own. however, it's only ONE dollar at the local grocery and - @ twenty eight+ dollars per twelve pack - these are running almost two and a half dollars each.<br /><br />as i said, TOO EXPENSIVE. [*sigh ...*] i was really hoping to get them at something approaching the local cost.<br /><br />take care,<br />lee
This assumes that each case consists of exactly 8 lines, with no blank lines in between.
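If some cases have fewer than 8 lines, the id could instead be derived from where each record starts. The sketch below is one possible variation under the assumption that every record begins with a product/productId line; the sample lines and the column names key and val are made up for illustration. It also keeps colons inside the values rather than removing them.

```r
# Hypothetical sample: the second record has no review/score line
lines <- c(
  "product/productId: A1",
  "review/score: 5.0",
  "review/summary: great",
  "product/productId: B2",
  "review/summary: ok: but pricey"
)

# split each line at the first ": " only
pos <- regexpr(": ", lines, fixed = TRUE)
long <- data.frame(
  key = substr(lines, 1, pos - 1),
  val = substr(lines, pos + 2, nchar(lines)),
  stringsAsFactors = FALSE
)

# a new record starts at every product/productId line
long$id <- cumsum(long$key == "product/productId")

# long to wide; fields missing from a record become NA
wide <- reshape(long, direction = "wide", timevar = "key", idvar = "id")
wide
```

The resulting columns are named val.product/productId, val.review/score and so on, following the same prefix convention as the V2. columns above.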
Upvotes: 1
Reputation: 24945
There are a lot of ways of doing this. Here's how I would do it, using readLines, tidyr and dplyr:
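library(dplyr)
library(tidyr)
z <- readLines("mytxt.txt")
z <- data.frame(z = z, stringsAsFactors = FALSE) %>%
  # split at the first ": " only; extra = "merge" keeps any later ": " in the value
  separate(z, into = c("datatype", "val"), sep = ": ", extra = "merge") %>%
  # every product/productId line starts a new review
  mutate(rep = cumsum(datatype == "product/productId")) %>%
  na.omit() %>%  # drops blank lines, which have no value part
  spread(datatype, val)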
You'll get an output in a dataframe like:
rep product/productId review/helpfulness review/profileName review/score
1 1 B000H13270 1/3 Lindey H. Magee 5.0
2 2 B000H13270 1/4 R. Lee Dailey "Lee_Dailey" 3.0
review/summary
1 it's fabulous, but *not* from amazon!
2 too expensive
review/text
1 the price on this product certainly raises my attention on compairing amazon price with the local stores. i can get a can of this rotel at my local kroger for $1. dissapointing!
2 howdy y'all,<br /><br />the actual product is VERY good - i'd rate the item a 4 on it's own. however, it's only ONE dollar at the local grocery and - @ twenty eight+ dollars per twelve pack - these are running almost two and a half dollars each.<br /><br />as i said, TOO EXPENSIVE. [*sigh ...*] i was really hoping to get them at something approaching the local cost.<br /><br />take care,<br />lee
review/time review/userId
1 1261785600 A3J6I70Z9Q0HRX
2 1221177600 A1YLOZQKBX3J1S
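In more recent versions of tidyr (1.0.0 and later), spread() is superseded by pivot_wider(). As a rough sketch of the same pipeline in that style, assuming such a version is installed and using made-up sample lines in place of the file:

```r
library(dplyr)
library(tidyr)

# Hypothetical sample; in practice use z <- readLines("mytxt.txt")
z <- c(
  "product/productId: B000H13270",
  "review/score: 5.0",
  "review/summary: fabulous: but pricey",
  "",
  "product/productId: B000H13270",
  "review/score: 3.0",
  "review/summary: too expensive"
)

res <- tibble(z = z) %>%
  filter(z != "") %>%  # drop blank separator lines
  # split at the first ": " only, keeping any later ": " in the value
  separate(z, into = c("datatype", "val"), sep = ": ", extra = "merge") %>%
  # every product/productId line starts a new review
  mutate(rep = cumsum(datatype == "product/productId")) %>%
  pivot_wider(names_from = datatype, values_from = val)
res
```

Unlike spread(), pivot_wider() keeps the columns in the order in which the field names first appear rather than sorting them alphabetically.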
Upvotes: 1