Reputation: 342
I am fairly new to pandas/r, and I'm not quite sure how to go about reading this data into pandas
or r
for analysis.
Currently, I was thinking I could use readr's read_chunkwise
, or pandas's chunksize
, but this may not be what I need. Is this really something that is easily solved with a for loop or using purr to iterate over all the elements?
Data:
wine/name: 1981 Château de Beaucastel Châteauneuf-du-Pape
wine/wineId: 18856
wine/variant: Red Rhone Blend
wine/year: 1981
review/points: 96
review/time: 1160179200
review/userId: 1
review/userName: Eric
review/text: Olive, horse sweat, dirty saddle, and smoke. This actually got quite a bit more spicy and expressive with significant aeration. This was a little dry on the palate first but filled out considerably in time, lovely, loaded with tapenade, leather, dry and powerful, very black olive, meaty. This improved considerably the longer it was open. A terrific bottle of 1981, 96+ and improving. This may well be my favorite vintage of Beau except for perhaps the 1990.
wine/name: 1995 Château Pichon-Longueville Baron
wine/wineId: 3495 wine/variant: Red Bordeaux Blend
wine/year: 1995
review/points: 93
review/time: 1063929600
review/userId: 1
review/userName: Eric
review/text: A remarkably floral nose with violet and chambord. On the palate this is super sweet and pure with a long, somewhat searing finish. My notes are very terse, but this was a lovely wine.
Currently, this is what I have as a function, but I am running into an error:
>
convertchunkfile <- function(df){ for(i in 1:length(df)){
> #While the length of any line is not 0, process it with the following loop
> while(nchar(df[[i]]) != 0){
> case_when(
>
> #When data at x index == wine/name, then extract the data after that clause
> #Wine Name parsing
> cleandf$WineName[[i]] <- df[i] == str_sub(df[1],0, 10) ~ str_trim(substr(df[1], 11, nchar(df[1]))),
> #Wine ID parsing
> cleandf$WineID[[i]] <- df[i] == str_sub(df[2],0,11) ~ str_trim(substr(df[2], 13, nchar(df[1])))
> #same format for other attributes
> )
> }
> }
> }
Error in cleandf$BeerName[[i]] <- df[i] == str_sub(df[1], 0, 10) ~ str_trim(substr(df[1], :
more elements supplied than there are to replace
EDIT:
After working through some issues, I think this is probably the best solution, drawing from @hereismyname's solution:
#Use Bash's iconv to force convert the file in OS X
iconv -c -t UTF-8 cellartracker-clean.txt > cellartracker-iconv.txt
#Check number of lines within the file
wc -l cellartracker-iconv.txt
20259950 cellartracker-iconv.txt
#Verify new encoding of the file
file -I cellartracker-clean.txt
ReadEmAndWeep <- function(file, chunk_size) {
f <- function(chunk, pos) {
data_frame(text = chunk) %>%
filter(text != "") %>%
separate(text, c("var", "value"), ":", extra = "merge") %>%
mutate(
chunk_id = rep(1:(nrow(.) / 9), each = 9),
value = trimws(value)
) %>%
spread(var, value)
}
read_lines_chunked(file, DataFrameCallback$new(f), chunk_size = chunk_size)
}
#Final Function call to read in the file
dataframe <- ReadEmAndWeep(file, chunk_size = 100000)
Upvotes: 4
Views: 539
Reputation: 193637
Here's the approach I would suggest:
y <- readLines("your_file")
y <- unlist(strsplit(gsub("(wine\\/|review\\/)", "~~~\\1", y), "~~~", TRUE))
library(data.table)
dcast(fread(paste0(y[y != ""], collapse = "\n"), header = FALSE)[
, rn := cumsum(V1 == "wine/name")], rn ~ V1, value.var = "V2")
The only assumption made is that the first line for each new wine starts with wine/name
. Blank lines and so on don't matter.
Here are two datasets to try this out with.
Substitute "your_file" in the first line of code with either url1
or url2
to try it out.
url1 <- "https://gist.githubusercontent.com/mrdwab/3db1f2d6bf75e9212d9e933ad18d2865/raw/7376ae59b201d57095f849cab079782efb8ac827/wines1.txt"
url2 <- "https://gist.githubusercontent.com/mrdwab/3db1f2d6bf75e9212d9e933ad18d2865/raw/7376ae59b201d57095f849cab079782efb8ac827/wines2.txt"
Note that the second dataset is missing a value for wine/variant:
for the first wine.
It would probably be even better to do the gsub
in awk or something like that and fread
directly on that.
Upvotes: 0
Reputation: 152
Here's a method that's fairly idiomatic in R:
library(readr)
library(tidyr)
library(dplyr)
out <- data_frame(text = read_lines(the_text)) %>%
filter(text != "") %>%
separate(text, c("var", "value"), ":", extra = "merge") %>%
mutate(
chunk_id = rep(1:(nrow(.) / 9), each = 9),
value = trimws(value)
) %>%
spread(var, value)
Upvotes: 2
Reputation: 49832
Here is some code to read those records into a pandas.DataFrame
. These records are structured like a yaml
record so this code uses that fact to its advantage. A blank line is used as a record separator.
import pandas as pd
import collections
import yaml
def read_records(lines):
# keep track of the columns in an ordered set
columns = collections.OrderedDict()
record = []
records = []
for line in lines:
if line:
# gather each line of text until a blank line
record.append(line)
# keep track of the columns seen in an ordered set
columns[line.split(':')[0].strip()] = None
# if the line is empty and we have a record, then convert it
elif record:
# use yaml to convert the lines into a dict
records.append(yaml.load('\n'.join(record)))
record = []
# record last record
if record:
records.append(yaml.load('\n'.join(record)))
# return a pandas dataframe from the list of dicts
return pd.DataFrame(records, columns=list(columns.keys()))
Test Code:
print(read_records(data))
Results:
wine/name wine/wineId \
0 1981 Château de Beaucastel Châteaune... 18856
1 1995 Château Pichon-Longueville Baron 3495
wine/variant wine/year review/points review/time review/userId \
0 Red Rhone Blend 1981 96 1160179200 1
1 Red Bordeaux Blend 1995 93 1063929600 1
review/userName review/text
0 Eric Olive, horse sweat, dirty saddle, and smoke. T...
1 Eric A remarkably floral nose with violet and chamb...
Test data:
data = [x.strip() for x in """
wine/name: 1981 Château de Beaucastel Châteauneuf-du-Pape
wine/wineId: 18856
wine/variant: Red Rhone Blend
wine/year: 1981
review/points: 96
review/time: 1160179200
review/userId: 1
review/userName: Eric
review/text: Olive, horse sweat, dirty saddle, and smoke. This actually got quite a bit more spicy and expressive with significant aeration. This was a little dry on the palate first but filled out considerably in time, lovely, loaded with tapenade, leather, dry and powerful, very black olive, meaty. This improved considerably the longer it was open. A terrific bottle of 1981, 96+ and improving. This may well be my favorite vintage of Beau except for perhaps the 1990.
wine/name: 1995 Château Pichon-Longueville Baron
wine/wineId: 3495
wine/variant: Red Bordeaux Blend
wine/year: 1995
review/points: 93
review/time: 1063929600
review/userId: 1
review/userName: Eric
review/text: A remarkably floral nose with violet and chambord. On the palate this is super sweet and pure with a long, somewhat searing finish. My notes are very terse, but this was a lovely wine.
""".split('\n')[1:-1]]
Upvotes: 1