Reputation: 397
I am trying to read a huge CSV file into R, but I am having trouble: the values in the text columns are not enclosed in quotes, so every line break inside a field starts a new row. My data is delimited by ~.
For example, my data looks something similar to this:
a ~ b ~ c ~ d ~ e
1 ~ name1 ~ This is a paragraph.
This is a second paragraph.
~ num1 ~ num2 ~
2 ~ name2 ~ This is an new set of paragraph.
~ num1 ~ num2 ~
I hope to get something like this:
a | b     | c                                                | d    | e    |
____________________________________________________________________________________
1 | name1 | This is a paragraph. This is a second paragraph. | num1 | num2 |
2 | name2 | This is a new set of paragraph.                  | num1 | num2 |
But I end up with something ugly like this:
a | b | c | d | e |
__________________________________________________________________________________
1 | name1 | This is a paragraph. | | |
This is a second paragraph | | | | |
| num1 | num2
2 | name2 | This is a new set of paragraph. | num1 | num2 |
I tried setting allowEscapes = TRUE in read.csv, but that didn't do the trick. My call currently looks like this:
read.csv(filename, header = TRUE, sep = '~', stringsAsFactors = FALSE, fileEncoding = "latin1", quote = "", strip.white = TRUE)
My next idea is to insert a quotation mark after each ~, but I am hoping there is a better method.
Any help would be appreciated.
Upvotes: 2
Views: 1417
Reputation: 193517
Here is an approach in R that depends on (1) ~ being a true delimiter that doesn't appear in any of your paragraphs and (2) ~ appearing at the end of each record.
But first, some sample data (in a way that others can also reproduce your problem).
cat("a ~ b ~ c ~ d ~ e",
"1 ~ name1 ~ This is a paragraph.",
"",
"This is a second paragraph.",
"",
"~ num1 ~ num2 ~",
"",
"2 ~ name2 ~ This is an new set of paragraph.",
"",
"~ num1 ~ num2 ~", sep = "\n", file = "test.txt")
We'll start with readLines to get the data in. We'll also add a ~ at the end of the header row.
x <- readLines("test.txt")
x[1] <- paste(x[1], "~") ## Add a ~ at the end of the first line
Now, we'll paste everything into a nice long string.
y <- paste(x, collapse = " ")
Use scan to quickly "read" the data again, but instead of using the file argument, we'll use the text argument and refer to the "y" object we just created. Since the last line ends with a ~ there will be an extra "" at the end, which we will remove before proceeding.
z <- scan(text = y, what = character(), sep = "~", strip.white = TRUE)
# Read 16 items
z <- z[-length(z)]
Since we now have a character vector, we can easily convert this to a matrix, and then to a data.frame. We know the colnames are the first 5 values, so we'll drop those when creating the matrix, and reinsert them as the names of the data.frame.
df <- setNames(data.frame(
matrix(z[6:length(z)], ncol = 5, byrow = TRUE)), z[1:5])
df
# a b c d e
# 1 1 name1 This is a paragraph. This is a second paragraph. num1 num2
# 2 2 name2 This is an new set of paragraph. num1 num2
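For readers working outside R, the same join-then-split idea can be sketched in Python. This is only a minimal sketch under the same two assumptions as above (~ never occurs inside a paragraph, and every record ends with ~); the function name parse_tilde_text and its n_cols parameter are made up for illustration.

```python
def parse_tilde_text(text, n_cols):
    """Split ~-delimited text whose fields may contain newlines into rows."""
    lines = text.splitlines()
    # Add a trailing ~ to the header so it ends like every other record.
    lines[0] = lines[0] + " ~"
    # Join everything into one long string, then split on the delimiter.
    fields = [field.strip() for field in " ".join(lines).split("~")]
    # The final ~ leaves one empty item at the end; drop it.
    fields = fields[:-1]
    if len(fields) % n_cols != 0:
        raise ValueError("field count is not a multiple of n_cols")
    # Cut the flat list into rows of n_cols items; the first row is the header.
    rows = [fields[i:i + n_cols] for i in range(0, len(fields), n_cols)]
    return rows[0], rows[1:]


sample = "\n".join([
    "a ~ b ~ c ~ d ~ e",
    "1 ~ name1 ~ This is a paragraph.",
    "",
    "This is a second paragraph.",
    "",
    "~ num1 ~ num2 ~",
    "",
    "2 ~ name2 ~ This is an new set of paragraph.",
    "",
    "~ num1 ~ num2 ~",
])
header, records = parse_tilde_text(sample, n_cols=5)
```

The length check is a cheap way to notice when one of the two assumptions does not hold: a stray ~ inside a paragraph would usually leave a field count that is not a multiple of the column count.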
Upvotes: 2
Reputation: 60060
When I saw this was a text-processing problem, I decided Python would be much easier. Apologies if you aren't familiar with it or don't have access to it:
import csv

all_rows = []
with open('tilded_csv.txt') as in_file:
    header_line = next(in_file)
    header = [field.strip() for field in header_line.split('~')]
    current_record = ''
    for line in in_file:
        # Assume that a number at the start of a line
        # signals a new record
        if line[0].isdigit():
            if current_record:
                all_rows.append(current_record.split('~'))
            current_record = line.strip()
        else:
            # Continuation line: append it to the current record,
            # separated by a space so paragraphs don't run together
            current_record += ' ' + line.strip()
    # Add the last record
    all_rows.append(current_record.split('~'))

# newline='' lets the csv module control line endings (Python 3)
with open('standard_csv.csv', 'w', newline='') as out_file:
    out_csv = csv.writer(out_file, dialect='excel')
    out_csv.writerow(header)
    for row in all_rows:
        out_csv.writerow(row)
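The leading-digit test above would misfire if a paragraph line happened to start with a number. One possible alternative, sketched here under the assumption that ~ never occurs inside a field and that every record ends with ~, is to keep accumulating lines until the record contains as many delimiters as there are columns. The name iter_records is hypothetical.

```python
def iter_records(lines, n_cols):
    """Yield one list of n_cols fields per logical record."""
    buffer = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between paragraphs
        buffer.append(line)
        joined = " ".join(buffer)
        # A complete record holds n_cols delimiters, the last one trailing.
        if joined.count("~") >= n_cols:
            fields = [field.strip() for field in joined.split("~")]
            yield fields[:n_cols]
            buffer = []


data_lines = [
    "1 ~ name1 ~ This is a paragraph.",
    "2 more sentences follow.",
    "~ num1 ~ num2 ~",
    "2 ~ name2 ~ This is an new set of paragraph.",
    "~ num1 ~ num2 ~",
]
rows = list(iter_records(data_lines, n_cols=5))
```

The header line in the sample data has no trailing ~, so it would need to be read separately (as in the code above) before the remaining lines are passed to this function.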
Upvotes: 0
Reputation: 121568
Something like this, for example:
ll = readLines(textConnection('a ~ b ~ c ~ d ~ e
1 ~ name1 ~ This is a paragraph.
This is a second paragraph.
~ num1 ~ num2 ~
2 ~ name2 ~ This is an new set of paragraph.
~ num1 ~ num2 ~'))
## each record begins with a number followed by a space
## I use this pattern to separate the records
## ([0-9]+ also matches ids with more than one digit)
llines <- split(ll[-1],cumsum(grepl('^[0-9]+ ',ll[-1])))
## add the header to the split and concatenated lines
read.table(text=unlist(c(ll[1],lapply(llines,paste,collapse=''))),
sep='~',header=TRUE)
a b c d e
1 name1 This is a paragraph. This is a second paragraph. num1 num2 NA
2 name2 This is an new set of paragraph. num1 num2 NA
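The split/cumsum/grepl step is the core of this answer: each line that opens a record adds 1 to a running total, so all lines belonging to one record share the same group number. A rough Python rendering of that idea, for illustration only (the variable names are invented here, and the pattern assumes every record starts with an integer followed by a space):

```python
import re
from itertools import accumulate, groupby

lines = [
    "1 ~ name1 ~ This is a paragraph.",
    "This is a second paragraph.",
    "~ num1 ~ num2 ~",
    "2 ~ name2 ~ This is an new set of paragraph.",
    "~ num1 ~ num2 ~",
]

# True where a line opens a new record (the grepl step).
starts = [bool(re.match(r"[0-9]+ ", line)) for line in lines]
# Running count of record starts (the cumsum step): [1, 1, 1, 2, 2].
group_ids = list(accumulate(int(flag) for flag in starts))
# Collect the lines that share a group id and join them (split + paste).
records = [
    "".join(line for _, line in group)
    for _, group in groupby(zip(group_ids, lines), key=lambda pair: pair[0])
]
```

Each element of records is then a single-line record that can be split on ~ or handed to a delimited-text reader.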
Upvotes: 3