Reputation: 37
I have a very large data set that for illustrative purposes looks something like the following.
Cust_ID , Sales_Assistant , Store
123 , Mary, Worthington, 22
456 , Jack, Charles , 42
The real data has many more columns and millions of rows. I'm using the following code to import it into R but it is falling over because one or more of the columns has a comma in the data (see Sales_Assistant above).
df <- read.csv("C:/dataextract.csv", header = TRUE , as.is = TRUE , sep = "," , na.strings = "NA" , quote = "" , fill = TRUE , dec = "." , allowEscapes = FALSE , row.names=NULL)
Adding row.names=NULL imported all the data but it split the Sales_Assistant column over two columns and threw all the other data out of alignment. If I run the code without this I get an error...
Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed
...and the data won't load.
Can you think of a way around this that doesn't involve tackling the data at source, or opening it in a text editor? Is there a solution in R?
Upvotes: 0
Views: 1853
Reputation: 11
First and foremost, this is a CSV file, so the comma in "Mary, Worthington" is treated as a field separator and the value is read as two columns. If your values contain commas, consider saving the data as TSV (tab-separated values) instead.
However, if your data has the same number of commas in every row, so that the fields still line up, I would ignore the first row (the column names as read from the file) of the data frame and assign proper column names myself.
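If re-exporting the data is an option, a minimal sketch of the TSV route might look like the following. The temporary file and its contents are a hypothetical stand-in for the real extract; only the two sample rows from the question are used.

```r
# Hypothetical sample: the question's two rows, saved with tabs between fields
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "Cust_ID\tSales_Assistant\tStore",
  "123\tMary, Worthington\t22",
  "456\tJack, Charles\t42"
), tsv_path)

# read.delim() defaults to sep = "\t", so commas inside a value are left alone
df_tsv <- read.delim(tsv_path, header = TRUE, quote = "",
                     stringsAsFactors = FALSE)
str(df_tsv)
```

With a tab as the separator, Sales_Assistant stays in a single column and no realignment is needed afterwards.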
For instance, in your case you can replace Sales_Assistant by
Sales_Assistant_First_Name, Sales_Assistant_Last_Name
which makes perfect sense. Then you could simply do
# drop the first row only if it holds the header text (e.g. the file was read with header = FALSE)
df <- df[-1, ]
colnames(df) <- c("Cust_ID" , "Sales_Assistant_First_Name" , "Sales_Assistant_Last_Name", "Store")
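The renaming idea above can also be done in one pass at read time. This is only a sketch under the assumption that every data row really has four fields; the temporary file is a hypothetical copy of the sample from the question, and count.fields() is used to check that assumption before loading.

```r
# Hypothetical stand-in for the real extract, using the question's sample rows
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Cust_ID , Sales_Assistant , Store",
  "123 , Mary, Worthington, 22",
  "456 , Jack, Charles , 42"
), csv_path)

# Precondition: every data row has the same number of comma-separated fields
n_fields <- count.fields(csv_path, sep = ",", quote = "", skip = 1)
stopifnot(length(unique(n_fields)) == 1)

# Skip the three-name header line and supply four names instead
df_demo <- read.csv(
  csv_path,
  skip = 1, header = FALSE,
  col.names = c("Cust_ID", "Sales_Assistant_First_Name",
                "Sales_Assistant_Last_Name", "Store"),
  quote = "", strip.white = TRUE, stringsAsFactors = FALSE
)
str(df_demo)
```

Because the header line is skipped rather than read, no data row has to be dropped afterwards. If the count.fields() check fails, the rows do not line up and this approach should not be used as is.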
Upvotes: 1
Reputation: 1169
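# Read the data rows without the header; the extra comma yields four columns (V1..V4)
df <- read.csv("C:/dataextract.csv", skip = 1, header = FALSE, stringsAsFactors = FALSE)
# Read the header line separately to recover the three real column names
df_cnames <- read.csv("C:/dataextract.csv", nrows = 1, header = FALSE, stringsAsFactors = FALSE)
# Rejoin the two halves of the split Sales_Assistant value
df <- within(df, V2V3 <- paste(V2, V3, sep = ""))
df <- subset(df, select = c("V1", "V2V3", "V4"))
# Flatten the one-row data frame to a character vector and trim stray spaces
colnames(df) <- trimws(unlist(df_cnames))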
This may need some modification depending on the actual source data.
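One such modification, for the case where the number of commas inside the name varies from row to row, is sketched below. It is not the code from the answer above but a different technique: read the raw lines, split them on commas, and glue the middle fields back together. The helper merge_middle_fields() and the third data row are hypothetical, and the sketch assumes that only one column can contain commas and that the number of clean columns before and after it is known.

```r
# Hypothetical helper: keep the first n_before and last n_after fields of each
# line and paste everything in between back into a single value.
# Assumes every line has at least n_before + n_after + 1 fields and no
# trailing empty field (strsplit() drops a trailing empty string).
merge_middle_fields <- function(raw_lines, n_before = 1, n_after = 1) {
  parts <- strsplit(raw_lines, ",", fixed = TRUE)
  rows <- lapply(parts, function(p) {
    p <- trimws(p)
    middle <- p[seq(n_before + 1, length(p) - n_after)]
    c(head(p, n_before), paste(middle, collapse = ", "), tail(p, n_after))
  })
  as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
}

# For the real file this would be readLines("C:/dataextract.csv")
raw_lines <- c(
  "Cust_ID , Sales_Assistant , Store",
  "123 , Mary, Worthington, 22",
  "456 , Jack, Charles , 42",
  "789 , Ann , 7"   # made-up row with no comma in the name
)

df_fixed <- merge_middle_fields(raw_lines[-1])
colnames(df_fixed) <- trimws(strsplit(raw_lines[1], ",", fixed = TRUE)[[1]])
# Everything was read as text; convert numeric-looking columns back
df_fixed[] <- lapply(df_fixed, type.convert, as.is = TRUE)
str(df_fixed)
```

Splitting millions of lines this way is likely to be slower than read.csv, so it is probably worth trying on a subset of the file first.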
Upvotes: 0