Reputation: 985
I need to read a CSV file which has broken rows for some reason. There are about 60,000 rows, and some of them are just broken off from the previous row. I would like to find out how I can read the table and convert it into a proper data frame with the broken rows repaired.
I am reading the file this way:
All_transactions <- read.csv(paste("/Users/Match/Data/MenuReport/", "04-01-new_file.csv", sep = ""), skip = 6, sep = ",")
I am skipping the first 6 rows which contain random text.
Product,Date,Quantity,Categorie,sector
ABC, 01052019, 4510, Food, Dry
CDE, 01052019, 222, Drink
, Cold
FGH, 01052019, 345, Food, Dry
IJK, 01052019, 234, Food
, Cold
I did notice that the wrong rows seem to start with a comma.
I would like to be able to clean them up this way:
Product,Date,Quantity,Categorie,sector
ABC, 01052019, 4510, Food, Dry
CDE, 01052019, 222, Drink, Cold
FGH, 01052019, 345, Food, Dry
IJK, 01052019, 234, Food, Cold
Then put them in a dataframe.
Upvotes: 2
Views: 1049
Reputation: 313
The other solutions are probably better, but you could also use a monstrous piece of function code like this (this relies heavily on the rest of your data following your sample data pattern):
library(readr)
df <- read_csv(file = "YOUR_FILE", skip = 6)
df
process_df <- function(x) {
  for (row in 1:nrow(x)) {
    # A row with exactly one NA is missing its last field...
    if (sum(is.na(x[row, ])) == 1) {
      # ...and the following fragment row should hold exactly one non-NA value.
      if (rowSums(!is.na(x[row + 1, ])) == 1) {
        x[row, which(is.na(x[row, ]))] <- x[row + 1, which(!is.na(x[row + 1, ]))]
      }
    }
  }
  # Drop the fragment rows (anything with at most one non-NA value).
  x <- x[rowSums(!is.na(x)) > 1, ]
  return(x)
}
process_df(df)
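To see it in action without the original file, here is the function run on a small tibble that mimics how read_csv would parse the sample data above (the column types here are an assumption on my part):
library(tibble)
# Hand-built stand-in for the parsed broken CSV (types are assumed).
demo <- tribble(
  ~Product, ~Date,      ~Quantity, ~Categorie, ~sector,
  "ABC",    "01052019",      4510, "Food",     "Dry",
  "CDE",    "01052019",       222, "Drink",    NA,
  NA,       "Cold",            NA, NA,         NA,
  "FGH",    "01052019",       345, "Food",     "Dry",
  "IJK",    "01052019",       234, "Food",     NA,
  NA,       "Cold",            NA, NA,         NA
)
process_df(demo)
# The two fragment rows are dropped and "Cold" ends up in the sector column.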
Upvotes: 1
Reputation: 161
Simple solution using base R: read the file with readLines(), drop the first 6 lines, paste each broken line (the ones starting with a comma) back onto the line before it, then parse the result:
dat <- readLines("your_file")
dat <- dat[7:length(dat)]            # drop the first 6 lines of random text
broken <- grepl("^,", dat)           # the broken fragments start with a comma
dat[which(broken) - 1] <- paste0(dat[which(broken) - 1], dat[broken])
csv_dat <- read.csv(text = dat[!broken], strip.white = TRUE)
Upvotes: 1
Reputation:
The easiest way would be to read the contents of the CSV in as a single character string using readr's read_file(), then replace the pattern newline + comma with a comma:
library(readr)
# Read in the broken CSV as a single character string.
file_string <- read_file("broken_csv.csv")
# Replace the pattern "\n," with ",", then read the string as CSV.
df <- read_csv(gsub("\\n,", ",", file_string), skip = 6)
df
#### OUTPUT ####
# A tibble: 4 x 5
  Product Date     Quantity Categorie sector
  <chr>   <chr>       <dbl> <chr>     <chr>
1 ABC     01052019     4510 Food      Dry
2 CDE     01052019      222 Drink     Cold
3 FGH     01052019      345 Food      Dry
4 IJK     01052019      234 Food      Cold
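To see what the substitution does, here it is applied to one broken record from the sample (illustrative only):
gsub("\\n,", ",", "CDE, 01052019, 222, Drink\n, Cold")
# [1] "CDE, 01052019, 222, Drink, Cold"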
Upvotes: 3
Reputation: 6441
There are probably several ways to do this.
UPDATE: Try this then. With the skip= argument in scan() you can specify how many rows to skip; for your file that should be skip = 7 (the 6 lines of random text plus the header row).
file <- scan("C:/Users/skupfer/Documents/bisher.txt", strip.white = TRUE, sep = ",",
             what = list("character"), skip = 1)
# Drop the empty tokens produced by the leading commas, then rebuild the 5 columns row by row.
file_mat <- matrix(file[[1]][file[[1]] != ""], ncol = 5, byrow = TRUE)
file_df <- as.data.frame(file_mat, stringsAsFactors = FALSE)
names(file_df) <- c("Product", "Date", "Quantity", "Categorie", "sector")
file_df$Quantity <- as.integer(file_df$Quantity)
> file_df
  Product     Date Quantity Categorie sector
1     ABC 01052019     4510      Food    Dry
2     CDE 01052019      222     Drink   Cold
3     FGH 01052019      345      Food    Dry
4     IJK 01052019      234      Food   Cold
Upvotes: 2