Reputation: 2333
PROBLEM: Ruby 2.0 CSV reader on Mac Mavericks treats Microsoft Excel generated CSV files that have embedded HTML differently. Works fine on Ruby 1.8 with FasterCSV.
I just upgraded my Mac to Mavericks (OS X 10.9.4) and also upgraded Ruby to 2.0.0p451. I used to use Ruby 1.8+ with the FasterCSV gem, but now use Ruby 2.0+ with its native CSV library.
Ruby Version:
ruby -v
ruby 2.0.0p451 (2014-02-24 revision 45167) [universal.x86_64-darwin13]
The CSV file is generated from Office 2011, saved from an original ".xlsx" file.
The following HTML is contained in a single cell of the Microsoft .xlsx file BEFORE it is saved as CSV...
<h1 style="text-align:center; font: bold 1.5em Arial;">This is the Title</h1>
<p style="text-align:center;"><img style="width:300px; height:100px" src="./IMAGES/MAIN/image1.png" alt="Image 1"/></p>
<p style="text-align:center;">This is a sentence.</p>
Other cells also have HTML code embedded.
To reproduce...
You will now have an Excel-generated CSV file with two formal CSV rows, each of which contains multiple HTML constructs separated by line feeds.
Reading the CSV File...
I use the following code to read the CSV file and print the contents of each row, both before and after I try to strip control characters...
require 'csv'
arrayOfHtmlConstructs = CSV.read("file.csv")
arrayOfHtmlConstructs.each_with_index do | construct, i|
output = "" << construct.to_s
puts "BEFORE: " << output
output = output.gsub(/\r/, "") # Replace Microsoft carriage returns FAILS!
output = output.gsub(/\\"/, "\"") # Replace escaped quotes with quotes WORKS FINE!
output = output.gsub(/\[\"/, "") # Remove prefix [" WORKS FINE!
output = output.gsub(/\"\]/, "") # Remove suffix "] WORKS FINE!
puts "AFTER: " << output
end
Before trying to strip code, the CSV string "output" looks as follows...
BEFORE: ["<h1 style=\"text-align:center; font: bold 1.5em Arial;\">This is the Title</h1>\r<p style=\"text-align:center;\"><img style=\"width:300px; height:100px\" src=\"./IMAGES/MAIN/image1.png\" alt=\"Image 1\"/></p>\r<p style=\"text-align:center;\">This is a sentence.</p>"]
You'll notice that it includes [" at the beginning and "] at the end, along with escaped quotes and embedded carriage returns (\r).
PROBLEM: All of the gsub statements work except for the one that tries to replace all carriage returns with blanks.
After running the Ruby script, the string "output" looks as follows, where everything gets substituted properly, except for the carriage returns...
AFTER: <h1 style="text-align:center; font: bold 1.5em Arial;">This is the Title</h1>\r<p style="text-align:center;"><img style="width:300px; height:100px" src="./IMAGES/MAIN/image1.png" alt="Image 1"/></p>\r<p style="text-align:center;">This is a sentence.</p>
For some reason, the carriage returns are NOT being replaced/substituted.
Also, before I upgraded to Ruby 2.0, I used to use FasterCSV and none of the substitution statements were needed. Everything just worked.
Any thoughts as to why this is all happening and how to properly handle it? Any assistance is greatly appreciated.
Upvotes: 1
Views: 325
Reputation: 15967
The scope of my answer has changed so I've edited down to just the RegEx as that seems to be more on topic.
I've updated my expression to cover all of your substitutions. Replace your loop with this block of code:
arrayOfHtmlConstructs.each_with_index do | construct, i|
output = "" << construct.to_s
puts "BEFORE: " << output
output = output.gsub(/\\"/, "\"") # Replace escaped quotes with quotes WORKS FINE!
output = output.gsub(/\\r|\["|"\]/, "") # Remove the literal two-character sequence \r (backslash + r), plus the [" prefix and "] suffix
puts "AFTER: " << output
end
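For what it's worth, the likely reason gsub(/\r/, "") appears to fail is that construct is an Array (one CSV row), and since Ruby 1.9 Array#to_s returns the inspect form. In that form a real carriage return is rendered as the two characters backslash and r, which /\r/ does not match. In Ruby 1.8, Array#to_s simply joined the elements, which would explain why nothing needed stripping with FasterCSV. Below is a minimal sketch that cleans each cell directly instead of the stringified row. The sample CSV text and the explicit row_sep are assumptions for illustration, not your actual file:

```ruby
require 'csv'

# Hypothetical sample standing in for the Excel export: a single quoted cell
# whose HTML fragments are separated by a bare carriage return.
csv_text = %Q{"<h1>This is the Title</h1>\r<p>This is a sentence.</p>"\n}

# row_sep is given explicitly here (an assumption) so the embedded \r is not
# mistaken for the row terminator during auto-detection.
rows = CSV.parse(csv_text, row_sep: "\n")

# Each cell is already a plain String, so there are no [" "] wrappers or
# escaped quotes to undo; only the real carriage returns need removing.
cleaned = rows.map do |row|
  row.map { |cell| cell.to_s.gsub(/\r\n?/, "") }
end

# For comparison: stringifying the whole row yields the inspect form, where
# the carriage return shows up as a literal backslash followed by "r".
inspected = rows.first.to_s
```

With this approach the other three gsub calls should become unnecessary, since they only undo what Array#to_s added.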
Upvotes: 2
Reputation: 13862
Try this:
@csv = CSV.read(params[:file].path, headers: true, skip_blanks: true, encoding:'windows-1256:utf-8')
You need to tell CSV which encoding Microsoft used when it wrote the file, so that it can be converted to UTF-8 on read.
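Note that windows-1256 is the Arabic code page, so the right source encoding depends on how Excel saved the file. Here is a small sketch of the same call, assuming a Windows-1252 file instead; the temporary file and its contents are made up for the example:

```ruby
require 'csv'
require 'tempfile'

# Hypothetical file: a header row and one data row written as Windows-1252
# bytes. 0xE9 is "é" in Windows-1252 and is not valid UTF-8 on its own.
file = Tempfile.new(['sample', '.csv'])
file.binmode
file.write("name\r\ncaf\xE9\r\n".b)
file.close

# 'windows-1252:utf-8' means: read as Windows-1252, transcode to UTF-8.
table = CSV.read(file.path, headers: true, skip_blanks: true,
                 encoding: 'windows-1252:utf-8')

value = table[0]['name']
```

As far as I can tell, the encoding option only affects how bytes are turned into characters. It does not remove carriage returns embedded inside a cell, so those still need to be stripped separately.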
Upvotes: 1