analyticsPierce
analyticsPierce

Reputation: 3025

Ruby CSV.open need to remove quotes and null characters

I am retrieving a large hash of results from a database query and writing them to a csv file. The code block below takes the results and creates the CSV. With the quote_char: option it will replace the quotes with NULL characters which I need to properly create the tab-delimited file.

However, the NULL characters are getting converted into "" when they are loaded into their destination so I would like to remove those. If I leave out quote_char: every field is double quoted which causes the same result.

How can I remove the NULL characters?

begin
    CSV.open("#{file_path}"'file.tab', "wb", Options = {col_sep: "\t", quote_char: "\0"}) do |csv|
        csv << ["Key","channel"]           
        series_1_results.each_hash do |series_1|
         csv << ["#{series_1['key']}","#{series_1['channel']}"]
        end
    end
end

Upvotes: 9

Views: 10096

Answers (4)

Subhas
Subhas

Reputation: 14408

From the Ruby CSV docs, setting force_quotes: false in the options seems to work.

CSV.open("#{file_path}"'file.tab', "wb", { col_sep: "\t", force_quotes: false }) do |csv|

The above does the trick. I'd suggest against setting quote_char to \0 since that doesn't work as expected.

There is one thing to note though. If the field is an empty string "" - it will force the quote_char to be printed into the CSV. But strangely a nil value does not. I'd suggest that if at all you're expecting empty strings in the data, you somehow convert them into nil when writing to the CSV (maybe using the ActiveSupport presence method or anything similar).

Upvotes: 3

Legat
Legat

Reputation: 1449

As it is stated in the csv documentation you have to the set quote_char to some character, and this character will always be used to quote empty fields.

It seems the only solution in this case is to remove used quote_chars from the created csv file. You can do it like this:

quotedFile = File.read("#{file_path}"'file.tab')
unquotedFile = quotedFile.gsub("\0", "")
File.open("#{file_path}"'unquoted_file.tab',"w") { |file| file.puts replace }

I assume here that NULL's are the only escaped fields. If that's not the case use default quote_char: '"' and gsub(',"",', '') which should handle almost all possible cases of fields containing special characters.

But as you note that the results of your query are large it might be more practical to prepare the csv file on your own and avoid processing the outputs twice. You could simply write:

File.open("#{file_path}"'unquoted_file.tab',"w") do |file|
    csv.puts ["Key","channel"]     
    series_1_results.each_hash do |series_1|
        csv.puts ["#{series_1['key']},#{series_1['channel']}"]
    end
end

Once more, you might need to handle fields with special characters.

Upvotes: 4

davogones
davogones

Reputation: 7399

If your input contains any data that needs to be escaped (such as the column separator or the quote character), then you do need to quote your data. Otherwise it cannot be parsed correctly later.

CSV.open('test.csv', 'wb', col_sep: "\t") do |csv|
  csv << ["test", "'test'", '"test"', nil, "test\ttest"]
end

puts open('test.csv').read
#test    'test'  """test"""              "test   test"

The CSV class won't quote anything unnecessarily (as you can see above). So I'm not sure why you're saying all your fields are being quoted. It could be somehow force_quotes is getting set to true somewhere.

If you're absolutely certain your data will never contain \t or ", then the default quote_char (") should work just fine. Otherwise, if you want to avoid quoting anything, you'll need to pick another quote character that you're absolutely certain won't occur in your data.

CSV.open('test.csv', 'wb', col_sep: "\t", quote_char: "|") do |csv|
  csv << ["test", "'test'", nil, '"test"']
end

puts open('test.csv').read
#test    'test'          "test"

Upvotes: 1

the Tin Man
the Tin Man

Reputation: 160571

First, a tab-separated file is "TSV", vs. a comma-separated file which is "CSV".

Wrapping quotes around fields is necessary any time there could be an occurrence of the field delimiter inside a field.

For instance, how are you going to embed this string in a tab-delimited file?

Foo\tbar

The \t is the representation of an embedded Tab.

The same problem occurs when writing a CSV file with a field containing commas. The field has to be wrapped in double-quotes to delimit the field itself.

Upvotes: 1

Related Questions