Parsing CSV file. Separating into more files

Question

I am currently working on a small application that will allow someone to enter strings that'll separate a larger file into smaller files.

I am having trouble getting the file to be split into the newer files. This is my code for my CSV method:

require 'csv'

new_list = []
old_list = []

def import_csv(file)
  puts "Method begins"
  CSV.foreach("public_schools_list.csv", :encoding => 'windows-1251:utf-8', :headers => true) do |row|
    if row["company"].downcase.include?("academy" || "lutheran" || "jewish" || "private" || "christian")
      CSV.open("new_list.csv", "ab") do |n|
        n << row

        puts "First Row"

        new_list << n
      end
    else
      CSV.open("old_list.csv", "ab") do |o|
       o << row

       puts "Second Row"

       old_list << o
      end
    end
  end
end

puts "New Csv: #{new_list.count}"
puts "Old Csv: #{old_list.count}"

I am just trying to check this code to see if it is splitting the files. I'm not sure if some of this is correct. There are currently only four items in the CSV list. I use the count method to check if they go into the correct files.

What code am I missing to get the main file to be separated into two?

here is my controller:

include CSVUpload

def import
  csv_separate(params[:file].tempfile)
  redirect_to root_url
end

then here is the module i am using:

require 'csv'
module CSVUpload

    def csv_separate(file)
        new_list_counter = 0
        old_list_counter = 0

        # puts "Method begins"
        CSV.open("new_list.csv", "ab") do |new_list|
        CSV.open("old_list.csv", "ab") do |old_list|  
            CSV.foreach(file, :encoding => 'windows-1251:utf-8', :headers => true) do |row|
              if row["company"][/\b(?:academy|lutheran|jewish|private|christian)\b/i]
                new_list << row

                new_list_counter += 1
              else
                old_list << row

                old_list_counter += 1
              end
            end
          end
        end
      end
    end

and then here is the form:


  Import a CSV File
   <%= form_tag({action: "import"}, multipart: true) do %>
   <%= file_field_tag("file") %>
   <%= submit_tag "Import CSV" %>
 <% end %>

i hope that helps out a bit. Thanks!

the Tin Man · Accepted Answer

Don't open then close the output files for every line you read. That's horribly inefficient and wastes CPU time. Open them outside the CSV.foreach loop, then write to them conditionally.

Also, don't aggregate the lines of the file in memory just so you can count them. Instead, increment a counter.

Also, include? doesn't work this way:

include?("academy" || "lutheran" || "jewish" || "private" || "christian")

The documentation says:

  str.include? other_str   -> true or false

------------------------------------------------------------------------------

Returns true if str contains the given string orcharacter.

  "hello".include? "lo"   #=> true
  "hello".include? "ol"   #=> false
  "hello".include? ?h     #=> true

Notice that it takes a single character or string. Using an or'd list of strings results in only the first string:

"academy" || "lutheran" || "jewish" || "private" || "christian" # => "academy"

As a result, only "academy" would ever be tested for inclusion.

This is untested but looks right enough:

require 'csv'


def import_csv(file)
  new_list_counter = 0
  old_list_counter = 0

  puts "Method begins"

  CSV.open("new_list.csv", "ab") do |new_list|
    CSV.open("old_list.csv", "ab") do |old_list|

      CSV.foreach("public_schools_list.csv", :encoding => 'windows-1251:utf-8', :headers => true) do |row|

        if row["company"][/\b(?:academy|lutheran|jewish|private|christian)\b/i]
          new_list << row

          puts "First Row"

          new_list_counter += 1

        else
          old_list << row

          puts "Second Row"

          old_list_counter += 1
        end

      end
    end

  end

  puts "New CSV: #{ new_list_counter }"
  puts "Old CSV: #{ old_list_counter }"

end

Notice /\b(?:academy|lutheran|jewish|private|christian)\b/i. This tests for multiple sub-strings that are complete words. \b in regex-ese means "word-boundary", so each of the embedded words or'd together using | are tested separately. /i means it's case-insensitive. To accomplish the same thing using include? would involve looping over a list of words, which is one of those algorithms that grows slower and slower as you add more words to test.

row["company"][/\b(?:academy|lutheran|jewish|private|christian)\b/i] is extremely efficient in comparison to that possible loop-algorithm. Using regex patterns aren't a universal panacea. Used incorrectly regex patterns can actually slow your program, sometimes drastically. Used correctly though, they can reduce your code and, in the right situation, when written correctly, can provide dramatic speed improvements. So, check using benchmarks before you blindly introduce them into code.

Parsing CSV file. Separating into more files

Answers (1)

Related Questions