Reputation: 408
I've been practicing machine learning on the task of restoring spaces in a joined text. Since I decided to use the dictionary feature, I searched the web for some ideas of splitting the text based on the dictionary, and I stumbled upon this idea. Based on it, I've written a script that converts the text without spaces to a vertical form needed by the ML tool:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
from math import log
import string
import fileinput
words = open("dictionary.txt", encoding="utf8").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)
def infer_spaces(s):
# Find the best match for the i first characters, assuming cost has
# been built for the i-1 first characters.
# Returns a pair (match_cost, match_length).
def best_match(i):
candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)
# Build the cost array.
cost = [0]
for i in range(1,len(s)+1):
#original script was pretty basic, so symbols\words not in dictionary
#broke the processing completely. This fixed the problem.
if s[i-1] not in wordcost:
wordcost[s[i-1]] = log((len(words) + 1)*log(len(words)))
c,k = best_match(i)
cost.append(c)
print(cost)
# Backtrack to recover the minimal-cost string.
out = []
i = len(s)
while i>0:
c,k = best_match(i)
assert c == cost[i]
out.append(s[i-k:i])
i -= k
return " ".join(reversed(out))
def char_type(s):
""" Character type function """
if s in string.punctuation:
return "P"
elif s in string.digits:
return "D"
elif s in string.ascii_letters:
return "F"
elif s.isupper():
return "U"
else:
return "R"
def test_to_vert(s):
"""
Transforms regular text into a vertical form.
"""
s = s.rstrip('\n')
orig_sent = s
a = s.lower().replace("ё", "е")
a = infer_spaces(a)
space_indices = []
a = list(a)
for i,k in enumerate(a):
if k == " ":
space_indices.append(i)
orig_sent = list(orig_sent)
for i in space_indices:
orig_sent.insert(i, " ")
orig_sent = "".join(orig_sent)
orig_sent = orig_sent.split(" ")
answer = []
for word in orig_sent:
i = 0
for letter in word:
answer.append(letter + "\t" + letter.lower() + "\t" + \
char_type(letter) + "\t" + str(i) + "|" + str(len(word)))
i += 1
return '\n'.join(answer)
testfile = open("head.txt", encoding="utf8")
output = open("test_python.txt", 'w', newline="\n", encoding="utf8")
for line in testfile:
if line in ['\n', '\r\n']:
output.write('\n')
else:
output.write(test_to_vert(line))
output.write('\n\n')
output.write('\n\n\n')
testfile.close()
output.close()
So far so good, it works. After that I decided to practice my Ruby (I'm relatively new to coding), so I tried to re-write the script (Ruby version):
#!/usr/bin/ruby
#encoding: UTF-8
Encoding::default_internal = "UTF-8"
Encoding::default_external = "UTF-8"
require 'active_support/core_ext'
@wordcost = Hash.new
@count = %x{wc -l dictionary.txt}.split.first.to_i
i = 0
File.readlines("dictionary.txt").each do |line|
line.chomp!
@wordcost[line.mb_chars.downcase.to_s] ||= Math.log((i+1) * Math.log(@count))
i += 1
end
def infer_spaces(s)
@sent = s.chomp
def best_match(i)
result = []
candidates = @cost[0, i].reverse
candidates.each_index do |index|
if @wordcost.has_key?(@sent[i-index-1...i].mb_chars.downcase.to_s)
result << [(candidates[index] + @wordcost[@sent[i-index-1...i].mb_chars.downcase.to_s]), (index + 1)]
else
result << [(candidates[index] + Float::INFINITY), (index + 1)]
end
end
result.sort!
return result[0][0], result[0][1]
end
@cost = [0]
for i in ([email protected])
@wordcost[@sent[i-1].mb_chars.downcase.to_s] ||= Math.log(@count * Math.log(@count))
c, k = best_match(i)
@cost << c
end
out = []
i = @sent.length
while i>0
c, k = best_match(i)
if c != @cost[i]
raise "Something went wrong"
end
out << @sent[i-k...i]
i -= k
end
return out.reverse.join(" ")
end
def char_type(string)
case string
when /[[:punct:]]/
return "P"
when /[[:digit:]]/
return "D"
when /[A-z]/
return "F"
when /[[:upper:]]/
return "U"
else
return "R"
end
end
def test_to_vert(s)
s.chomp!
orig_sent = s
a = s.mb_chars.downcase.to_s
a = infer_spaces(a)
space_indices = []
a = a.split("")
a.each_index do |i|
if a[i] == " "
space_indices << i
end
end
orig_sent = orig_sent.split("")
space_indices.each do |x|
orig_sent.insert(x, " ")
end
orig_sent = orig_sent.join
orig_sent = orig_sent.split
answer = []
orig_sent.each do |word|
letters = word.split("")
letters.each_index do |i|
answer << letters[i] + "\t" + letters[i].mb_chars.downcase.to_s + \
"\t" + char_type(letters[i]) + "\t" + i.to_s + "|" + word.length.to_s
end
end
return answer.join("\n")
end
file = File.open('test_ruby_vert.txt', 'w')
File.readlines("test.txt").each do |line|
if line.chomp.empty?
file.write("\n")
else
file.write(test_to_vert(line))
file.write("\n\n")
end
end
file.close
The rewritten script works, however, it is really slow compared to the Python version (a ~40000-line text is processed in like not more than an hour, a Ruby script worked for hours for now, and it only processed like 15% of the text).
I wonder what could slow it down so much? Could it be that is because of the fact that i need to use "active_support/core_ext" to downcase Cyrillic text in Ruby? Could it be because I don't limit the processing in best_match using maxword? Maybe some other rewrite really messed the script up? Any insight will be really helpful for me.
Upvotes: 1
Views: 223
Reputation: 369458
I didn't take a close look (there's just way too much code in your question to do a detailed examination, you really need to wittle it down to an SSCCE), but a few things jumped out at me.
The most important one is that Language Implementations are designed to make idiomatic, well-factored, well-designed code run fast. Your code, however, looks more like Fortran than Ruby, it is definitely neither idiomatic Ruby nor well-factored.
Some smaller observations:
Here you are needlessly creating lots of string objects:
answer << letters[i] + "\t" + letters[i].mb_chars.downcase.to_s + \
"\t" + char_type(letters[i]) + "\t" + i.to_s + "|" + word.length.to_s
You should prefer mutating a single string using <<
over creating many temporary strings using +
:
answer << ('' << letters[i] << "\t" << letters[i].mb_chars.downcase.to_s <<
"\t" << char_type(letters[i]) << "\t" << i.to_s << "|" << word.length.to_s)
But really, string interpolation is much more idiomatic (and incidentally much faster):
answer << "#{letters[i]}\t#{letters[i].mb_chars.downcase}\t#{char_type(letters[i])}\t#{i}|#{word.length}"
You have a lot of unnecessary return
s in your code. Again, that is non-idiomatic, and also slower. For example here:
def char_type(string)
case string
when /[[:punct:]]/
return "P"
when /[[:digit:]]/
return "D"
when /[A-z]/
return "F"
when /[[:upper:]]/
return "U"
else
return "R"
end
end
This should be written just
def char_type(string)
case string
when /[[:punct:]]/
"P"
when /[[:digit:]]/
"D"
when /[A-z]/
"F"
when /[[:upper:]]/
"U"
else
"R"
end
end
There are other place with unnecessary return
s as well.
Within your infer_spaces
method you define another global method named best_match
. Since infer_spaces
is called by test_to_vert
, which is called inside your readlines
loop, the method will be defined over and over and over again for every line in the file, which means that (since most Ruby implementations nowadays are compiled), it will have to be compiled over and over and over and over again. Each redefinition will also invalidate all previous optimizations such as speculative inlining. Just move the method definition outside of the loop.
IO::readlines
reads the entire file into memory as an array. Then you iterate over the array. You might just as well iterate over the lines of the file directly, using IO::foreach
instead:
File.foreach("test.txt") do |line|
This will avoid loading the entire file into memory at once.
You didn't say which Ruby Implementation you are using. Since you have a fairly hot and tight loop, using an implementation with some sort of hotspot optimizations, polymorphic inline caching, speculative inlining, adaptive optimizations and so on, might make a big difference, especially if you fix the recompilation problem for best_match
. Rubinius and JRuby are good candidates here. Rubinius, for example, has been demonstrated to be faster than hand-optimized C in certain cases!
Note: these are all just micro-optimizations. I didn't actually take a look at your algorithm. You can probably get much more performance by tweaking the algorithm rather than micro-optimize the implementation.
For example: in the Python implementation of best_match
, you use min
to find the minimum element, which is O(n), whereas in Ruby, you sort
and then return the first element, which is O(n * log n).
Upvotes: 3
Reputation: 22
I don't know if this will help you, but a lot of the math packages in Python is implemented in C, and is therefore quite fast.
From http://docs.python.org/2/library/math.html : "The math module consists mostly of thin wrappers around the platform C math library functions"
Maybe the use of the logarithmic function from math is the reason the Python script is so much faster?
Upvotes: 0