Reputation: 5594
I'm trying to write a word counter for LyX files.
Life is almost simple, as most lines that need to be ignored begin with a \ (I'm prepared to assume that no textual lines begin with a backslash). However, some lines that look like real text aren't; they are enclosed by \begin_inset and \end_inset:
I'm genuine text.
\begin_inset something
I'm not real text
Perhaps there will be more than one line! Or none at all! Who knows.
\end_inset
\begin_layout
I also need to be counted, and thus not removed
\end_layout
Is there a quick way in Ruby to strip the (smallest amount of) text between two markers? I imagine regular expressions are the way forward, but I can't figure out what they'd have to be.
Thanks in advance
Upvotes: 0
Views: 263
Reputation: 21804
You should look at str.scan(). Assuming your text is in the variable s, something like this should work:
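# Remove every inset; the /m flag lets . match the newlines inside one.
s_strip_inset = s.gsub(/\\begin_inset.*?\\end_inset/m, "")
# Count runs of word characters and hyphens.
word_count = s_strip_inset.scan(/[\w-]+/).size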
Upvotes: 0
Reputation: 8125
gsub could be expensive for longer files (if you're reading the whole file in as a string), so if you have to chunk it anyway, you might want to use a stateful parser:
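in_block = false
File.foreach(fname) do |line|
  if in_block
    # Skip everything up to and including the end marker.
    in_block = false if line =~ /END_MARKER/
    next
  elsif line =~ /BEGIN_MARKER/
    # Start skipping; the marker line itself is not counted.
    in_block = true
    next
  end
  count_words(line)
end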
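For the LyX case in the question, the same state machine might be sketched as below. The method name count_lyx_words and the word pattern are assumptions made for illustration, not part of the original answer. It also drops any line starting with a backslash, as the question proposes, and assumes insets are not nested.

```ruby
# Hypothetical helper: counts words in LyX-style lines, skipping inset blocks
# and every line that starts with a backslash (the question's assumption).
# Assumes insets are not nested.
def count_lyx_words(lines)
  in_inset = false
  total = 0
  lines.each do |line|
    if in_inset
      # Still inside an inset: skip until the closing marker has been seen.
      in_inset = false if line.start_with?('\end_inset')
      next
    end
    if line.start_with?('\begin_inset')
      in_inset = true
      next
    end
    next if line.start_with?('\\') # other LyX commands, e.g. \begin_layout
    total += line.scan(/[\w'-]+/).size
  end
  total
end

sample = <<~'LYX'
  I'm genuine text.
  \begin_inset something
  I'm not real text
  \end_inset
  \begin_layout
  I also need to be counted, and thus not removed
  \end_layout
LYX

puts count_lyx_words(sample.each_line) # prints 13
```

Because the method only needs something that responds to each, it should also accept File.foreach(fname), so the file never has to be held in memory as a single string.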
Upvotes: 1
Reputation: 370465
Is there a quick way in ruby to strip the (smallest amount of) text between two markers?
str = "lala BEGIN_MARKER \nlu\nlu\n END_MARKER foo BEGIN_MARKER bar END_MARKER baz"
str.gsub(/BEGIN_MARKER.*?END_MARKER/m, "")
#=> "lala  foo  baz"
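Applied to the LyX markers from the question, the same non-greedy pattern could be combined with a filter for backslash lines. This is only a sketch: the name lyx_word_count and the word pattern are assumptions, and because the match stops at the first \end_inset it will miscount if insets are nested.

```ruby
# Hypothetical helper: strips \begin_inset ... \end_inset blocks with a
# non-greedy multiline regex, then counts words on the remaining lines
# that do not start with a backslash.
def lyx_word_count(text)
  # ^ anchors at the start of a line in Ruby; /m lets . match newlines.
  without_insets = text.gsub(/^\\begin_inset.*?^\\end_inset[^\n]*\n?/m, '')
  without_insets.each_line
                .reject { |line| line.start_with?('\\') }
                .sum { |line| line.scan(/[\w'-]+/).size }
end

lyx_sample = <<~'LYX'
  I'm genuine text.
  \begin_inset something
  I'm not real text
  Perhaps there will be more than one line! Or none at all! Who knows.
  \end_inset
  \begin_layout
  I also need to be counted, and thus not removed
  \end_layout
LYX

puts lyx_word_count(lyx_sample) # prints 13
```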
Upvotes: 3