Orsay

Reputation: 1130

How to remove word breaks and line breaks in a PDF file?

I'm trying to parse a PDF file and I would like to get output without words broken at the end of a line, e.g.:

text.pdf

"hello guys I ne-
ed help"

How can I remove the "-" and the line break so that the two parts of "need" are joined together?

This is my current code:

reader = PDF::Reader.new('text.pdf')
reader.pages.each do |page|
  page.text.each_line do |line|
    # each line is split separately, e.g. ["hello", "guys", "I", "ne-"], then ["ed", "help"]
    words = line.split(" ")
    words.each do |word|
      puts word
    end
  end
end

Upvotes: 1

Views: 815

Answers (1)

Andrey Deineko

Reputation: 52357

You can use String#gsub:

a = "hello guys I ne-
ed help"
#=> "hello guys I ne-\n" + "ed help"
a.gsub(/-|\n/, '-' => '', "\n" => '')
#=> "hello guys I need help"

With your code:

reader = PDF::Reader.new('text.pdf')
reader.pages.each do |page|
  # gsub returns a new string, so print (or collect) the result
  page.text.each_line { |line| print line.gsub(/-|\n/, '-' => '', "\n" => '') }
end

Or, if the dash and the newline always appear together, replace them as a single unit (this leaves other hyphens in the text untouched):

a.gsub(/-\n/, '')
#=> "hello guys I need help"
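
To get the word list the question asks for, one option is to join the broken words on the whole page text first and split into words afterwards. The following is only a minimal sketch: `dehyphenate` is a hypothetical helper name (not part of the pdf-reader gem), and it assumes a hyphen directly before a line break always marks a broken word. It works on a plain string, so you would pass it `page.text`.

```ruby
# Hypothetical helper: joins words split by a hyphen at the end of a line.
# Assumption: a hyphen between two word characters, directly before a line
# break, is a word break and not part of a real compound.
def dehyphenate(text)
  # Match "-" preceded by a word character, then the line break and any
  # leading indentation on the next line, followed by a word character.
  text.gsub(/(?<=\w)-\r?\n[ \t]*(?=\w)/, '')
end

sample = "hello guys I ne-\ned help"
words = dehyphenate(sample).split # split on any whitespace
puts words.inspect
```

Note that a genuine compound that happens to break at the end of a line (for example "well-" followed by "known") would also be joined, so the result may need checking for such cases.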

Upvotes: 1
