Reputation: 7414
Consider the following string, a table of contents extracted from a PDF. As the example shows, two topics can appear on one line, and each line ends with a single line break:
A — N° 1 2 janvier 2013
TABLE OF CONTENT
Topic à one ......... 30 Second Topic .......... 33
Third - one ......... 3 Topic.with.dots .......... 33
One more line ......................... 27 last topic ...... 34
I want to extract the section names: 'Topic à one', 'Second Topic', 'Third - one', 'Topic.with.dots', 'One more line' and 'last topic'.
Any insights for a matching regex?
Upvotes: 0
Views: 230
Reputation: 336208
The following (as yet unoptimized) regex works on your example:
(?i)(?=[A-Z])(?:\.[A-Z-]+|[A-Z -]+)+\b
It needs improvement, though, for example if non-ASCII letters (such as the à in your sample) should be matched. There are also some possible performance optimizations that depend on the exact regex flavor being used.
For Ruby 2, I would suggest /(?=\p{L})(?:\.[\p{L}-]++|[\p{L} -]+)+\b/
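To illustrate the point about non-ASCII letters, here is a minimal sketch (not the regex above, just two simplified letter classes) that compares an ASCII-only class with the Unicode property \p{L} on one line from the question. The variable names are made up for the illustration, and the input is assumed to be UTF-8.

```ruby
# One TOC line from the question (assumed to be a UTF-8 string).
line = "Topic à one ......... 30 Second Topic .......... 33"

# ASCII-only letters: a run of letters and spaces is cut off at "à".
ascii_runs = line.scan(/[A-Za-z][A-Za-z ]*[A-Za-z]/)

# \p{L} matches any Unicode letter, so "à" stays inside the run.
unicode_runs = line.scan(/\p{L}[\p{L} ]*\p{L}/)

p ascii_runs    # => ["Topic", "one", "Second Topic"]
p unicode_runs  # => ["Topic à one", "Second Topic"]
```

Both simplified patterns require at least two letters per run and ignore dots and hyphens inside titles, so they only demonstrate the character-class difference, not a complete solution.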
Upvotes: 1
Reputation: 6368
# -*- coding: utf-8 -*-
string = "A — N° 1 2 janvier 2013
TABLE OF CONTENT
Topic à one ......... 30 Second Topic .......... 33
Third - one ......... 3 Topic.with.dots .......... 33
One more line ......................... 27 last topic ...... 34"
puts string.scan(/(\p{l}[\p{l} \.-]*)\s+\.+\s+\d+/i).flatten
This does what you want. It also matches single-letter titles.
Upvotes: 2
Reputation: 1526
Here is a solution in Perl:
$ cat tmp
Topic one ......... 30 Second Topic .......... 33 Third one ......... 3 Topic.with.dots .......... 33 One more line ......................... 27 last topic ...... 34
$ cat tmp | perl -ne 'while (m/((?:\w|[. ])+?) [.]+ \d+/g) { print "$1\n" }'
Topic one
Second Topic
Third one
Topic.with.dots
One more line
last topic
A little explanation of what I am doing here: the inner set of parens (?:...) is non-capturing, so it is only for grouping, and it groups a word character (\w) or a space or dot ([. ]). Then, since more dots follow, the match is non-greedy (+?), and the whole match goes into $1, which is printed.
HTH
--EDIT--
Ruby has almost all the constructs of Perl, including regex, and it is a straightforward conversion! (not sure why it had to be voted down!) FWIW, here it is in Ruby:
while ARGF.gets
  puts $_.scan(/((?:\w|[. ])+?) [.]+ \d+/)
end
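As a rough illustration of why the non-greedy +? matters, the sketch below runs the same pattern with and without the ? on a line holding two entries. The sample line and the variable names are chosen for the illustration, and strip is added here only to drop the leading space that the second capture picks up.

```ruby
# A line with two TOC entries, as in the question's format.
line = "Topic one ......... 30 Second Topic .......... 33"

# Non-greedy +?: each capture stops at the first " ... <digits>" it can reach.
lazy = line.scan(/((?:\w|[. ])+?) [.]+ \d+/).flatten.map(&:strip)

# Greedy +: the capture runs on to the last " ... <digits>" on the line.
greedy = line.scan(/((?:\w|[. ])+) [.]+ \d+/).flatten.map(&:strip)

p lazy    # => ["Topic one", "Second Topic"]
p greedy  # => ["Topic one ......... 30 Second Topic"]
```

Because digits count as \w and dots and spaces are allowed in the group, the greedy version swallows the first entry's dots and page number into the title.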
Upvotes: -1
Reputation: 160551
Similar to @sawa's:
puts text.scan(/([a-zA-Z .]+?) \.\.++ \d+/).flatten.map(&:strip)
# >> Topic one
# >> Second Topic
# >> Third one
# >> Topic.with.dots
# >> One more line
# >> last topic
(I like his pattern better though.)
Upvotes: 1
Reputation: 168121
string.scan(/(\S.*?)\s+\.{2,}\s+\d+/).flatten
# =>
[
"Topic one",
"Second Topic",
"Third one",
"Topic.with.dots",
"One more line",
"last topic"
]
Upvotes: 1