denisjacquemin

Reputation: 7414

Extract data from one big string with regex

Consider the following string, a table of contents extracted from a PDF. As the example shows, two topics can appear on one line, and each line ends with a single line break.

A — N° 1 2 janvier 2013

TABLE OF CONTENT

Topic à one ......... 30 Second Topic .......... 33
Third - one ......... 3 Topic.with.dots .......... 33
One more line ......................... 27 last topic ...... 34

I want to extract the section names: 'Topic à one', 'Second Topic', 'Third - one', 'Topic.with.dots', 'One more line' and 'last topic'.

Any insights for a matching regex?

Upvotes: 0

Views: 230

Answers (5)

Tim Pietzcker

Reputation: 336208

The following (as yet unoptimized) regex works on your example:

(?i)(?=[A-Z])(?:\.[A-Z-]+|[A-Z -]+)+\b

It needs improvements, though, for example if non-ASCII letters should be matched, and there are some possible performance optimizations that depend on the exact regex flavor being used.

See it on regex101.

For Ruby 2, I would suggest /(?=\p{L})(?:\.[\p{L}-]++|[\p{L} -]+)+\b/
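As a rough illustration of the non-ASCII point, the sketch below runs both patterns from this answer against one line of the sample. The variable names are made up for the example, and only the first match on the line is inspected.

```ruby
# One line from the question's sample, containing the non-ASCII letter "à".
line = "Topic à one ......... 30"

# The two patterns quoted in this answer (names are illustrative only).
ascii_only    = /(?i)(?=[A-Z])(?:\.[A-Z-]+|[A-Z -]+)+\b/
unicode_aware = /(?=\p{L})(?:\.[\p{L}-]++|[\p{L} -]+)+\b/

# String#[] with a regex returns the first match, or nil if there is none.
ascii_title   = line[ascii_only]
unicode_title = line[unicode_aware]

p ascii_title    # stops before "à", because [A-Z] cannot match it
p unicode_title  # => "Topic à one"
```

Note that on the full sample both patterns would also pick up the header words, since nothing in them requires the trailing dots and page number.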

Upvotes: 1

Chris Wesseling

Reputation: 6368

# -*- coding: utf-8 -*-
string = "A — N° 1 2 janvier 2013

TABLE OF CONTENT

Topic à one ......... 30 Second Topic .......... 33
Third - one ......... 3 Topic.with.dots .......... 33
One more line ......................... 27 last topic ...... 34"
puts string.scan(/(\p{l}[\p{l} \.-]*)\s+\.+\s+\d+/i).flatten

This does what you want. It also matches single-letter titles.

Upvotes: 2

Ani

Reputation: 1526

Here is a solution in Perl:

 $ cat tmp
 Topic one ......... 30 Second Topic .......... 33 Third one ......... 3   Topic.with.dots ..........   33 One more line ......................... 27 last topic ...... 34


$ cat tmp  | perl -ne 'while (m/((?:\w|[. ])+?) [.]+ \d+/g) { print "$1\n" }' 
Topic one
Second Topic
Third one
 Topic.with.dots
One more line
last topic

A little explanation of what I am doing here: the inner parentheses (?:...) are non-capturing, so they only group. They match either a word character (\w) or a space or dot ([. ]). Because the title is followed by more dots, the quantifier is non-greedy (+?), so the match stops at the first run of leader dots. The outer parentheses capture the title into $1, which is printed.

HTH

--EDIT--

Ruby has almost all of Perl's constructs, including regexes, so the conversion is straightforward (not sure why this had to be voted down). FWIW, here it is in Ruby:

while ARGF.gets
  puts $_.scan(/((?:\w|[. ])+?) [.]+ \d+/)
end
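To see why the non-greedy `+?` matters, here is a small sketch comparing it with a greedy variant on one line of the input. The greedy pattern is not part of the answer and is added only for contrast. The `strip` call is also an addition: the character class allows spaces, so later captures can begin with one.

```ruby
# One line of the input, holding two topics.
line = "Topic one ......... 30 Second Topic .......... 33"

# Pattern from the answer (lazy) and a greedy variant for comparison only.
lazy   = /((?:\w|[. ])+?) [.]+ \d+/
greedy = /((?:\w|[. ])+) [.]+ \d+/

# scan returns one array per match; flatten and strip tidy the captures.
lazy_titles   = line.scan(lazy).flatten.map(&:strip)
greedy_titles = line.scan(greedy).flatten.map(&:strip)

p lazy_titles    # => ["Topic one", "Second Topic"]
p greedy_titles  # => ["Topic one ......... 30 Second Topic"]
```

The greedy version swallows the first entry's dots and page number because `\w` also matches digits, so only the last entry on the line is split off correctly.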

Upvotes: -1

the Tin Man

Reputation: 160551

Similar to @sawa's:

puts text.scan(/([a-zA-Z .]+?) \.\.++ \d+/).flatten.map(&:strip)
# >> Topic one
# >> Second Topic
# >> Third one
# >> Topic.with.dots
# >> One more line
# >> last topic

(I like his pattern better though.)

Upvotes: 1

sawa

Reputation: 168121

string.scan(/(\S.*?)\s+\.{2,}\s+\d+/).flatten
# =>
[
  "Topic one",
  "Second Topic",
  "Third one",
  "Topic.with.dots",
  "One more line",
  "last topic"
]
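For reference, a self-contained sketch of the same pattern applied to the sample from the question, including the accented title and the hyphenated one. The heredoc and the variable names are assumptions made for the example, and the first header line of the sample is omitted for brevity.

```ruby
# Sample table of contents, adapted from the question.
toc = <<~TEXT
  TABLE OF CONTENT

  Topic à one ......... 30 Second Topic .......... 33
  Third - one ......... 3 Topic.with.dots .......... 33
  One more line ......................... 27 last topic ...... 34
TEXT

# (\S.*?)    title: starts at a non-space character, then expands lazily
# \s+\.{2,}  whitespace, then a run of at least two leader dots
# \s+\d+     whitespace, then the page number
titles = toc.scan(/(\S.*?)\s+\.{2,}\s+\d+/).flatten

puts titles
```

This should print the six titles one per line. The header is skipped because `.` does not cross line breaks by default and no run of dots follows it.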

Upvotes: 1
