Marco Mei

Reputation: 137

Extract sentences using Python regex

I have a Markdown file like the one below:

#2016-12-24
| Word | Explanation | Example sentence |
| --------- | -------- | --------- |
|**accelerator;**| - | - |
|**compass**| - | - |
|**wheels**| - | - |
|**fabulous**| - | - |
|**sweeping**| - | - |
|**prospect**| - | - |
|**pumpkin**| - | - |
|**trolley**| - | - |
|snapped,**| - | - |
|tip| - | - |
|lap| - | - |
|tether.| - | - |
|damp| - | - |
|triumphant| - | - |
|sarcastic| - | - |
|missed out| - | - |
|sidekick| - | - |
|considerable| - | - |
|Willow.| - | - |
|eagle.| - | - |
|considerably.| - | - |
|flat.| - | - |
|feast| - | - |
|scramble| - | - |
|turned up| - | - |
|rounded off| - | - |
|rat| - | - |
|resembled| - | - |
|By the time she had clambered back into the car,| - | - |
|By the time she had clambered back into the car, they were running very late,| - | - |
|wheeled his trolley| - | - |
|barrier,| - | - |
|bounced| - | - |
|in blazes| - | - |
|clutching| - | - |
|sealed| - | - |
|stunned.| - | - |
|‘We’re stuck,| - | - |
|marched off| - | - |
|accelerator| - | - |
|and the prospect of seeing Fred and George’s jealous faces| - | - |
|protest.| - | - |
|in protest.| - | - |
|horizon,| - | - |
|knuckles| - | - |
|metal| - | - |
|thick| - | - |
|reached the end of its tether.| - | - |
|Artefacts| - | - |
|blurted out.| - | - |
|gaped| - | - |
|I will be writing to both your families tonight.| - | - |
|‘Can you believe our luck, though?’| - | - |
|‘Skip the lecture,’| - | - |
|people’ll be talking about that one for years!’| - | - |
|nudged| - | - |
|‘I know I shouldn’t’ve enjoyed that or anything, but –’| - | - |
|dashed| - | - |

I'd like to extract only the sentences, like these:

  1. By the time she had clambered back into the car,
  2. By the time she had clambered back into the car, they were running very late,
  3. wheeled his trolley
  4. ‘We’re stuck,
  5. and the prospect of seeing Fred and George’s jealous faces
  6. reached the end of its tether.
  7. I will be writing to both your families tonight.
  8. ‘Can you believe our luck, though?’
  9. ‘Skip the lecture,’
  10. people’ll be talking about that one for years!’
  11. ‘I know I shouldn’t’ve enjoyed that or anything, but –’

I tried this on the regex101 website, but every time my pattern matches all of the rows.

Can anyone help me, please?

Upvotes: 2

Views: 326

Answers (2)

Mustofa Rizwan

Reputation: 10476

Try this:

^\|[^\w\|]*(\w+\s+(?=\w+)[^\|]*)

Explanation

  1. ^\| matches only if the line starts with a pipe (|); with the multiline flag, ^ matches at the start of every line
  2. [^\w\|]* skips any characters that are neither word characters (letters, digits, underscore) nor |
  3. \w+\s+ matches a word followed by one or more whitespace characters
  4. (?=\w+) is a lookahead that checks that another word follows
  5. [^\|]* then grabs everything up to the next pipe (|)

For each match, group 1 contains the sentence you want.
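As a rough illustration of how this pattern could be used from Python, here is a minimal sketch. The names `PATTERN`, `SAMPLE`, and `extract_phrases` are hypothetical, and the sample is just a handful of rows copied from the question, not the full file.

```python
import re

# Pattern from this answer; re.MULTILINE makes ^ match at the start of every line.
PATTERN = re.compile(r"^\|[^\w\|]*(\w+\s+(?=\w+)[^\|]*)", re.MULTILINE)

# Hypothetical sample: a few rows taken from the table in the question.
SAMPLE = """\
|**compass**| - | - |
|tip| - | - |
|tether.| - | - |
|wheeled his trolley| - | - |
|‘Skip the lecture,’| - | - |
|I will be writing to both your families tonight.| - | - |
"""

def extract_phrases(text):
    """Return group 1 for every row whose first cell matches the pattern."""
    # With exactly one capturing group, findall returns the group 1 strings.
    return PATTERN.findall(text)

for phrase in extract_phrases(SAMPLE):
    print(phrase)
```

On this sample the single-word rows are skipped and three phrases are printed. Two details are worth knowing: group 1 starts at the first word character, so a leading ‘ is dropped (`Skip the lecture,’`), and a row such as `‘We’re stuck,` is not matched at all, because the first word is followed by an apostrophe rather than whitespace.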


Upvotes: 1

Jan

Reputation: 43199

You could come up with:

^\|                     # start of line, followed by |
(                       # capture the "words"
    (?:[‘\w]+           # a non-capturing group and at least one of \w or ‘
        (?:[^|\w\n\r]+  # followed by NOT one of these
        |               # or
        (?=\|)          # make sure, there's a | straight ahead
    )
){2,})                  # repeat the construct at least 2 times
\|

See a demo on regex101.com (and mind the modifiers!).
This captures entries of at least two consecutive words; if you need more, change the number in the {} quantifier.
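For anyone running this from Python rather than regex101, a minimal sketch follows. It assumes the modifiers in question are verbose mode (for the comments and whitespace) and multiline mode (so ^ matches on every line). The helper `compile_pattern`, its `min_words` parameter, and `SAMPLE` are hypothetical names introduced only for this example.

```python
import re

def compile_pattern(min_words=2):
    """Compile the commented pattern; min_words is the number used in the {} quantifier."""
    return re.compile(
        r"""
        ^\|                     # start of line, followed by |
        (                       # capture the "words"
            (?:[‘\w]+           # at least one of \w or ‘
                (?:[^|\w\n\r]+  # followed by NOT one of these
                |               # or
                (?=\|)          # a | straight ahead
            )
        ){%d,})                 # repeat the construct at least min_words times
        \|
        """ % min_words,
        re.VERBOSE | re.MULTILINE,
    )

# Hypothetical sample: a few rows taken from the table in the question.
SAMPLE = """\
|tip| - | - |
|missed out| - | - |
|wheeled his trolley| - | - |
|‘We’re stuck,| - | - |
|reached the end of its tether.| - | - |
"""

# With one capturing group, findall returns the captured cell text per matching row.
print(compile_pattern().findall(SAMPLE))
print(compile_pattern(min_words=3).findall(SAMPLE))
```

With the default of 2, every row except `tip` is returned, including the two-word `missed out`. With `min_words=3`, `missed out` drops out, while `‘We’re stuck,` still matches because the apostrophe splits `We’re` into two word units.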

Upvotes: 0
