Nick Adams

Reputation: 199

Using regular expressions to extract string from text file

Essentially, I have a .txt document with this in it:

The sound of a horse at a gallop came fast and furiously up the hill.
"So-ho!" the guard sang out, as loud as he could roar.
"Yo there! Stand! I shall fire!"
The pace was suddenly checked, and, with much splashing and floundering, a man's voice called from the mist, "Is that the Dover mail?"
"Never you mind what it is!" the guard retorted. "What are you?"
"_Is_ that the Dover mail?"
"Why do you want to know?"
"I want a passenger, if it is."
"What passenger?"
"Mr. Jarvis Lorry."
Our booked passenger showed in a moment that it was his name.
The guard, the coachman, and the two other passengers eyed him distrustfully.

Using regex, I need to print everything within double quotes. I don't want the full code; I just need to know how I should go about doing it and which regex would be most useful. Tips and pointers, please!

Upvotes: 2

Views: 82

Answers (2)

Cyphase

Reputation: 12002

This should do it (explanation below):

from __future__ import print_function

import re

txt = """The sound of a horse at a gallop came fast and furiously up the hill.
"So-ho!" the guard sang out, as loud as he could roar.
"Yo there! Stand! I shall fire!"
The pace was suddenly checked, and, with much splashing and floundering,
a man's voice called from the mist, "Is that the Dover mail?"
"Never you mind what it is!" the guard retorted. "What are you?"
"_Is_ that the Dover mail?"
"Why do you want to know?"
"I want a passenger, if it is."
"What passenger?"
"Mr. Jarvis Lorry."
Our booked passenger showed in a moment that it was his name.
The guard, the coachman, and the two other passengers eyed him distrustfully.
"""

strings = re.findall(r'"(.*?)"', txt)

for s in strings:
    print(s)

Result:

So-ho!
Yo there! Stand! I shall fire!
Is that the Dover mail?
Never you mind what it is!
What are you?
_Is_ that the Dover mail?
Why do you want to know?
I want a passenger, if it is.
What passenger?
Mr. Jarvis Lorry.

r'"(.*?)"' will match every string within double quotes. The parentheses indicate a capture group, so you'll only get the text without the double quotes. The . matches any character (except for a newline), and the * means "zero or more of the last thing", the last thing being the . here. The ? after the * makes the * "non-greedy", which means it matches as little as possible. If you didn't use the ?, each match would run from the first double quote on a line to the last double quote on that same line, so a line with two quotations would come back as one string (and with re.DOTALL you'd get a single result covering everything between the first and last double quote in the text).
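To see the greedy versus non-greedy difference concretely, here is a minimal sketch that runs both patterns on one line taken from the question's text. The variable names (line, lazy, greedy) are just for illustration.

```python
import re

# One line from the question's text that contains two separate quotations.
line = '"Never you mind what it is!" the guard retorted. "What are you?"'

# Non-greedy: each match stops at the nearest closing double quote.
lazy = re.findall(r'"(.*?)"', line)

# Greedy: the match runs on to the last double quote on the line.
greedy = re.findall(r'"(.*)"', line)

print(lazy)    # two items, one per quotation
print(greedy)  # one item that swallows the narration in between
```

The non-greedy pattern is the one you want here, since the greedy one merges the two quotations and the narration between them into a single string.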

You can include the re.DOTALL flag so that . will also match newline characters, if you want to extract strings that cross lines. To do that, use re.findall(r'"(.*?)"', txt, re.DOTALL). Any newline inside the quotes will be included in the captured string, so you may need to strip or replace it afterwards.
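As a rough illustration of the re.DOTALL behaviour, the sketch below uses a made-up sample (not from the question's file) in which one quotation wraps across two lines. Collapsing whitespace with split/join is just one possible way to deal with the embedded newline.

```python
import re

# Hypothetical sample: a single quotation that wraps onto a second line.
txt = 'He said, "Is that\nthe Dover mail?" and waited.'

# Without the flag, . cannot cross the newline, so nothing matches.
without_flag = re.findall(r'"(.*?)"', txt)

# With re.DOTALL, . also matches the newline, so the quotation is found.
with_flag = re.findall(r'"(.*?)"', txt, re.DOTALL)

# Collapse every run of whitespace (including the newline) to one space.
cleaned = [" ".join(s.split()) for s in with_flag]

print(without_flag)
print(with_flag)
print(cleaned)
```

Note that with re.DOTALL a stray unmatched double quote can pair up with a quote many lines later, so it is worth checking the results on real text.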

Explanation unavoidably similar to / based on @TigerhawkT3's answer. Vote that answer up, too!

Upvotes: 0

TigerhawkT3

Reputation: 49320

r'(".*?")' will match every string within double quotes. The parentheses indicate a captured group (here the group includes the double quotes themselves), the . matches any character (except for a newline), the * indicates zero or more repetitions, and the ? makes it non-greedy (it stops matching at the next double quote). If you want, include the re.DOTALL option to make . also match newline characters.
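The two answers differ only in where the parentheses sit, which decides whether the double quotes end up in the result. A small sketch on one line from the question's text (variable names are illustrative):

```python
import re

# One line from the question's text.
sample = 'a man\'s voice called from the mist, "Is that the Dover mail?"'

# Quotes inside the group: the captured text keeps its double quotes.
with_quotes = re.findall(r'(".*?")', sample)

# Quotes outside the group: only the text between them is captured.
without_quotes = re.findall(r'"(.*?)"', sample)

print(with_quotes)
print(without_quotes)
```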

Upvotes: 3
