Tags: regex, bash, awk, sed, grep

trilloyd

Reputation: 101

Match only the first paragraph using bash

We have

...a file containing paragraphs separated by two newlines (\r\n\r\n or \n\n). The paragraphs themselves may contain single newlines (\r\n or \n). The goal is to use a Bash one-liner to match only the first paragraph and print it to stdout.

E.g.:

$ cat foo.txt
Foo
* Bar

Baz
* Foobar

Even more stuff to match here.

results in:

$ cat foo.txt | <some-command>
Foo
* Bar

I've already tried

...this regex (?s)(.+?)(\r?\n){2}|.+?$ with grep using

Git Bash on Windows (GNU grep 3.1),
Bash on Lubuntu 20.4.1 LTS (GNU grep 3.4) and
iTerm+Fish on Mac (BSD grep 2.5.1-FreeBSD).

The first two approaches resulted in:

$ grep -Poz '(?s)(.+?)(\r?\n){2}|.+?$' foo.txt
Foo
* Bar

Baz
* Foobar

The approach on Mac failed due to differences between BSD grep and GNU grep.

But

... on regex101.com this regex works on foo.txt: https://regex101.com/r/uoej8O/1. Could this be because the global flag is disabled there?

Upvotes: 8

Answers (5)

James Brown

Reputation: 37424

For GNU awk if the paragraphs are separated by \r\n\r\n or \n\n:

$ awk -v RS="\r?\n\r?\n" '{print $0;exit}' file

Output:

Foo
* Bar

Upvotes: 5

Wiktor Stribiżew

Reputation: 627087

You can use a GNU grep like this:

grep -Poz '(?s)^.+?(?=\R{2}|$)' file

See the PCRE regex demo.

Details

(?s) - a DOTALL inline modifier that makes . match all chars including linebreak chars
^ - start of the whole string
.+? - any 1 or more chars, as few as possible
(?=\R{2}|$) - a positive lookahead that matches a location immediately followed by a double line break sequence (\R{2}) or the end of the string ($).

Upvotes: 4

anubhava

Reputation: 785481

This problem is tailor-made for GNU awk with a custom record separator. We can set RS to a regex that splits the input on two or more consecutive line breaks, each being an optional \r followed by \n:

awk -v RS='(\r?\n){2,}' 'NR == 1' file

This outputs:

Foo
* Bar

To make awk stop reading after the first record, which is more efficient when the input is very big:

awk -v RS='(\r?\n){2,}' '{print; exit}' file
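
For an awk where a regex RS may not be supported, a minimal sketch using POSIX paragraph mode (an empty RS, which splits records on blank lines) might help. This swaps in a different technique from the command above. The function name first_paragraph_posix is made up for illustration, and the tr step assumes it is acceptable to drop every carriage return from the output.

```shell
# first_paragraph_posix: hypothetical helper name, not from the original answer.
# Assumption: removing all \r characters from the output is acceptable.
first_paragraph_posix() {
  # Strip carriage returns so that CRLF blank lines become truly empty,
  # then let awk paragraph mode (RS set to the empty string) split on them.
  tr -d '\r' | awk -v RS= '{ print; exit }'
}

# Example input with CRLF line endings; prints the first paragraph with LF endings.
printf 'Foo\r\n* Bar\r\n\r\nBaz\r\n* Foobar\r\n' | first_paragraph_posix
```

Usage would be first_paragraph_posix < foo.txt. Unlike the GNU awk commands above, this does not preserve \r\n line endings in the printed paragraph.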

Upvotes: 8

potong

Reputation: 58473

This might work for you (GNU sed):

sed 'N;P;/\n\r\?$/Q;D' file

Open a two-line window and print its first line. If the second line of the window is empty (or holds only a carriage return), quit processing without printing anything else; otherwise delete the first line and repeat.
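
Where GNU sed is not available, the same stop-at-the-first-empty-line idea can be sketched in plain shell. This is not the sed command above, only a rough equivalent under stated assumptions: the name first_paragraph is hypothetical, and a line counts as empty if nothing remains after removing one trailing carriage return.

```shell
# first_paragraph: hypothetical helper name, not part of the original answer.
first_paragraph() {
  cr=$(printf '\r')   # a literal carriage return, for CRLF input
  # "|| [ -n "$line" ]" also handles a final line without a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    # An empty line (ignoring one trailing CR) ends the first paragraph.
    [ -z "${line%"$cr"}" ] && break
    printf '%s\n' "$line"
  done
}

# Example: only the lines before the first empty line are printed.
printf 'Foo\n* Bar\n\nBaz\n* Foobar\n' | first_paragraph
```

Usage would be first_paragraph < foo.txt. Carriage returns on the printed lines are kept as they are. One difference from the sed command: if the input starts with an empty line, this sketch prints nothing.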

Upvotes: 0

rethab

Reputation: 8433

If you only want the first paragraph and the paragraphs are separated by an empty line, then this might work:

awk '!NF{ exit } 1' foo.txt
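
The question also mentions \r\n\r\n separators, where the separator line still contains a carriage return and so may not count as having zero fields. A hedged variant that tests a CR-stripped copy of each line could look like the sketch below. The wrapper name first_paragraph_awk is made up for illustration.

```shell
# first_paragraph_awk: hypothetical wrapper name, not from the original answer.
first_paragraph_awk() {
  awk '
    { line = $0; sub(/\r$/, "", line) }  # test a copy with one trailing CR removed
    line == "" { exit }                  # the first empty line ends the paragraph
    { print }                            # print the original line unchanged
  ' "$@"
}

# Example with CRLF input; the printed lines keep their original \r\n endings.
printf 'Foo\r\n* Bar\r\n\r\nBaz\r\n* Foobar\r\n' | first_paragraph_awk
```

Usage would be first_paragraph_awk foo.txt. Unlike the !NF test, a line holding only spaces or tabs does not end the paragraph in this sketch.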