Steve3p0

Reputation: 2527

grep regex one-liner?

I have a large file that contains all surface forms of the lexemes in a particular language. I want to extract just the present-tense verb forms, specifically first, second, and third person, singular and plural.

I tested the following regex in an online regex tester, and it correctly identifies the lines I am trying to extract.

regex: Vm-p\d.+(e|p)

Below is a sample of the file, with the lines I want to extract marked *match*:

сломе                   сломити               Vm-p3p-an-n---e *match*
сломи                   сломити               Vmmp2s-an-n---e
сломи                   сломити               Vm-p3s-an-n---e *match*
сломивши                сломити               Rvp
сломиле                 сломити               Vmps-pfan-n---e
сломим                  сломити               Vm-p1s-an-n---e *match*
сломимо                 сломити               Vm-p1p-an-n---e *match*
сломите                 сломити               Vm-p2p-an-n---e *match*
сломићеш                сломити               Vmif2s-an-n---e
сломиш                  сломити               Vm-p2s-an-n---e *match*
иде                     ићи                   Vmia2s-an-n---p
иде                     ићи                   Vm-p3s-an-n---p *match*
идем                    ићи                   Vm-p1s-an-n---p *match*
идемо                   ићи                   Vm-p1p-an-n---p *match*
идео                    ићи                   Vmps-sman-n---p
идете                   ићи                   Vm-p2p-an-n---p *match*
идеш                    ићи                   Vm-p2s-an-n---p *match*
идоше                   ићи                   Vmia3p-an-n---p
иду                     ићи                   Vm-p3p-an-n---p *match*
идући                   ићи                   Rvp
иђасте                  ићи                   Vmii2p-an-n---p
иђаху                   ићи                   Vmii3p-an-n---p
иђаше                   ићи                   Vmii2s-an-n---p
ићи                     ићи                   Vmn----an-n---p
ишавши                  ићи                   Rvp

However, when I run grep on the command line, I can only get parts of the pattern to work, not the whole thing. Is there a better way? I couldn't find a good reference online, and I expect to search for other patterns beyond this one.

What I have tried: the two commands below work, but how can I combine them into one?

$ grep -P "Vm-p\d.+e" input.txt >> sr_verbs.txt
$ grep -P "Vm-p\d.+p" input.txt >> sr_verbs.txt
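One possible way to merge the two commands is to pass both patterns to a single grep with repeated -e options. This is only a sketch: the temporary file stands in for input.txt, the words w1/l1 and so on are hypothetical placeholders, and only the tags come from the sample above. It uses -E with [0-9] rather than -P with \d, on the assumption that some grep builds accept only a single pattern with -P.

```shell
# Hypothetical stand-in for input.txt: placeholder words, tags taken from the sample.
sample=$(mktemp)
printf '%s\t%s\t%s\n' \
  'w1' 'l1' 'Vm-p3p-an-n---e' \
  'w2' 'l2' 'Vmmp2s-an-n---e' \
  'w3' 'l3' 'Vm-p1s-an-n---p' \
  'w4' 'l4' 'Rvp' > "$sample"

# Each -e supplies one pattern; a line is printed once if either pattern matches.
combined=$(grep -E -e 'Vm-p[0-9].+e' -e 'Vm-p[0-9].+p' "$sample")
printf '%s\n' "$combined"

rm -f "$sample"
```

Because a matching line is printed only once, this also avoids the duplicate lines that two separate appends could produce when a tag satisfies both patterns.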

Update: As @kevinji pointed out, my original regex should have worked with the -P option. I tried it again today and it did, so I am not sure what I did wrong the first time. In any case, this works fine:

$ grep -P "Vm-p\d.+(e|p)" input.txt

Upvotes: 0

Views: 123

Answers (2)

anubhava

Reputation: 785621

This is easier to handle with awk, which can test the third field directly:

awk '$3 ~ /^Vm-p[0-9]+.+[ep]/' file

сломе                   сломити               Vm-p3p-an-n---e *match*
сломи                   сломити               Vm-p3s-an-n---e *match*
сломим                  сломити               Vm-p1s-an-n---e *match*
сломимо                 сломити               Vm-p1p-an-n---e *match*
сломите                 сломити               Vm-p2p-an-n---e *match*
сломиш                  сломити               Vm-p2s-an-n---e *match*
иде                     ићи                   Vm-p3s-an-n---p *match*
идем                    ићи                   Vm-p1s-an-n---p *match*
идемо                   ићи                   Vm-p1p-an-n---p *match*
идете                   ићи                   Vm-p2p-an-n---p *match*
идеш                    ићи                   Vm-p2s-an-n---p *match*
иду                     ићи                   Vm-p3p-an-n---p *match*

With grep you can use:

grep -E '[[:blank:]]Vm-p[0-9]+.+[ep]' file
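Note that neither pattern above is anchored at the end, so the [ep] can match anywhere after the digit, not only as the last character of the tag. If the real file has nothing after the tag (an assumption; the *match* markers in the sample would defeat this), adding $ tightens the match. The sketch below uses a temporary file with hypothetical placeholder words, and the tag Vm-p3s-an-p---b is invented purely to show the difference.

```shell
# Hypothetical stand-in for the real file; the third tag is made up for illustration.
sample=$(mktemp)
printf '%s\t%s\t%s\n' \
  'w1' 'l1' 'Vm-p3s-an-n---e' \
  'w2' 'l2' 'Vm-p1p-an-n---p' \
  'w3' 'l3' 'Vm-p3s-an-p---b' \
  'w4' 'l4' 'Vmia2s-an-n---p' > "$sample"

# Unanchored: the invented tag also matches, because it contains a p after the digit.
unanchored=$(grep -E -c '[[:blank:]]Vm-p[0-9]+.+[ep]' "$sample")

# Anchored with $: the line must end in e or p.
anchored=$(grep -E -c '[[:blank:]]Vm-p[0-9]+.+[ep]$' "$sample")

# Same idea in awk, anchoring the third field at both ends.
awk_anchored=$(awk '$3 ~ /^Vm-p[0-9]+.+[ep]$/' "$sample" | grep -c .)

echo "unanchored=$unanchored anchored=$anchored awk=$awk_anchored"
rm -f "$sample"
```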

Upvotes: 3

Kevin Ji

Reputation: 10499

You'll want a regex "character class", written with square brackets, which matches any one of the characters listed inside:

grep -P 'Vm-p\d.+[ep]'

Note that [e|p] is actually slightly different; it matches the characters e, |, or p.

I'm slightly surprised that (e|p) didn't work for you; in fact, (?:e|p) (a non-capturing group) should be identical to [ep].
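A quick way to check these three forms is to run them against single-character lines. This is a minimal sketch using grep -E, so it covers [ep], (e|p), and [e|p] but not the Perl-only (?:e|p) syntax; the variable names are just for illustration.

```shell
# Four one-character test lines: e, p, a literal pipe, and x.
lines=$(printf '%s\n' 'e' 'p' '|' 'x')

# [ep] is a character class: e or p.
class=$(printf '%s\n' "$lines" | grep -E -c '^[ep]$')

# (e|p) is an alternation group: also e or p.
group=$(printf '%s\n' "$lines" | grep -E -c '^(e|p)$')

# [e|p] is a class of three characters: e, the pipe itself, or p.
pipe_class=$(printf '%s\n' "$lines" | grep -E -c '^[e|p]$')

echo "[ep]: $class  (e|p): $group  [e|p]: $pipe_class"
```

The first two counts agree, while [e|p] matches one extra line, the one holding the pipe character.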

Upvotes: 1
