Reputation: 2527
I have a large file that contains all surface forms of lexemes in a particular language. I wanted to extract just the verb inflection patterns, specifically 1st, 2nd, 3rd person singular and plural in the present tense.
I tested the following regex using this online tool and it correctly identifies the lines I am trying to extract.
regex: Vm-p\d.+(e|p)
Below is a sample of what the file looks like with lines that are a *match*:
сломе сломити Vm-p3p-an-n---e *match*
сломи сломити Vmmp2s-an-n---e
сломи сломити Vm-p3s-an-n---e *match*
сломивши сломити Rvp
сломиле сломити Vmps-pfan-n---e
сломим сломити Vm-p1s-an-n---e *match*
сломимо сломити Vm-p1p-an-n---e *match*
сломите сломити Vm-p2p-an-n---e *match*
сломићеш сломити Vmif2s-an-n---e
сломиш сломити Vm-p2s-an-n---e *match*
иде ићи Vmia2s-an-n---p
иде ићи Vm-p3s-an-n---p *match*
идем ићи Vm-p1s-an-n---p *match*
идемо ићи Vm-p1p-an-n---p *match*
идео ићи Vmps-sman-n---p
идете ићи Vm-p2p-an-n---p *match*
идеш ићи Vm-p2s-an-n---p *match*
идоше ићи Vmia3p-an-n---p
иду ићи Vm-p3p-an-n---p *match*
идући ићи Rvp
иђасте ићи Vmii2p-an-n---p
иђаху ићи Vmii3p-an-n---p
иђаше ићи Vmii2s-an-n---p
ићи ићи Vmn----an-n---p
ишавши ићи Rvp
However, when I try to use grep on the command line, I can only get parts of it to work but not the whole thing together. Is there a better way? I wasn't able to find a good reference online. I am expecting that I'll be searching for other patterns beyond this.
What have I tried? This works, but how can I combine them?
$ grep -P "Vm-p\d.+e" input.txt >> sr_verbs.txt
$ grep -P "Vm-p\d.+p" input.txt >> sr_verbs.txt
Update: As @kevinji pointed out, my original regex should have worked with the -P option. I tried it again today and it did, so I am not sure what I did wrong the first time. This works:
$ grep -P "Vm-p\d.+(e|p)" input.txt
Upvotes: 0
Views: 123
Reputation: 785621
It is easier to handle with awk:
awk '$3 ~ /^Vm-p[0-9]+.+[ep]/' file
сломе сломити Vm-p3p-an-n---e *match*
сломи сломити Vm-p3s-an-n---e *match*
сломим сломити Vm-p1s-an-n---e *match*
сломимо сломити Vm-p1p-an-n---e *match*
сломите сломити Vm-p2p-an-n---e *match*
сломиш сломити Vm-p2s-an-n---e *match*
иде ићи Vm-p3s-an-n---p *match*
идем ићи Vm-p1s-an-n---p *match*
идемо ићи Vm-p1p-an-n---p *match*
идете ићи Vm-p2p-an-n---p *match*
идеш ићи Vm-p2s-an-n---p *match*
иду ићи Vm-p3p-an-n---p *match*
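Because awk tests only the third field, the pattern can also be anchored at both ends, which is harder to do with a whole-line grep. The following is a rough sketch, not a drop-in replacement: it assumes, based only on the sample lines in the question, that the character after `Vm-p` is the person (1, 2 or 3) and the next one is the number (`s` or `p`). The variable names `sample` and `present` are made up for the example.

```shell
# A few lines copied from the sample in the question.
sample=$(printf '%s\n' \
  'сломи сломити Vmmp2s-an-n---e' \
  'сломи сломити Vm-p3s-an-n---e' \
  'иде ићи Vmia2s-an-n---p' \
  'иде ићи Vm-p3s-an-n---p' \
  'ићи ићи Vmn----an-n---p')

# Assumed tag layout: Vm-p + person [123] + number [sp] + ... + final e or p.
# ^ and $ anchor to the start and end of field 3, not of the whole line.
present=$(printf '%s\n' "$sample" | awk '$3 ~ /^Vm-p[123][sp].*[ep]$/')

printf '%s\n' "$present"
```

With these five input lines, only the two `Vm-p3s` lines are printed. If the real tag set uses other values in those positions, the bracket expressions would need adjusting.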
With grep you can use:
grep -E '[[:blank:]]Vm-p[0-9]+.+[ep]' file
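If the two original patterns are easier to keep separate, grep also accepts several `-e` options in one call and prints every line that matches any of them, so there is no need to run grep twice and append. A small sketch using lines from the question's sample (the variable names `sample` and `combined` are only for illustration):

```shell
# A few lines copied from the sample in the question.
sample=$(printf '%s\n' \
  'сломи сломити Vmmp2s-an-n---e' \
  'сломи сломити Vm-p3s-an-n---e' \
  'иде ићи Vm-p3s-an-n---p' \
  'идући ићи Rvp')

# Each -e adds one pattern; a line is printed once if either pattern matches.
combined=$(printf '%s\n' "$sample" |
  grep -E -e 'Vm-p[0-9].+e' -e 'Vm-p[0-9].+p')

printf '%s\n' "$combined"
```

Here the two `Vm-p3s` lines are printed and the other two are skipped. This form may be convenient when more unrelated patterns are added later, since each one gets its own `-e`.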
Upvotes: 3
Reputation: 10499
You'll want to use a regex "character class", written with brackets, which means "any one of the characters contained here":
grep -P 'Vm-p\d.+[ep]'
Note that [e|p] is actually slightly different; it matches the characters e, |, or p.
I'm slightly surprised that (e|p) didn't work for you; in fact, (?:e|p) (a non-capturing group) should be identical to [ep].
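The difference between the three forms can be checked on a few throwaway strings. This is only a sketch: the inputs `xe`, `x|` and `xp` are made up for the test and are not taken from the file in the question, and it uses `grep -E` so that it does not depend on `-P` being available.

```shell
# Three hypothetical inputs: one ends in e, one in a literal |, one in p.
sample=$(printf '%s\n' 'xe' 'x|' 'xp')

# [ep] matches e or p only.
class_count=$(printf '%s\n' "$sample" | grep -c -E 'x[ep]')

# [e|p] also treats | as an ordinary member of the class.
pipe_count=$(printf '%s\n' "$sample" | grep -c -E 'x[e|p]')

# (e|p) is alternation; for single characters it matches the same as [ep].
alt_count=$(printf '%s\n' "$sample" | grep -c -E 'x(e|p)')

echo "$class_count $pipe_count $alt_count"   # prints: 2 3 2
```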
Upvotes: 1