Reputation: 313
Just learned basics of regex today, and, with some effort, managed to knock together something that ALMOST works.
I've got documents from a book where I need to find articles (a, an, the) within bullets, as opposed to prose.
Sample of a bullet:
· Lorem ipsum lorem (XXX) Lorem · Lorem the ipsum · Lorem ipsum, lorem, and
Sample of prose: (Right) The lorem wrote the ipsum. Lorem ipsum verb ipsum.
So far this does the trick more or less:
# \- is a literal hyphen; left unescaped, +-= is read as a character range
$regexArticles = '^· [\w ,:;()+\-=&.·]*\b( the | a | an |The |An )\b.*$'
$articlecount = Select-String -Path $textfile -Pattern $regexArticles -AllMatches
"Article Count: " + $articlecount.Matches.Count
To make that a little more readable, I'll explain my thinking: if the line begins with a bullet, and what follows is any number of word characters, spaces, and the characters ", : ( ) + - = & . ;", match the line if it also contains an article.
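One thing to watch in a character class like this: a hyphen sitting between two characters, as in +-=, is treated as a range (everything from + to =, which includes digits, period, slash and more), not as three separate characters. A minimal sketch of the difference, using two throwaway patterns that are only for illustration:

```powershell
# Hypothetical one-character patterns, just to show how the hyphen is read.
$asRange   = '^[+-=]$'    # hyphen between + and = : range U+002B through U+003D
$asLiteral = '^[+\-=]$'   # escaped hyphen: exactly the three characters + - =

$rangeTakesDigit   = '5' -match $asRange     # '5' is U+0035, inside the range
$literalTakesDigit = '5' -match $asLiteral   # '5' is not one of + - =
$literalTakesMinus = '-' -match $asLiteral   # the hyphen itself still matches

"Range form matches '5':   $rangeTakesDigit"
"Literal form matches '5': $literalTakesDigit"
"Literal form matches '-': $literalTakesMinus"
```

Escaping the hyphen, or moving it to the very end of the class, keeps it literal.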
The problem: this doesn't match when the bullet wraps onto a second line, as in the following case:
· Lorem ipsum lorem (XXX) Lorem · Lorem the ipsum · Lorem ipsum, lorem, and
lorem lorem the lorem lorem
How do I retain this sort of logic when the string I want to grab contains line breaks such as this?
If there's an easier way, perhaps just excluding all sentences that contain a period, that would be great (the only problem with that is sometimes those bullets will incorrectly contain periods).
EDIT
I just realized that what "almost" worked in Sublime Text didn't really work at all in PowerShell. Even though the regex returns matches in Sublime Text, it does NOT in PowerShell.
Now I know why: Sublime Text can handle the bullet character, but the shell couldn't, so the character was silently dropped and I didn't notice. Now I just need to know the proper way to specify the bullet's Unicode code point and pass it in the same way.
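One possible way to sidestep the problem is to keep the bullet out of the script text entirely and use the .NET regex escape \uXXXX instead. This is only a sketch, and it assumes the bullet really is U+00B7 (MIDDLE DOT); the sample lines are built from [char]0x00B7 so the script does not depend on how it was saved or pasted:

```powershell
# Assumption: the bullet is U+00B7 (MIDDLE DOT). \u00B7 is the .NET regex
# escape for that code point, so the pattern itself is plain ASCII.
$regexArticles = '^\u00B7 [\w ,:;()+\-=&.\u00B7]*\b( the | a | an |The |An )\b.*$'

# Build the sample lines without typing the bullet character literally.
$dot    = [char]0x00B7
$bullet = "$dot Lorem ipsum lorem (XXX) Lorem $dot Lorem the ipsum $dot Lorem ipsum, lorem, and"
$prose  = '(Right) The lorem wrote the ipsum. Lorem ipsum verb ipsum.'

# -match is case-insensitive by default, like Select-String.
$bulletHit = $bullet -match $regexArticles
$proseHit  = $prose  -match $regexArticles

"Bullet line matches: $bulletHit"
"Prose line matches:  $proseHit"
```

The same pattern string can be passed to Select-String -Pattern. If the text file is UTF-8, adding -Encoding UTF8 to Select-String may also be needed so the bullet in the file is decoded as U+00B7.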
Upvotes: 0
Views: 140
Reputation: 313
As a somewhat hackish fix, because I could not figure out how to match the middle dot character (U+00B7, decimal 183), I was able to work around it by excluding what I did NOT want to find.
"^[^\d^(^\s] *\b( the | a | an |The |An )\b.*$"
I didn't want any lines that began with a number, and I did not want lines that began with an open parenthesis. For now, this works. Unfortunately, I'm going to have to resolve this issue for other regex searches for my application to be useful.
In answer to my original questions, I had an epiphany that I could just add the optional \n? to account for potential line breaks! Final expression looks like this:
^[^\w\d\s(].*\n?\r*?.*\b( the | a | an |The |An )\b.*$
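A caveat worth checking: Select-String -Path tests each line of the file separately, so the \n? can only see a line break if the text is handed to the regex as one string. A rough sketch of that, assuming the file would be read whole with something like Get-Content -Raw (the sample text and the $text and $found names are made up for illustration), and with (?m) added so that ^ and $ still anchor at each line:

```powershell
$dot = [char]0x00B7   # assumed bullet character, U+00B7

# In real use this might be: $text = Get-Content -Raw -Encoding UTF8 -Path $textfile
# Here: a bullet that wraps onto a second line, followed by one prose line.
$text = "$dot Lorem ipsum, lorem, and`nlorem lorem the lorem lorem`n(Right) The lorem wrote the ipsum."

# The final expression from above, unchanged except for the leading (?m),
# which makes ^ and $ match at the start and end of every line.
$pattern = '(?m)^[^\w\d\s(].*\n?\r*?.*\b( the | a | an |The |An )\b.*$'

# [regex]::Matches is case-sensitive, unlike Select-String's default.
$found = [regex]::Matches($text, $pattern)
"Article Count: " + $found.Count
```

Note that this counts matching bullets (each spanning at most two lines), not individual articles, and that a prose line directly after a bullet would be treated as its continuation.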
Upvotes: 1