Reputation: 61
I have 10,000 descriptions and I want to use regular expressions to extract the number associated with the phrase "arrested".
For example:
"police arrests 4 people"
"7 people were arrested".
The numbers range from 1 to 99.
I have tried the following code:
gen arrest= regexm(description, "(^[1-9][0-9]$)[ ]*(arrests|arrested)")
I cannot simply extract just the number, because the descriptions also mention numbers that have nothing to do with arrests.
Upvotes: 5
Views: 9644
Reputation:
The following works for me (solution based on @PoulBak's idea):
clear
input strL var1
"This is 1 long string saying that police arrests 4 people"
"3 news outlets today reported that 7 people were arrested"
"several witnesses saw 5 people arrested and other 3 killed"
end
generate var2 = ustrregexs(0) if ustrregexm(var1, "(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))")
list
     +-------------------------------------------------------------------------------------+
     | var1                                                         var2                   |
     |-------------------------------------------------------------------------------------|
  1. | This is 1 long string saying that police arrests 4 people    arrests 4              |
  2. | 3 news outlets today reported that 7 people were arrested    7 people were arrested |
  3. | several witnesses saw 5 people arrested and other 3 killed   5 people arrested      |
     +-------------------------------------------------------------------------------------+
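Note that var2 above holds the whole matched substring (for example "arrests 4"), since ustrregexs(0) returns the full match rather than a capture group. As a rough illustration of how the pattern's two capture groups can be used to keep only the number, here is a minimal sketch in Python's re module, not Stata. The helper name arrest_count is hypothetical, and Stata's regex engine may differ in details. In Stata the analogous step would presumably use ustrregexs(1) and ustrregexs(2).

```python
import re

# Same pattern as in the Stata command above: group 1 is the number when it
# precedes the keyword, group 2 is the number when it follows the keyword.
PATTERN = re.compile(
    r"(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))"
    r"|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))"
)


def arrest_count(text):
    """Return the number tied to 'arrests'/'arrested', or None if no match.

    Hypothetical helper for illustration only.
    """
    match = PATTERN.search(text)
    if match is None:
        return None
    # Exactly one of the two groups participates in any given match.
    return int(match.group(1) or match.group(2))


samples = [
    "This is 1 long string saying that police arrests 4 people",
    "3 news outlets today reported that 7 people were arrested",
    "several witnesses saw 5 people arrested and other 3 killed",
]
for sample in samples:
    print(arrest_count(sample))
```

For the three sample strings this prints 4, 7 and 5, that is, the numbers embedded in the var2 values listed above.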
Upvotes: 2
Reputation: 10930
You can use this regex:
(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))
It splits the search into two alternatives, depending on whether the number comes before or after 'arrests|arrested'.
The first alternative is a non-capturing group. Inside it, a capturing group matches the number: an optional digit from 1-9 followed by a digit from 0-9. This is followed by 0 to 20 letters or spaces (the words in between) and then 'arrests' or 'arrested'.
The second alternative handles the opposite situation, where the number comes last.
As a result, the pattern matches only if the number is within 20 characters of 'arrests|arrested', and only if those characters are letters or spaces.
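To make the 20-character window concrete, here is a small sketch using Python's re module (an illustration only, not Stata code). The helper name has_nearby_number and the test sentences are made up for this example.

```python
import re

PATTERN = re.compile(
    r"(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))"
    r"|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))"
)


def has_nearby_number(text):
    """True if a number sits within 20 letters/spaces of the keyword."""
    return PATTERN.search(text) is not None


# Number 13 characters before the keyword: matches.
print(has_nearby_number("7 people were arrested"))
# Number directly after the keyword: matches via the second alternative.
print(has_nearby_number("police arrested 12 suspects"))
# More than 20 characters between number and keyword: no match.
print(has_nearby_number("7 people from the neighbouring village were arrested"))
# A comma in between is neither a letter nor a space: no match.
print(has_nearby_number("2 men, both armed, were arrested"))
```

The last two cases show the trade-off: a tight window keeps unrelated numbers out, but it also skips long or punctuated phrases, so the limit of 20 may need tuning for real descriptions.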
Upvotes: 4
Reputation: 10139
Perhaps something like this?
(\d+)[^,.\d\n]+?(?=arrest|custody)|(?<=arrest|custody)[^,.\d\n]+?(\d+)
Keep in mind that this will not match textual versions of the number (e.g., "five people were arrested"), so you would have to incorporate that if desired.
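As a quick way to try the pattern, here is a minimal sketch in Python's re module (an illustration, not Stata code). One adaptation is an assumption forced by the engine: Python requires fixed-width lookbehinds, so (?<=arrest|custody) is written as two separate lookbehinds. The helper name number_near_term and the sample sentences are hypothetical.

```python
import re

# Number before the watched term (capture group 1).
BEFORE = r"(\d+)[^,.\d\n]+?(?=arrest|custody)"
# Number after the watched term (capture group 2). Python's re needs
# fixed-width lookbehinds, hence one lookbehind per term.
AFTER = r"(?:(?<=arrest)|(?<=custody))[^,.\d\n]+?(\d+)"
PATTERN = re.compile(BEFORE + "|" + AFTER)


def number_near_term(text):
    """Return the captured number as an int, or None if nothing matches."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1) or match.group(2))


print(number_near_term("police arrests 4 people"))
print(number_near_term("7 people were arrested"))
print(number_near_term("police took 5 people into custody"))
# The period keeps the number and the term in separate sentences.
print(number_near_term("3 people were killed. Police made several arrests"))
```

The first three calls print 4, 7 and 5; the last prints None because the excluded period stops the match from crossing the sentence boundary.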
(\d+)[^,.\d\n]+?(?=arrest|custody)
First option, used when the number comes before the watched terms.
(\d+)
The number to capture; + means one or more digits.
[^,.\d\n]+?
Matches anything except a comma (,), a period (.), a digit (\d), or a newline (\n). Excluding these prevents false positives from different sentences (the number and the term must be in the same sentence). +? means one or more times (lazy).
(?=arrest|custody)
Positive lookahead checking for either word.
(?<=arrest|custody)[^,.\d\n]+?(\d+)
Second option, used when the number comes after the watched terms.
(?<=arrest|custody)
Positive lookbehind checking that the word comes before the number.
[^,.\d\n]+?
Same as above: anything except a comma, period, digit, or newline, one or more times (lazy).
(\d+)
The number to capture; + means one or more digits.
If you want to add textual representations of your numbers, you would incorporate them into the (\d+) capturing group.
If you have any additional terms to watch for other than arrest or custody, you would add those terms to both lookaround groups.
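The two extensions just described could look roughly like the following Python sketch (an illustration with Python's re module, not Stata code). Everything beyond the original pattern is an assumption made for the example: the small WORDS table, the extra term "detain", the \b word boundaries (added so that "one" does not match inside "someone"), the case-insensitive flag, and the helper name number_near_term.

```python
import re

# Hypothetical, deliberately tiny table of textual numbers.
WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
# "detain" is a made-up extra term, added to both lookarounds.
TERMS = ["arrest", "custody", "detain"]

NUMBER = r"\b(\d+|" + "|".join(WORDS) + r")\b"
GAP = r"[^,.\d\n]+?"
LOOKAHEAD = "(?=" + "|".join(TERMS) + ")"
# One fixed-width lookbehind per term, as Python's re requires.
LOOKBEHIND = "(?:" + "|".join("(?<=" + t + ")" for t in TERMS) + ")"

PATTERN = re.compile(
    NUMBER + GAP + LOOKAHEAD + "|" + LOOKBEHIND + GAP + NUMBER,
    re.IGNORECASE,
)


def number_near_term(text):
    """Return the number (digits or a known word) near a watched term."""
    match = PATTERN.search(text)
    if match is None:
        return None
    token = (match.group(1) or match.group(2)).lower()
    return WORDS[token] if token in WORDS else int(token)


print(number_near_term("Five people were arrested"))
print(number_near_term("police detained 12 people"))
print(number_near_term("someone was arrested"))
```

These calls print 5, 12 and None. A real word list would need to cover every spelling that occurs in the data, up to 99 in the question's case.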
Upvotes: 2