Reputation: 4659
Continuing from my earlier post, I have changed the query because, according to femtoRgon's post, some characters and anchors are not supported by Elasticsearch.
I am looking for a way to match a pattern like "xxx-xx-xxxx" in order to find documents containing social security numbers using Elasticsearch.
Suppose that, among the indexed documents, I want to find all those that contain a social security number matching the "xxx-xx-xxxx" pattern.
Sample code for indexing the document:
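import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

InputStream is = null;
try {
    is = new FileInputStream("/home/admin/Downloads/20121221.doc");
    // Tika detects the file type and extracts the body text into the handler.
    ContentHandler contenthandler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    parser.parse(is, contenthandler, metadata, new ParseContext());
}
catch (Exception e) {
    e.printStackTrace();
}
finally {
    if (is != null) is.close();
}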
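Sample code for searching:
import org.elasticsearch.action.search.SearchRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHits;

// Match every document, then filter on the regular expression against "_all".
QueryBuilder queryBuilderFullText = QueryBuilders.filteredQuery(QueryBuilders.matchAllQuery(),
        FilterBuilders.regexpFilter("_all", "[0-9]{3}?[0-9]{2}?[0-9]{4}"));
SearchRequestBuilder requestBuilder = client.prepareSearch()
        .setIndices(getDomainIndexId(project))
        .setTypes(getProjectTypeId(project))
        .setQuery(queryBuilderFullText);
SearchResponse response = requestBuilder.execute().actionGet(ES_TIMEOUT_MS);
SearchHits hits = response.getHits();
if (hits.getTotalHits() > 0) {
    System.out.println(hits.getTotalHits());
} else {
    return 0L;
}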
I am getting hits for all of the following:
45-555-5462
457-55-5462
4578-55-5462
457-55-54623
457-55-5462-23
But as per my requirement, it should only return "457-55-5462" (based on pattern matching "xxx-xx-xxxx").
Please help.
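To make the requirement concrete, here is a minimal sketch that runs the five sample values through the strict "xxx-xx-xxxx" pattern. It uses plain `java.util.regex`, not the Elasticsearch regexp filter, so it only illustrates the intended result and may not reflect how the filter treats indexed terms. The class name `SsnPatternCheck` and the method `isSsn` are made up for this example.

```java
import java.util.regex.Pattern;

final class SsnPatternCheck {
    // Strict "xxx-xx-xxxx": exactly 3, 2 and 4 digits with mandatory hyphens.
    private static final Pattern SSN =
            Pattern.compile("[0-9]{3}-[0-9]{2}-[0-9]{4}");

    // matches() succeeds only if the whole input fits the pattern,
    // so no ^ or $ anchors are needed here.
    static boolean isSsn(String candidate) {
        return SSN.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "45-555-5462", "457-55-5462", "4578-55-5462",
            "457-55-54623", "457-55-5462-23"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isSsn(s));
        }
    }
}
```

Of the five samples, only "457-55-5462" passes this check, which is the behaviour the search should reproduce.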
Upvotes: 1
Views: 1888
Reputation: 4659
Seeing as ^, $ and \d can't be used, I would do this:
[^0-9-][0-9]{3}-[0-9]{2}-[0-9]{4}[^0-9-]
Or in Java:
FilterBuilders.regexpFilter("_all", "[^0-9-][0-9]{3}-[0-9]{2}-[0-9]{4}[^0-9-]")
This checks that the number is neither preceded nor followed by another digit or dash. It does require some character before and after the match, though, so it won't catch documents that have the social security number at the very beginning or very end.
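The behaviour described above, including the limitation at the start and end of the text, can be checked with a small sketch. It uses `java.util.regex` with `find()` on a whole string as a stand-in; the Elasticsearch regexp filter works on indexed terms and may behave differently. The names `BoundaryClassCheck` and `containsGuardedSsn` are hypothetical.

```java
import java.util.regex.Pattern;

final class BoundaryClassCheck {
    // Same pattern as in the answer: one character that is neither a digit
    // nor a dash must sit directly before and directly after the number.
    private static final Pattern GUARDED =
            Pattern.compile("[^0-9-][0-9]{3}-[0-9]{2}-[0-9]{4}[^0-9-]");

    // find() looks for the pattern anywhere inside the text.
    static boolean containsGuardedSsn(String text) {
        return GUARDED.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] texts = {
            "SSN: 457-55-5462.",   // surrounded by other characters
            "457-55-5462",         // number is the entire text
            "id 4578-55-5462 x",   // extra leading digit
            "id 457-55-5462-23 x"  // extra trailing group
        };
        for (String t : texts) {
            System.out.println("\"" + t + "\" -> " + containsGuardedSsn(t));
        }
    }
}
```

Only the first text is reported as a match. The second one shows the limitation: a number with nothing before or after it is missed because both guard characters are mandatory.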
Upvotes: 1
Reputation: 174816
You forgot to add - before ? in your regex. Also use anchors if necessary.
"[0-9]{3}-?[0-9]{2}-?[0-9]{4}"
OR
"^[0-9]{3}-?[0-9]{2}-?[0-9]{4}$"
Upvotes: 0
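As a rough illustration of what the optional-hyphen pattern above accepts, the sketch below tests it with `java.util.regex`, where `matches()` already requires the whole input to fit, so the `^` and `$` variant is not needed. Whether the Elasticsearch regexp filter accepts those anchors is a separate question that this example does not settle. The names `OptionalHyphenCheck` and `matchesLoose` are made up for illustration.

```java
import java.util.regex.Pattern;

final class OptionalHyphenCheck {
    // "-?" makes each hyphen optional, so the digits may also run together.
    private static final Pattern LOOSE =
            Pattern.compile("[0-9]{3}-?[0-9]{2}-?[0-9]{4}");

    // matches() is implicitly anchored at both ends of the input.
    static boolean matchesLoose(String candidate) {
        return LOOSE.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "457-55-5462", "457555462", "4578-55-5462", "45-555-5462"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + matchesLoose(s));
        }
    }
}
```

Note that this pattern also accepts the unhyphenated form "457555462", which is broader than the strict "xxx-xx-xxxx" requirement in the question.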