Reputation: 109387
I am using Lucene in PHP (using the Zend Framework implementation). I am having a problem that I cannot search on a field which contains a number.
Here is the data in the index:
ts            | contents
--------------+-----------------
1236917100    | dog cat gerbil
1236630752    | cow pig goat
1235680249    | lion tiger bear
nonnumeric    | bass goby trout
My problem: A query for "ts:1236630752" returns no hits. However, a query for "ts:nonnumeric" returns a hit.
I am storing "ts" as a keyword field, which according to the documentation "is not tokenized, but is indexed and stored. Useful for non-text fields, e.g. date or url." I have tried treating it as a "text" field, but the behavior is the same except that a query for "ts:*" returns nothing when ts is text.
I'm using Zend Framework 1.7 (just downloaded the latest 3 days ago) and PHP 5.2.9. Here is my code:
<?php
//=========================================================
// Initializes Zend Framework (Zend_Loader).
//=========================================================
set_include_path(realpath('../library') . PATH_SEPARATOR . get_include_path());
require_once('Zend/Loader.php');
Zend_Loader::registerAutoload();
//=========================================================
// Delete existing index and create a new one
//=========================================================
define('SEARCH_INDEX', 'test_search_index');
if(file_exists(SEARCH_INDEX))
    foreach(scandir(SEARCH_INDEX) as $file)
        if(!is_dir($file))
            unlink(SEARCH_INDEX . "/$file");
$index = Zend_Search_Lucene::create(SEARCH_INDEX);
//=========================================================
// Create this data in index:
// ts | contents
// --------------+-----------------
// 1236917100 | dog cat gerbil
// 1236630752 | cow pig goat
// 1235680249 | lion tiger bear
// nonnumeric | bass goby trout
//=========================================================
function add_to_index($index, $ts, $contents) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::Keyword('ts', $ts));
    $doc->addField(Zend_Search_Lucene_Field::Text('contents', $contents));
    $index->addDocument($doc);
}
add_to_index($index, '1236917100', 'dog cat gerbil');
add_to_index($index, '1236630752', 'cow pig goat');
add_to_index($index, '1235680249', 'lion tiger bear');
add_to_index($index, 'nonnumeric', 'bass goby trout');
//=========================================================
// Run some test queries and output results
//=========================================================
echo '<html><body><pre>';
function run_query($index, $query) {
    echo "Running query: $query\n";
    $hits = $index->find($query);
    echo 'Got ' . count($hits) . " hits.\n";
    foreach($hits as $hit)
        echo " ts='$hit->ts', contents='$hit->contents'\n";
    echo "\n";
}
run_query($index, 'pig'); //1 hit
run_query($index, 'ts:1236630752'); //0 hits
run_query($index, '1236630752'); //0 hits
run_query($index, 'ts:pig'); //0 hits
run_query($index, 'contents:pig'); //1 hits
run_query($index, 'ts:[1236630700 TO 1236630800]'); //0 hits (range query)
run_query($index, 'ts:*'); //4 hits if ts is keyword, 1 hit otherwise
run_query($index, 'nonnumeric'); //1 hits
run_query($index, 'ts:nonnumeric'); //1 hits
run_query($index, 'trout'); //1 hits
Output
Running query: pig
Got 1 hits.
 ts='1236630752', contents='cow pig goat'

Running query: ts:1236630752
Got 0 hits.

Running query: 1236630752
Got 0 hits.

Running query: ts:pig
Got 0 hits.

Running query: contents:pig
Got 1 hits.
 ts='1236630752', contents='cow pig goat'

Running query: ts:[1236630700 TO 1236630800]
Got 0 hits.

Running query: ts:*
Got 4 hits.
 ts='1236917100', contents='dog cat gerbil'
 ts='1236630752', contents='cow pig goat'
 ts='1235680249', contents='lion tiger bear'
 ts='nonnumeric', contents='bass goby trout'

Running query: nonnumeric
Got 1 hits.
 ts='nonnumeric', contents='bass goby trout'

Running query: ts:nonnumeric
Got 1 hits.
 ts='nonnumeric', contents='bass goby trout'

Running query: trout
Got 1 hits.
 ts='nonnumeric', contents='bass goby trout'
Upvotes: 4
Views: 7337
Reputation: 1
I was able to get text and numbers pretty readily by using Zend/Search/Lucene/Analysis/Analyzer/Common/TextNum.php as the default analyzer (use ::setDefault(...) as described above).
My problem is that I was trying to index a large set of software and hardware with a long history and many version numbers. Zend Search Lucene was not tokenizing "words" like "1.5.3" or anything with a dot (IP addresses, e.g.), underscore or hyphen.
I first made a copy of TextNum.php, renamed it TextNumSSC.php (SSC is our application name) and tried editing the RegEx:
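For readers who don't have the other answer at hand, a minimal sketch of switching the default analyzer might look like this. It assumes Zend Framework 1.x is on the include path and uses the case-insensitive TextNum variant that ships with the framework. The analyzer should be set before both indexing and searching so that both sides tokenize the same way.

```php
<?php
// Assumption: Zend Framework 1.x is on the include path.
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Analysis/Analyzer/Common/TextNum/CaseInsensitive.php';

// Treat runs of letters AND digits as tokens, lowercased.
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive()
);
```

After this call, query strings passed to find() are tokenized with the same analyzer.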
do {
    if (! preg_match('/[a-zA-Z0-9.-_]+/', $this->_input, $match, PREG_OFFSET_CAPTURE, $this->_position)) {
        // It covers both cases a) there are no matches (preg_match(...) === 0)
        // b) error occured (preg_match(...) === FALSE)
        return null;
    }
Still no luck.
Then I installed http://codefury.net/projects/StandardAnalyzer/ in the way instructed, outside the Zend directory structure, changed the RegEx to
'/[a-zA-Z0-9.-_]+/'
and now it works.
I'm not sure of the root cause, but I couldn't find anything on SO or the web that addresses this dot issue.
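One detail of that RegEx is worth checking, although it does not by itself explain the dot problem: inside a character class, ".-_" is read as a range from "." to "_", so the hyphen is not matched literally. The sketch below uses a made-up sample string to compare the pattern as written with a variant that puts the hyphen last.

```php
<?php
// Hypothetical sample input, chosen to contain a dot, a hyphen and an underscore.
$sample = 'v1.5.3 my-host foo_bar';

// As written: ".-_" is a range from "." (0x2E) to "_" (0x5F).
// The hyphen (0x2D) falls outside that range, so it splits tokens.
preg_match_all('/[a-zA-Z0-9.-_]+/', $sample, $m);
$asWritten = $m[0];

// Hyphen placed last in the class, where it is taken literally.
preg_match_all('/[a-zA-Z0-9._-]+/', $sample, $m);
$hyphenLast = $m[0];

print_r($asWritten);   // v1.5.3, my, host, foo_bar
print_r($hyphenLast);  // v1.5.3, my-host, foo_bar
```

The range also lets through characters such as ":", "/" and "@", which may or may not be wanted in a token.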
Upvotes: 0
Reputation: 39583
The find() method tokenizes the query string, and with the default analyzer your numbers are effectively dropped. If you want to search for a number, you have to either construct the query through the API or use an alternate analyzer that keeps numeric values.
http://framework.zend.com/manual/en/zend.search.lucene.searching.html
It is important to note that the query parser uses the standard analyzer to tokenize separate parts of query string. Thus all transformations which are applied to indexed text are also applied to query strings.
The standard analyzer may transform the query string to lower case for case-insensitivity, remove stop-words, and stem among other transformations.
The API method doesn't transform or filter input terms in any way. It's therefore more suitable for computer generated or untokenized fields.
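As a rough sketch of the API route, the following builds an exact-term query and a range query on the keyword field without going through the query parser. It assumes Zend Framework 1.x is on the include path; the temporary index path is made up for the example. Range bounds are compared as strings, which works here only because all the timestamps have the same number of digits.

```php
<?php
// Assumption: Zend Framework 1.x is on the include path.
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Document.php';
require_once 'Zend/Search/Lucene/Index/Term.php';
require_once 'Zend/Search/Lucene/Search/Query/Term.php';
require_once 'Zend/Search/Lucene/Search/Query/Range.php';

// Throwaway index in the temp directory (hypothetical path).
$path  = sys_get_temp_dir() . '/zsl_api_query_' . uniqid();
$index = Zend_Search_Lucene::create($path);

// Same "ts" values as in the question, stored as untokenized keywords.
foreach (array('1236917100', '1236630752', '1235680249', 'nonnumeric') as $ts) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::Keyword('ts', $ts));
    $index->addDocument($doc);
}
$index->commit();

// Exact match: the term text is used as-is, no analyzer involved.
$exactQuery = new Zend_Search_Lucene_Search_Query_Term(
    new Zend_Search_Lucene_Index_Term('1236630752', 'ts')
);
$exactHits = $index->find($exactQuery);

// Inclusive range, the API counterpart of ts:[1236630700 TO 1236630800].
$rangeQuery = new Zend_Search_Lucene_Search_Query_Range(
    new Zend_Search_Lucene_Index_Term('1236630700', 'ts'),
    new Zend_Search_Lucene_Index_Term('1236630800', 'ts'),
    true
);
$rangeHits = $index->find($rangeQuery);

echo 'exact: ' . count($exactHits) . ', range: ' . count($rangeHits) . "\n";
```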
Upvotes: 4
Reputation: 14458
I'm used to using Lucene under Java, so I can't tell whether your code is correct, but it seems the field is being tokenized in a way that strips out anything except [a-zA-Z].
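One way to test that guess from PHP is to run a sample string through the analyzers directly and look at the tokens that come out. This is only a sketch: it assumes Zend Framework 1.x is on the include path, and token_texts() is a helper invented for the example.

```php
<?php
// Assumption: Zend Framework 1.x is on the include path.
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Analysis/Analyzer/Common/Text/CaseInsensitive.php';
require_once 'Zend/Search/Lucene/Analysis/Analyzer/Common/TextNum/CaseInsensitive.php';

// Hypothetical helper: returns the term texts an analyzer produces for $text.
function token_texts(Zend_Search_Lucene_Analysis_Analyzer $analyzer, $text) {
    $texts = array();
    foreach ($analyzer->tokenize($text, 'UTF-8') as $token) {
        $texts[] = $token->getTermText();
    }
    return $texts;
}

// Letters-only analyzer versus the letters-and-digits analyzer.
$lettersOnly = new Zend_Search_Lucene_Analysis_Analyzer_Common_Text_CaseInsensitive();
$withDigits  = new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive();

print_r(token_texts($lettersOnly, '1236630752 Pig'));
print_r(token_texts($withDigits, '1236630752 Pig'));
```

If the letters-only analyzer returns no token for the number, a query such as "ts:1236630752" has nothing left to match after parsing.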
It may help shed light on the situation to use an index explorer tool like http://www.getopt.org/luke/ to see exactly what is in the index.
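If installing a separate tool is inconvenient, a rough PHP-side substitute is to list the terms stored in the index through the terms() method. The sketch below assumes Zend Framework 1.x is on the include path; list_index_terms() is a helper name made up for the example.

```php
<?php
// Assumption: Zend Framework 1.x is on the include path.
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Document.php';

// Hypothetical helper: returns every indexed term as a "field:text" string.
function list_index_terms($index) {
    $lines = array();
    foreach ($index->terms() as $term) {
        $lines[] = $term->field . ':' . $term->text;
    }
    return $lines;
}
```

Calling it as list_index_terms(Zend_Search_Lucene::open('test_search_index')) on the index from the question should show whether the numeric "ts" values were stored as terms.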
Upvotes: 2