Solr cannot search for Nutch-crawled entries, despite fields being marked as indexed=true

Question

I'm running a Nutch 1.16 crawler and Solr 8.3.0. I have been able to crawl files in a local directory and, after editing nutch-site.xml, extract some metadata from them (albeit not as much as I wished for) by running

    bin/crawl -s urls dircrawl 2 >& dircrawl.log

The crawled data is then sent to Solr via

    bin/nutch index dircrawl/crawldb/ -linkdb dircrawl/linkdb/ -dir dircrawl/segments/ -filter -normalize

where the entries are then stored and managed via their tags.

Now I'm trying to search for the data from the Solr Admin UI. I made sure to mark all the fields I am interested in as indexed=true. However, any search other than *:* returns zero results. I have tried every combination of search fields I could think of, with no luck either. Below are my config files, first for Solr, then for Nutch.

schema.xml (becomes managed-schema when running it, for some reason)



  id
  
  
    
      
    
    
      
    
  
  (all fieldTypes are the default ones)

Then nutch-site.xml:

<configuration>

<property>
  <name>http.agent.name</name>
  <value>NutchSpiderTest</value>
</property>

<property>
  <name>http.robots.agents</name>
  <value>NutchSpiderTest,*</value>
  <description>...</description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(html|tika|metatags|text)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>...</description>
</property>

<property>
  <name>file.content.limit</name>
  <value>-1</value>
  <description>Needed to stop buffer overflow errors - Unable to read.....</description>
</property>

<property>
  <name>file.crawl.parent</name>
  <value>false</value>
  <description>The crawler is not restricted to the directories that you specified in the
    Urls file but it is jumping into the parent directories as well. For your own crawlings you can
    change this behavior (set to false) the way that only directories beneath the directories that you specify get
    crawled.
  </description>
</property>

<property>
  <name>parser.skip.truncated</name>
  <value>false</value>
  <description>Boolean value for whether we should skip parsing for truncated documents. By default this
    property is activated due to extremely high levels of CPU which parsing can sometimes take.
  </description>
</property>

<property>
  <name>metatags.names</name>
  <value>*</value>
  <description>...</description>
</property>

<property>
  <name>index.parse.md</name>
  <value>metatag.description,metatag.keywords,metatag.author,metatag.channels,metatag.content_encoding,metatag.content_type,metatag.creator,metatag.dc_creator,metatag.dc_title,metatag.id,metatag.meta_author,metatag.samplerate,metatag.stream_content_type,metatag.stream_name,metatag.stream_size,metatag.stream_source_info,metatag.title,metatag.version,metatag.x_parsed_by,metatag.xmpdm_album,metatag.album,metatag.xmpdm_albumartist,metatag.albumartist,metatag.xmpdm_artist,metatag.artist,metatag.xmpdm_audiochanneltype,metatag.audiochanneltype,metatag.xmpdm_audiocompressor,metatag.audiocompressor,metatag.xmpdm_audiosamplerate,metatag.audiosamplerate,metatag.xmpdm_composer,metatag.composer,metatag.xmpdm_discnumber,metatag.discnumber,metatag.xmpdm_duration,metatag.duration,metatag.xmpdm_genre,metatag.genre,metatag.xmpdm_releasedate,metatag.releasedate,metatag.xmpdm_tracknumber,metatag.tracknumber,metatag.copyright,author,Genre</value>
  <description>Comma-separated list of keys to be taken from the parse metadata to generate fields.
  Can be used e.g. for 'description' or 'keywords' provided that these values are generated
  by a parser (see parse-metatags plugin)
  </description>
</property>

</configuration>

Results of running a query for "*:*":

{
  "responseHeader":{
    ...,
    "params":{
      "q":"*:*",
      "_":"..."}},
  "response":{"numFound":24,"start":0,"docs":[
      {...

Response of running any other kind of query:

{
  "responseHeader":{
    ...
    "params":{
      "q":"Bumblebee",
      "_":"..."}},
  "response":{"numFound":0,"start":0,"docs":[]
  }}

Additionally, the data I'm trying to index is various .mp3 files from the Free Music Archive.

edit: the files I'm trying to search for look like this:

  {
        "metatag.author":["A Kombi",
          "A Kombi"],
        "metatag.samplerate":[44100,
          44100],
        "title":["Plight Of The Bumblebee"],
        "url":["file:/c:/Users/.../fma/fma_small/009/009476.mp3"],
        "content":["Plight Of The Bumblebee
Plight Of The Bumblebee
A Kombi
Music to Drive By, track 2
2004-09-14T00:00:00
Field Recordings
30014.912
"],
        "metatag.creator":["A Kombi",
          "A Kombi"],
        "tstamp":["2020-04-02T15:26:29.507Z"],
        "digest":["ddd4ab2288c5799a5646592e1a63437f"],
        "boost":[0.20851442],
        "id":"file:/c:/Users/.../fma/fma_small/009/009476.mp3",
        "metatag.version":["MPEG 3 Layer III Version 1",
          "MPEG 3 Layer III Version 1"],
        "metatag.channels":[2,
          2],
        "_version_":1662875102548590596}

MatsLindh · Accepted Answer

You have to specify which field you want to search against, unless you have a default search field configured. In older versions of Solr this could be configured in schema.xml, but the recommended method is to set it in the query itself, for example with a field-qualified query such as title:Bumblebee or with the df (default field) parameter.
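As a minimal sketch of setting the field in the query itself, the snippet below builds two query strings: one that names the field inside q, and one that keeps the bare term and passes Solr's df (default field) parameter. Choosing title as the field is an assumption based on the sample document in the question; nothing is sent to Solr here.

```python
from urllib.parse import urlencode

# Option 1: qualify the term with a field name inside the query.
# "title" is assumed from the sample document shown in the question.
field_query = urlencode({"q": "title:Bumblebee"})

# Option 2: keep the bare term and name the default field with df.
df_query = urlencode({"q": "Bumblebee", "df": "title"})

print(field_query)  # q=title%3ABumblebee
print(df_query)     # q=Bumblebee&df=title
```

Either string can be appended to the core's /select endpoint after a "?".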

However, to support free text search, it's far better to use the edismax query parser by supplying defType=edismax and then setting which fields you want to search through the qf (query fields) parameter.

q=Bumblebee&qf=title&defType=edismax

.. will search for Bumblebee in the title field. You can also give multiple field names to qf, and also adjust the weights given to each:

qf=title^10 content

.. which will search in both title and content, and give ten times more weight to any hits in the title field compared to a hit in the content field.
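The weighted qf value can also be assembled programmatically. The following is only a sketch: edismax_params is a hypothetical helper, not part of any Solr client library, and it just builds the parameter dictionary and its URL-encoded form.

```python
from urllib.parse import urlencode

def edismax_params(term, weighted_fields):
    """Build edismax parameters (hypothetical helper).

    weighted_fields maps a field name to a boost, or to None for no boost.
    """
    qf = " ".join(
        name if boost is None else f"{name}^{boost}"
        for name, boost in weighted_fields.items()
    )
    return {"q": term, "defType": "edismax", "qf": qf}

params = edismax_params("Bumblebee", {"title": 10, "content": None})
query_string = urlencode(params)

print(params["qf"])   # title^10 content
print(query_string)   # q=Bumblebee&defType=edismax&qf=title%5E10+content
```

URL encoding turns the caret into %5E and the space into +, which Solr decodes back to title^10 content.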

The fl (field list) parameter adjusts which fields are being returned in the response, which is useful if you only need a small subset of the available fields (such as just the id) to avoid a larger response and having to load all the field values from disk for each document returned.
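Putting the parameters together, a full request URL might be built as in the sketch below. The host, port, and core name "nutch" in SOLR_SELECT_URL are assumptions for illustration, as is the build_select_url helper; substitute the values of your own installation.

```python
from urllib.parse import urlencode

# Assumed endpoint: host, port, and the core name "nutch" are placeholders.
SOLR_SELECT_URL = "http://localhost:8983/solr/nutch/select"

def build_select_url(term, query_fields, return_fields):
    """Build a /select URL that searches query_fields and returns return_fields."""
    params = {
        "q": term,
        "defType": "edismax",
        "qf": " ".join(query_fields),    # fields to search, optionally boosted
        "fl": ",".join(return_fields),   # fields to include in each result
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)

url = build_select_url("Bumblebee", ["title^10", "content"], ["id", "title"])
print(url)
```

With fl=id,title the response carries only those two fields per document instead of the whole stored record.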
