Transform one field into multiple fields in Solr

Question

I am trying to index data from a Nutch 1.16 crawl into Solr, but some fields either contain redundant data (e.g. "metatag.author":["someone","someone"]) or mash every metadata field into one (e.g. "content":["Raro Bueno Raro Bueno Chuzausen Awesome Is Grey, track 6, disc 0/0 2013-08-17T22:40:55 electronic 30014.912 "]).

Is there a command, either before indexing or preferably after the data has been indexed, to split the "content" field into separate, equally important fields (e.g. metatag.author, track_number and album as independent fields), or at least to display the elements of "content" under their own tags, something like:

"content":{
   "track_number":["..."],
   "album":[...],
   "tags":[..],
   ...},
...

Sebastian Nagel · Accepted Answer

Nutch provides the plugin "index-metadata", which adds arbitrary fields from the parse or content metadata to the indexed documents. The MP3 files are parsed by the plugin "parse-tika", which already fills multiple fields in the parse metadata:

$> bin/nutch parsechecker -Dplugin.includes='protocol-file|parse-tika' \
    file:/.../RainDogs.mp3
...
contentType: audio/mpeg
...
Status: success(1,0)
Title: Rain Dogs
Outlinks: 0
Content Metadata: Last-Modified=Sat, 07 Aug 2010 11:53:42 GMT Content-Length=4250145 nutch.crawl.score=0.0 Content-Type=audio/mpeg 
Parse Metadata: xmpDM:genre= creator=Tom Waits xmpDM:album=Rain Dogs xmpDM:trackNumber=10 xmpDM:releaseDate=1985 meta:author=Tom Waits xmpDM:artist=Tom Waits dc:creator=Tom Waits xmpDM:audioCompressor=MP3 xmpDM:audioChannelType=Stereo version=MPEG 3 Layer III Version 1 xmpDM:logComment= xmpDM:audioSampleRate=44100 channels=2 dc:title=Rain Dogs Author=Tom Waits xmpDM:duration=177093.546875 Content-Type=audio/mpeg samplerate=44100

Now you can select any of these fields and add them to the index. First, I would test the settings with the tool "indexchecker":

$> bin/nutch indexchecker \
    -Dplugin.includes='protocol-file|parse-tika|index-(basic|metadata)' \
    -Dindex.parse.md='creator,xmpDM:album' \
    file:/.../RainDogs.mp3 
contentType: audio/mpeg
creator :       Tom Waits
xmpDM:album :   Rain Dogs
tstamp :        Sun Apr 05 13:12:51 CEST 2020
digest :        0ff28956642335818afc7f00b5420e93
host :
id :    file:/mnt/data/wastl/private2/musik/player_sync/rock/Tom Waits - Rain Dogs/10 - Tom Waits - Rain Dog
title : Rain Dogs
url :   file:/mnt/data/wastl/private2/musik/player_sync/rock/Tom Waits - Rain Dogs/10 - Tom Waits - Rain Dog
content :       Rain Dogs
Rain Dogs
Tom Waits
Rain Dogs, track 10
1985
177093.55

After that, you need to transfer the configuration properties to nutch-site.xml and possibly also adapt the Solr schema.
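
As a minimal sketch, the two settings tested on the command line could be carried over to nutch-site.xml roughly as follows. The property names are plugin.includes (as it is spelled in nutch-default.xml) and index.parse.md; the values simply repeat the indexchecker test, so the plugin list is an assumption and almost certainly too short for a real crawl:

```xml
<!-- conf/nutch-site.xml (fragment): overrides of nutch-default.xml -->
<configuration>
  <!-- Active plugins. This value only mirrors the indexchecker test;
       a real crawl typically needs further plugins (assumption: for
       example URL filters and the Solr index writer, indexer-solr). -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-file|parse-tika|index-(basic|metadata)</value>
  </property>

  <!-- Parse metadata keys that index-metadata copies into the document. -->
  <property>
    <name>index.parse.md</name>
    <value>creator,xmpDM:album</value>
  </property>
</configuration>
```

With these properties in place, running indexchecker again without the -D options should list the same fields. Each added field (here creator and xmpDM:album) will probably also need a matching field definition in the Solr schema, unless the schema already accepts unknown fields.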

The field "content" can still be useful for feeding a single search box, especially in cases where the individual fields are not filled correctly. Also consider situations with multiple authors (music, lyrics, arrangement) and performers (solo, vocals, conductor, etc.).
