ash

Reputation: 73

Index JSON filename along with JSON content in Solr

I have two directories: one with txt files and the other with the corresponding JSON (metadata) files (around 90,000 of each). There is one JSON file for each txt file, and they share the same name (but no other fields). I am trying to index all these files in Apache Solr.

The txt files just contain plain text; I mapped each line to a field called 'sentence' and included the file name as a field using the Data Import Handler. No problems here.

The JSON files contain metadata with three tags: a URL, an author, and a title (for the content of the corresponding txt file). When I index a JSON file (I just used the _default schema and posted the fields to the schema, as explained in the official Solr tutorial), I don't know how to get the file name into the index as a field. As far as I know, there's no way to use the Data Import Handler for JSON files. I've read that I can pass a literal through the bin/post tool, but again, as far as I understand, I can't pass the file name in dynamically as a literal.

I NEED to get the file name; it is the only way I can associate the metadata with each sentence in the txt files in my downstream Python code.

So if anybody has a suggestion about how I should index the JSON file name along with the JSON content (or even some workaround), I'd be eternally grateful.
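For context, the txt-file side produces records shaped roughly like this. The field names 'sentence' and 'filename' match my setup, but the helper function below is just an illustrative sketch of the result, not the actual DIH configuration:

```python
import os

def txt_to_records(filepath):
    # Illustrative: each line of the txt file becomes one record with a
    # 'sentence' field, plus the file's name (without extension) so it can
    # later be matched to the JSON metadata file of the same name.
    filename = os.path.splitext(os.path.basename(filepath))[0]
    with open(filepath, 'r') as f:
        return [{'sentence': line.rstrip('\n'), 'filename': filename}
                for line in f]
```
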

Upvotes: 0

Views: 1074

Answers (1)

ash

Reputation: 73

As @MatsLindh mentioned in the comments, I used pysolr to do the indexing and capture the filename. It's pretty basic, but I thought I'd post what I did, as pysolr doesn't have much documentation.

So, here's how you use pysolr to index multiple JSON files while also indexing each file's name. This method is useful if you have data files and metadata files that share a filename (but have different extensions) and you want to link them together, as in my case.

  • Open a connection to your Solr instance using the pysolr.Solr command.
  • Loop through the directory containing your files, and get the filename of each file using os.path.basename and store it in a variable (after removing the extension, if necessary).
  • Read the file's JSON content into another variable.
  • Pysolr expects whatever is to be indexed to be stored in a list of dictionaries where each dictionary corresponds to one record.
  • Store all the fields you want to index in a dictionary (solr_content in my code below) while making sure the keys match the field names in your managed-schema file.
  • Append the dictionary created in each iteration to a list (list_for_solr in my code).
  • Outside the loop, use the solr.add command to send your list of dictionaries to be indexed in Solr.
  • That's all there is to it! Here's the code.

    import json
    import os
    from glob import iglob

    import pysolr

    solr = pysolr.Solr('http://localhost:8983/solr/collection_name')
    folderpath = 'directory-where-the-files-are-present'
    list_for_solr = []
    for filepath in iglob(os.path.join(folderpath, '*.meta')):
        with open(filepath, 'r') as file:
            filename = os.path.basename(filepath)
            # filename is of the form xxxx.yyyy.meta
            filename_without_extension = '.'.join(filename.split('.')[:2])
            content = json.load(file)
        solr_content = {}
        solr_content['authors'] = content['authors']
        solr_content['title'] = content['title']
        solr_content['url'] = content['url']
        solr_content['filename'] = filename_without_extension
        list_for_solr.append(solr_content)
    solr.add(list_for_solr)
    # pysolr does not commit by default; commit so the documents become searchable
    solr.commit()
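
Downstream, the filename field is what ties each sentence back to its metadata. A minimal sketch of that join in plain Python (the record shapes here are assumptions based on my setup, not anything pysolr returns):

```python
def attach_metadata(sentence_records, metadata_records):
    # Build a lookup from filename to its metadata record, then merge that
    # metadata into every sentence record carrying the same filename.
    meta_by_file = {m['filename']: m for m in metadata_records}
    return [{**s, **meta_by_file.get(s['filename'], {})}
            for s in sentence_records]
```

Sentences whose filename has no matching metadata record simply pass through unchanged.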
    

Upvotes: 1
