Reputation: 73
I have two directories: one with txt files and the other with the corresponding JSON (metadata) files, around 90,000 of each. There is one JSON file for each txt file, and the two share the same name (they don't share any other fields). I am trying to index all of these files in Apache Solr.
The txt files contain just plain text. Using the Data Import Handler, I mapped each line to a field called 'sentence' and included the file name as a field. No problems here.
Each JSON file holds metadata with three fields: a URL, an author and a title (for the content in the corresponding txt file). When I index the JSON files (I just used the _default schema and posted the fields to the schema, as explained in the official Solr tutorial), I don't know how to get the file name into the index as a field. As far as I know, there's no way to use the Data Import Handler for JSON files. I've read that I can pass a literal through the bin/post tool, but again, as far as I understand, I can't pass in the file name dynamically as a literal.
I NEED to get the file name: it is the only way I can associate the metadata with each sentence from the txt files in my downstream Python code.
So if anybody has a suggestion about how I should index the JSON file name along with the JSON content (or even some workaround), I'd be eternally grateful.
Upvotes: 0
Views: 1074
Reputation: 73
Following @MatsLindh's suggestion in the comments, I used Pysolr to do the indexing and get the filename. It's pretty basic, but I thought I'd post what I did, as Pysolr doesn't have much documentation.
So, here's how you use Pysolr to index multiple JSON files while also indexing each file's name. This method works if your content files and your metadata files share the same filename (but have different extensions) and you want to link them together somehow, like in my case.
The idea is to loop over the metadata files, load each one as JSON, add the file name (minus the extension) as an extra field, and send all the documents to Solr in one call. That's all there is to it! Here's the code.
    import json
    import os
    from glob import iglob

    import pysolr

    solr = pysolr.Solr('http://localhost:8983/solr/collection_name')
    # Placeholder: replace with the directory that holds the .meta files
    folderpath = 'directory-where-the-files-are-present'
    list_for_solr = []
    for filepath in iglob(os.path.join(folderpath, '*.meta')):
        with open(filepath, 'r') as file:
            filename = os.path.basename(filepath)
            # filename is xxxx.yyyy.meta, so keep the first two dot-separated parts
            filename_without_extension = '.'.join(filename.split('.')[:2])
            content = json.load(file)
            solr_content = {}
            solr_content['authors'] = content['authors']
            solr_content['title'] = content['title']
            solr_content['url'] = content['url']
            solr_content['filename'] = filename_without_extension
            list_for_solr.append(solr_content)
    # Newer pysolr versions do not commit automatically; pass commit=True
    # or call solr.commit() afterwards if the documents don't show up.
    solr.add(list_for_solr)
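With around 90,000 files, it may be more comfortable to send the documents in smaller batches rather than as one big list. Below is a minimal sketch of that variation, not something I benchmarked. The helper names `shared_key`, `build_docs` and `batched`, and the batch size of 1000, are my own choices for illustration. Note that `shared_key` uses `os.path.splitext` to drop only the last extension, instead of splitting on dots as in the code above; for names like xxxx.yyyy.meta both give xxxx.yyyy.

```python
import json
import os
from glob import iglob


def shared_key(filepath):
    """Return the file name with only its last extension removed."""
    return os.path.splitext(os.path.basename(filepath))[0]


def build_docs(folderpath, pattern='*.meta'):
    """Yield one Solr-ready dict per metadata file, tagged with its file name."""
    for filepath in sorted(iglob(os.path.join(folderpath, pattern))):
        with open(filepath, 'r', encoding='utf-8') as file:
            content = json.load(file)
        # Copy only the three metadata fields used in the answer above
        doc = {field: content[field] for field in ('authors', 'title', 'url')}
        doc['filename'] = shared_key(filepath)
        yield doc


def batched(docs, size):
    """Group an iterable of docs into lists of at most `size` items."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Usage with the `solr` object from above (needs a running Solr instance):
# for batch in batched(build_docs(folderpath), 1000):
#     solr.add(batch)
```

Because `build_docs` is a generator, only one batch is held in memory at a time, and the helpers can be checked on a handful of files without Solr running.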
Upvotes: 1