Reputation: 1140
I'm new to Solr and I'm trying to test its functionality. I come from the RDBMS world and was wondering how Solr would perform with my data.
I created a new core:
$ bin/solr create -c test
and successfully loaded a JSON file using:
$ bin/post -c test file.json
The first record of file.json looks like this:
{"attr":"01234"}
but Solr stores it as:
{"attr":1234}
I began defining a Data Import Handler following this tutorial (YouTube video) in order to store my data correctly, and found that JSON can't be processed by DIH. I'm stuck at the definition of data-config.xml because the tutorial handles XML files using the XPathEntityProcessor, but I can't find a JSON or even a CSV processor (I can easily retrieve a CSV version of file.json, so loading a CSV or a JSON is the same for me). The official documentation is a bit of a mess and doesn't provide many useful examples. The only processors that might handle JSON and CSV documents are LineEntityProcessor and PlainTextEntityProcessor (Official Documentation).
This other link from the Solr Wiki states:
Goals
...
Make it possible to plugin any kind of datasource (ftp,scp etc) and any other format of user choice (JSON,csv etc)
so I guess it is really possible, but HOW?
I found a similar question posted in 2014 that no one answered here, so I was wondering whether in 2016, with the newer versions of Solr, there is a well-known solution to this problem.
So the question is: how to import JSON and CSV documents using a specific data schema?
Executing http://localhost:8983/solr/test/dihupdate?command=full-import doesn't trigger any error but doesn't load any documents. Here are the various XML files located in the core directory:
solrconfig.xml
...
<schemaFactory class="ClassicIndexSchemaFactory" />
...
<requestHandler name="/dihupdate" class="org.apache.solr.handler.dataimport.DataImportHandler" startup="lazy">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
...
schema.xml
...
<field name="id" type="long" indexed="true" stored="true" required="true" multiValued="false" />
<field name="attr1" type="string" indexed="true" stored="true" required="true" multiValued="true" />
<field name="date" type="date" indexed="true" stored="false" multiValued="true" />
<field name="attr2" type="string" indexed="true" stored="true" multiValued="true" />
<field name="attr3" type="string" indexed="true" stored="true" multiValued="true" />
<field name="attr4" type="int" indexed="false" stored="true" multiValued="true" />
<uniqueKey>id</uniqueKey>
...
data-config.xml
<dataConfig>
  <dataSource type="FileDataSource" />
  <document>
    <entity name="f" processor="FileListEntityProcessor"
            fileName="test.json"
            rootEntity="false"
            dataSource="null"
            recursive="true"
            baseDir="/path/to/data/"/>
  </document>
</dataConfig>
Upvotes: 3
Views: 3501
Reputation: 9789
In the Solr distribution, there is a films example (in example/films) that shows how to index JSON and takes advantage of both exact field definitions and type auto-detect. The instructions (README.txt) include the results you will see if you forget to do one of the steps as well.
I suggest you experiment with that and then apply that knowledge to your own use case.
Upvotes: 2
Reputation: 52792
One way to define the schema is in schema.xml in your conf directory - this is the traditional way of setting up the expected format for documents (Defining Fields). If you're using the "Managed Schema" mode, which is the current default, you'll have to switch to the classic schema factory first. You can then define the fields in your schema.xml by following the example schema, or any resource available on the web that describes how the schema.xml file is structured (you define a field type and then fields that use that field type).
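The switch to the classic schema factory is a small change in solrconfig.xml. A minimal sketch, assuming the core's conf directory already contains a schema.xml file (this is the same element the question's solrconfig.xml shows):

```xml
<!-- solrconfig.xml fragment: read the schema from conf/schema.xml
     instead of using the managed schema. -->
<schemaFactory class="ClassicIndexSchemaFactory" />
```

With this factory the schema file is edited by hand, and the core has to be reloaded (or Solr restarted) before changes take effect.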
The other option is the managed schema - this is the default in the most recent releases, and this schema is manipulated through the API that Solr offers. On startup it reads the initial schema from schema.xml (if present), but after that you'll have to modify it through the API or the Admin interface. This API is described (with examples) at the Schema API page in the Solr guide.
Using a StrField (which is what the field type string uses) to store 012345 would result in Solr storing just the literal value, 012345, without converting it to an integer. That's probably a good place to start.
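As a rough sketch of what that could look like in schema.xml: the field name attr is taken from the question's sample record, and the string type is assumed to be declared the way the stock example schemas declare it. Treat the attribute choices (indexed, stored) as assumptions to adapt, not as required values.

```xml
<!-- schema.xml fragment (inside the <schema> element). -->

<!-- Field type backed by StrField: values are kept verbatim,
     so a leading zero is not lost. -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" />

<!-- Field from the question's sample record {"attr":"01234"},
     declared with the string type above. -->
<field name="attr" type="string" indexed="true" stored="true" />
```

Documents that were indexed before the field was declared this way would need to be re-indexed for the change to show up in search results.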
Upvotes: 2