pietrop

Reputation: 1140

Solr: how to specify a schema during JSON and CSV import?

I'm new to Solr and am testing its features. I come from the RDBMS world and wonder how Solr would perform with my data.

I created a new core:

$ bin/solr create -c test

and successfully loaded a JSON file using:

$ bin/post -c test file.json

The first record of file.json looks like this:

{"attr":"01234"}

but Solr stores it as:

{"attr":1234}

I began defining a Data Import Handler following this tutorial (YouTube video) in order to store my data correctly, and found that DIH can't process JSON. I'm stuck at the definition of data-config.xml: the tutorial handles XML files with the XPathEntityProcessor, but I can't find a JSON or even a CSV processor (I can easily produce a CSV version of file.json, so loading CSV or JSON is the same to me). The official documentation is a bit of a mess and doesn't provide many useful examples. The only processors that might handle JSON and CSV documents are LineEntityProcessor and PlainTextEntityProcessor (Official Documentation).

This other link from the Solr Wiki states:

Goals

...

Make it possible to plugin any kind of datasource (ftp,scp etc) and any other format of user choice (JSON,csv etc)

so I guess it is really possible, but HOW?

I found a similar question posted in 2014 that no one answered here, so I was wondering whether in 2016, with the newer versions of Solr, there is a well-known solution to this problem.

So the question is: how do I import JSON and CSV documents using a specific data schema?

UPDATE

Executing http://localhost:8983/solr/test/dihupdate?command=full-import doesn't trigger any error but doesn't load any documents. Here are the relevant XML files located in the core directory:

solrconfig.xml

...
<schemaFactory class="ClassicIndexSchemaFactory" />
...
<requestHandler name="/dihupdate" class="org.apache.solr.handler.dataimport.DataImportHandler" startup="lazy">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>
...

schema.xml

...
<field name="id" type="long" indexed="true" stored="true" required="true" multiValued="false" />
<field name="attr1" type="string" indexed="true" stored="true" required="true" multiValued="true" />
<field name="date" type="date" indexed="true" stored="false" multiValued="true" />
<field name="attr2" type="string" indexed="true" stored="true" multiValued="true" />
<field name="attr3" type="string" indexed="true" stored="true" multiValued="true" />
<field name="attr4" type="int" indexed="false" stored="true" multiValued="true" />
<uniqueKey>id</uniqueKey>
...

data-config.xml

<dataConfig>
    <dataSource type="FileDataSource" />
    <document>
        <entity name="f" processor="FileListEntityProcessor"
                fileName="test.json"
                rootEntity="false"
                dataSource="null"
                recursive="true"
                baseDir="/path/to/data/"/>
    </document>
</dataConfig>

Upvotes: 3

Views: 3501

Answers (2)

Alexandre Rafalovitch

Reputation: 9789

In the Solr distribution, there is a films example (in example/films) that shows how to index JSON and takes advantage of both exact field definitions and type auto-detection. The instructions (README.txt) also show the results you will see if you forget one of the steps.

I suggest you experiment with that and then apply that knowledge to your own use case.

Upvotes: 2

MatsLindh

Reputation: 52792

One way to define the schema is in schema.xml in your conf directory - this is the traditional way of setting up the expected format for documents (Defining Fields). If you're using the "Managed Schema" mode, which is the current default, you'll have to switch to the classic schema factory. You can then define the fields in your schema.xml by following the example schema, or any resource available on the web that describes how the schema.xml file is structured (you define a field type and then fields that use that field type).

The other option is the managed schema - this is the default in the most recent releases, and this schema is manipulated through the API that Solr offers. On startup it reads the initial schema from schema.xml (if present), but after that you'll have to modify it through the API or the Admin interface. This API is described (with examples) at the Schema API page in the Solr guide.
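As a rough sketch of the managed-schema route, the snippet below builds a Schema API `add-field` command and posts it as JSON. The core name `test` and port 8983 come from the question; the field name `attr` matches the sample record; the helper names `build_add_field` and `post_schema_command` are made up for illustration. It assumes a running Solr with a managed (not classic) schema.

```python
import json
import urllib.request

# Assumed values: core "test" and the default port are taken from the question.
SOLR_SCHEMA_URL = "http://localhost:8983/solr/test/schema"


def build_add_field(name, field_type, stored=True, indexed=True):
    """Build a Schema API 'add-field' command as a plain dict (hypothetical helper)."""
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "stored": stored,
            "indexed": indexed,
        }
    }


def post_schema_command(command, url=SOLR_SCHEMA_URL):
    """POST the command as JSON; this needs a running Solr with a managed schema."""
    request = urllib.request.Request(
        url,
        data=json.dumps(command).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Declare "attr" as a string field before indexing any documents.
command = build_add_field("attr", "string")
print(json.dumps(command, indent=2))
# To apply it against a live core: post_schema_command(command)
```

If field guessing has already created `attr` with a numeric type, `replace-field` is likely the command to use instead of `add-field`, and the documents have to be reindexed afterwards.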

Using a StrField (which is what the field type string uses) to store 012345 would result in Solr storing just the literal value, 012345, without converting it to an integer. That's probably a good place to start.
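To see why the declared type matters, here is a small illustration in plain Python. It is not Solr code; `guess_value` is a hypothetical stand-in for type auto-detection, and `keep_as_string` for what a string field does with the question's value `01234`.

```python
def guess_value(raw):
    """Mimic naive type guessing: try an integer first, otherwise keep the text."""
    try:
        return int(raw)
    except ValueError:
        return raw


def keep_as_string(raw):
    """Mimic a string field: store the literal value unchanged."""
    return raw


print(guess_value("01234"))     # 1234 (leading zero lost)
print(keep_as_string("01234"))  # 01234
```

The first result matches what the question reports Solr storing; the second is the behaviour an explicit string field is meant to give.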

Upvotes: 2
