Reputation:
In Riak, I have this basic user schema with an accompanying user index (I've omitted the Riak-specific fields like _yz_id etc.):
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="user" version="1.5">
  <fields>
    <field name="email" type="string" indexed="true" stored="false"/>
    <field name="name" type="string" indexed="true" stored="false"/>
    <field name="groups" type="string" indexed="true" stored="false" multiValued="true"/>
    <dynamicField name="*" type="ignored" indexed="false" stored="false" multiValued="true"/>
    <!-- ..Riak-specific fields.. -->
  </fields>
  <uniqueKey>_yz_id</uniqueKey>
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <fieldType name="_yz_str" class="solr.StrField" sortMissingLast="true"/>
    <fieldType name="ignored" class="solr.StrField"/>
  </types>
</schema>
My user JSON looks like this:
{
  "name": "John Smith",
  "email": "[email protected]",
  "groups": [
    "3304cf79",
    "abe155cf"
  ]
}
When I attempt to search using this query:
curl "http://localhost:10018/search/query/user?wt=json&q=groups:3304cf79"
I get no docs back.
Why is this? Is the JSON extractor creating index entries for the groups?
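For reference, here is the equivalent request issued from Python (a minimal sketch using the requests library against the same devrel port), so shell quoting of the & is not a factor:
import requests

# Minimal sketch: the same Yokozuna query via Python's requests library,
# which URL-encodes the query parameters itself (no shell quoting of &).
resp = requests.get(
    "http://localhost:10018/search/query/user",
    params={"wt": "json", "q": "groups:3304cf79"},
)
print(resp.json())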
Upvotes: 0
Views: 111
Reputation: 1163
How about this? You can extract all the fields at once; it's generic:
import json
import numpy as np
import pandas as pd
from jsonpath_ng import parse

def explode_list(df, col):
    # Repeat each row once per element of the list in `col`,
    # then replace the column with the flattened values.
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.len())
    return df.iloc[i].assign(**{col: np.concatenate(s)})

def process_json_data(data_file, mapping_file, root):
    # Load the JSON data
    with open(data_file) as f:
        data = json.load(f)
    # Load the mapping
    with open(mapping_file) as f:
        mapping = json.load(f)
    # Prepare an empty dataframe to hold the results
    df = pd.DataFrame()
    # Iterate over each datapoint in the data file
    for datapoint in data[root]:
        # Prepare an empty dictionary to hold the results for this datapoint
        datapoint_dict = {}
        # Iterate over each field in the mapping file
        for field, path in mapping.items():
            # Prepare the JSONPath expression
            jsonpath_expr = parse(path)
            # Find all matches in the datapoint
            match = jsonpath_expr.find(datapoint)
            if match:
                # If matches were found, add their values to the dictionary
                datapoint_dict[field] = [m.value for m in match]
            else:
                # If no match was found, add 'no path' to the dictionary
                datapoint_dict[field] = ['no path']
        # Create a temporary dataframe for this datapoint
        frames = [pd.DataFrame({k: np.repeat(v, max(map(len, datapoint_dict.values())))})
                  for k, v in datapoint_dict.items()]
        temp_df = pd.concat(frames, axis=1)
        # Identify list-like columns and explode them
        while True:
            list_cols = [col for col in temp_df.columns
                         if any(isinstance(i, list) for i in temp_df[col])]
            if not list_cols:
                break
            for col in list_cols:
                temp_df = explode_list(temp_df, col)
        # Append the temporary dataframe to the main dataframe
        # (DataFrame.append was removed in pandas 2.x, so use pd.concat)
        df = pd.concat([df, temp_df])
    df.reset_index(drop=True, inplace=True)
    return df.style.set_properties(**{'border': '1px solid black'})

# Calling the function
df = process_json_data('/content/jsonShredd/data.json', '/content/jsonShredd/mapping.json', 'datapoints')
df
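The mapping file is just column-name to JSONPath pairs; a hypothetical example of what mapping.json could look like for a user document like the one above (column names invented for illustration):
import json

# Hypothetical mapping.json contents: each key becomes a DataFrame column,
# each value is a JSONPath expression evaluated against every element
# under the chosen root key (here 'datapoints').
example_mapping = {
    "user_name": "$.name",
    "user_email": "$.email",
    "group_id": "$.groups[*]",
}

with open("mapping.json", "w") as f:
    json.dump(example_mapping, f, indent=2)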
Upvotes: -1
Reputation:
The schema is correct. The issue was that it was not the schema originally in place when I set the bucket properties. This issue on the Yokozuna GitHub was the culprit: I updated the schema after inserting new data, thinking that the indexes would reload. Currently, they do not.
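For anyone hitting the same thing, a minimal sketch of picking up a changed schema, assuming Riak 2.x's standard HTTP API on the devrel port used above (the schema/index names and the local file name are illustrative):
import requests

# Minimal sketch, assuming Riak 2.x's HTTP API on the devrel port above.
base = "http://localhost:10018"

# Upload the corrected schema.
with open("user_schema.xml", "rb") as f:
    requests.put(f"{base}/search/schema/user",
                 headers={"Content-Type": "application/xml"},
                 data=f.read())

# Re-create the index so it is built against the new schema; an existing
# index does not reload a changed schema, and already-stored objects are
# only re-indexed when they are written again.
# (If the index is still associated with a bucket or bucket type, that
# association may need to be removed before the delete succeeds.)
requests.delete(f"{base}/search/index/user")
requests.put(f"{base}/search/index/user",
             headers={"Content-Type": "application/json"},
             json={"schema": "user"})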
Upvotes: 0