Reputation:
In Riak, I have this basic user schema with an accompanying user index (I've omitted the Riak-specific fields like _yz_id etc.):
<?xml version="1.0" encoding="UTF-8" ?>
<schema name="user" version="1.5">
  <fields>
    <field name="email" type="string" indexed="true" stored="false"/>
    <field name="name" type="string" indexed="true" stored="false"/>
    <field name="groups" type="string" indexed="true" stored="false" multiValued="true"/>
    <dynamicField name="*" type="ignored" indexed="false" stored="false" multiValued="true"/>
    <!-- ..Riak-specific fields.. -->
  </fields>
  <uniqueKey>_yz_id</uniqueKey>
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <fieldType name="_yz_str" class="solr.StrField" sortMissingLast="true"/>
    <fieldType name="ignored" class="solr.StrField"/>
  </types>
</schema>
My user JSON looks like this:
{
  "name": "John Smith",
  "email": "[email protected]",
  "groups": [
    "3304cf79",
    "abe155cf"
  ]
}
When I attempt to search using this query:
curl "http://localhost:10018/search/query/user?wt=json&q=groups:3304cf79"
I get no docs back.
Why is this? Is the JSON extractor creating index entries for the groups?
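For reference, here is the equivalent request issued from Python (a minimal sketch using the requests library against the same devrel port), so shell quoting of the & is not a factor:
import requests

# Minimal sketch: the same Yokozuna query via Python's requests library,
# which URL-encodes the query parameters itself (no shell quoting of &).
resp = requests.get(
    "http://localhost:10018/search/query/user",
    params={"wt": "json", "q": "groups:3304cf79"},
)
print(resp.json())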
Upvotes: 0
Views: 111
Reputation: 1163
How about this? You can extract all the fields at once; it's generic:
import json
import numpy as np
import pandas as pd
from jsonpath_ng import parse

def explode_list(df, col):
    # Repeat each row once per element of the list in `col`,
    # then replace the column with the flattened values.
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.len())
    return df.iloc[i].assign(**{col: np.concatenate(s)})

def process_json_data(data_file, mapping_file, root):
    # Load the JSON data
    with open(data_file) as f:
        data = json.load(f)
    # Load the mapping
    with open(mapping_file) as f:
        mapping = json.load(f)
    # Prepare an empty dataframe to hold the results
    df = pd.DataFrame()
    # Iterate over each datapoint in the data file
    for datapoint in data[root]:
        # Prepare an empty dictionary to hold the results for this datapoint
        datapoint_dict = {}
        # Iterate over each field in the mapping file
        for field, path in mapping.items():
            # Prepare the JSONPath expression
            jsonpath_expr = parse(path)
            # Find all matches in the datapoint
            match = jsonpath_expr.find(datapoint)
            if match:
                # If matches were found, add their values to the dictionary
                datapoint_dict[field] = [m.value for m in match]
            else:
                # If no match was found, add 'no path' to the dictionary
                datapoint_dict[field] = ['no path']
        # Create a temporary dataframe for this datapoint
        frames = [pd.DataFrame({k: np.repeat(v, max(map(len, datapoint_dict.values())))})
                  for k, v in datapoint_dict.items()]
        temp_df = pd.concat(frames, axis=1)
        # Identify list-like columns and explode them
        while True:
            list_cols = [col for col in temp_df.columns
                         if any(isinstance(i, list) for i in temp_df[col])]
            if not list_cols:
                break
            for col in list_cols:
                temp_df = explode_list(temp_df, col)
        # Append the temporary dataframe to the main dataframe
        # (DataFrame.append was removed in pandas 2.x, so use pd.concat)
        df = pd.concat([df, temp_df])
    df.reset_index(drop=True, inplace=True)
    return df.style.set_properties(**{'border': '1px solid black'})

# Calling the function
df = process_json_data('/content/jsonShredd/data.json', '/content/jsonShredd/mapping.json', 'datapoints')
df
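The mapping file is just column-name to JSONPath pairs; a hypothetical example of what mapping.json could look like for a user document like the one above (column names invented for illustration):
import json

# Hypothetical mapping.json contents: each key becomes a DataFrame column,
# each value is a JSONPath expression evaluated against every element
# under the chosen root key (here 'datapoints').
example_mapping = {
    "user_name": "$.name",
    "user_email": "$.email",
    "group_id": "$.groups[*]",
}

with open("mapping.json", "w") as f:
    json.dump(example_mapping, f, indent=2)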
Upvotes: -1
Reputation:
The schema is correct. The issue was that it was not the schema originally in place when I set the bucket properties. This issue on the Yokozuna GitHub was the culprit: I updated the schema after inserting new data, thinking that the indexes would reload. Currently, they do not.
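For anyone hitting the same thing, a minimal sketch of picking up a changed schema, assuming Riak 2.x's standard HTTP API on the devrel port used above (the schema/index names and the local file name are illustrative):
import requests

# Minimal sketch, assuming Riak 2.x's HTTP API on the devrel port above.
base = "http://localhost:10018"

# Upload the corrected schema.
with open("user_schema.xml", "rb") as f:
    requests.put(f"{base}/search/schema/user",
                 headers={"Content-Type": "application/xml"},
                 data=f.read())

# Re-create the index so it is built against the new schema; an existing
# index does not reload a changed schema, and already-stored objects are
# only re-indexed when they are written again.
# (If the index is still associated with a bucket or bucket type, that
# association may need to be removed before the delete succeeds.)
requests.delete(f"{base}/search/index/user")
requests.put(f"{base}/search/index/user",
             headers={"Content-Type": "application/json"},
             json={"schema": "user"})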
Upvotes: 0