Reputation: 45
I have a JSON data set with around 8.7 million key value pairs extracted from a Redis store, where each key is guaranteed to be an 8 digit number, and the key is an 8 alphanumeric character value i.e.
[{
"91201544":"INXX0019",
"90429396":"THXX0020",
"20140367":"ITXX0043",
...
}]
To reduce Redis memory usage, I want to transform this into a hash of hashes, where the hash prefix key is the first 6 characters of the key (see this link) and then store this back into Redis.
Specifically, I want my resulting JSON data structure (that I'll then write some code to parse this JSON structure and create a Redis command file consisting of HSET, etc) to look more like
[{
"000000": { "00000023": "INCD1234",
"00000027": "INCF1423",
....
},
....
"904293": { "90429300": "THXX0020",
"90429302": "THXX0024",
"90429305": "THXY0013"}
}]
Since I've been impressed by jq and I'm trying to be more proficient at functional style programming, I wanted to use jq for this task. So far I've come up with the following:
% jq '.[0] | to_entries | map({key: .key, pfx: .key[0:6], value: .value}) | group_by(.pfx)'
This gives me something like
[
[
{
"key": "00000130",
"pfx": "000001",
"value": "CAXX3231"
},
{
"key": "00000162",
"pfx": "000001",
"value": "CAXX4606"
}
],
[
{
"key": "00000238",
"pfx": "000002",
"value": "CAXX1967"
},
{
"key": "00000256",
"pfx": "000002",
"value": "CAXX0727"
}
],
....
]
I've tried the following:
% jq 'map(map({key: .pfx, value: {key, value}}))
| map(reduce .[] as $item ({}; {key: $item.key, value: [.value[], $item.value]} ))
| map( {key, value: .value | from_entries} )
| from_entries'
which does give me the correct result, but also prints out an error for every reduce (I believe) of
jq: error: Cannot iterate over null
The end result is
{
"000001": {
"00000130": "CAXX3231",
"00000162": "CAXX4606"
},
"000002": {
"00000238": "CAXX1967",
"00000256": "CAXX0727"
},
...
}
which is correct, but how can I avoid getting this stderr warning thrown as well?
Upvotes: 3
Views: 3297
Reputation: 116750
For the record, jq's group_by
relies on sorting, which of course will slow things down noticeably when the input is sufficiently large. The following is about 40% faster even when the input array has just 100,000 items:
def compress:
. as $in
| reduce keys[] as $key ({};
$key[0:6] as $k6
| $key[6:] as $k2
| .[$k6] += {($k2): $in[$key]} );
.[0] | compress
Given Jeff's input, the output is identical.
Upvotes: 0
Reputation: 12898
I understand that this is not what you are asking for but, just for the reference, I think it will be MUCH more faster to do this with Redis's built-in Lua scripting.
And it turns out that it is a bit more straightforward:
for _,key in pairs(redis.call('keys', '*')) do
local val = redis.call('get', key)
local short_key = string.sub(key, 0, -2)
redis.call('hset', short_key, key, val)
redis.call('del', key)
end
This will be done in place without transferring from/to Redis and converting to/from JSON.
Run it from console as:
$ redis-cli eval "$(cat script.lua)" 0
Upvotes: 0
Reputation: 134881
I'm not sure there's enough data here to assess what the source of the problem is. I find it hard to believe that what you tried results in that. I'm getting errors with that all the way.
Try this filter instead:
.[0]
| to_entries
| group_by(.key[0:6])
| map({
key: .[0].key[0:6],
value: map(.key=.key[6:8]) | from_entries
})
| from_entries
Given data that looks like this:
[{
"91201544":"INXX0019",
"90429396":"THXX0020",
"20140367":"ITXX0043",
"00000023":"INCD1234",
"00000027":"INCF1423",
"90429300":"THXX0020",
"90429302":"THXX0024",
"90429305":"THXY0013"
}]
Results in this:
{
"000000": {
"23": "INCD1234",
"27": "INCF1423"
},
"201403": {
"67": "ITXX0043"
},
"904293": {
"00": "THXX0020",
"02": "THXX0024",
"05": "THXY0013",
"96": "THXX0020"
},
"912015": {
"44": "INXX0019"
}
}
Upvotes: 2