Reputation: 1039
Okay, so I've got over a hundred JSON files with predictable bad formatting in several places per file.
Instead of using [ ] to indicate an array, they use { }. For example:
"grid": {
"C1", "D1", "E1", "C2", "D2", "E2", "F2", "B3", "C3", "D3", "E3", "F3", "B4", "C4", "D4", "E4", "F4", "C5", "D5", "E5", "F5", "C6", "D6", "E6" },
Each file has multiple arrays in it with this problem, each with a different key.
I came up with this to fix the above example, but it isn't very universal:
sed 's/^\t\t"grid": {/\t\t"grid": [/; s/"E6" },$/"E6" ],/' myfile.json
I also tried writing a more complicated awk script, something along these lines:
awk -i inplace '/grid/ { gsub("{","["); gsub("}","]"); print $0 }' myfile.json
But it replaced the contents of myfile.json with only the row that contained the string "grid".
Is there a reliable one-liner to fix this issue?
Upvotes: 0
Views: 251
Reputation: 3985
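# Put every brace on its own line so that jq's error location points at a single brace.
JSON="$(sed -E 's/([}{])/\n\1\n/g' "$FILE")"
while :; do
    # jq emits this error for an "object" that does not contain key:value pairs.
    JQTEST=$(jq '.' <<<"$JSON" 2>&1 | grep "Objects must consist of key:value pairs at line")
    rc=$?
    if [ $rc -eq 0 ]; then
        LINE=$(sed -E "s/.* line ([0-9]+), .*/\1/" <<<"$JQTEST")
        COL=$(sed -E "s/.* column ([0-9]+)$/\1/" <<<"$JQTEST")
        [ "$COL" -ne 1 ] && LINE=$((LINE-1))
        JSON=$(sed -E "$LINE s/\{/[/; $LINE s/}/]/" <<<"$JSON")
    else
        # No such error left: stop looping.
        break
    fi
done
jq '.' <<<"$JSON" # > "new_${FILE}" or "${FILE}"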
$ cat test.json
"grid1": {"C1", "D1", "E1", "C2"},
"grid2": {"C1", "D1", "E1", "C2"},
"grid3": {"C1", "D1", "E1", "C2"}
"grid1": [
"grid2": [
"grid3": [
Upvotes: 1
Reputation: 189936
How's this? (Update: probably scroll down to the Perl version near the end.)
sed -e 's/{\(\([0-9.]\+\|false\|true\|null\|"[^"]*"\) *[,}]\)/[\1/g' \
-e 's/\([,[] *\([0-9.]\+\|false\|true\|null\|"[^"]*"\)\)}/\1]/g' file
In other words, if the thing after {"thing" or before "thing"} is a comma or a curly brace (and not a colon, like you would expect in a proper JSON dictionary), switch the curly to a square bracket.
(In the second expression, we will already have replaced any opening curly with a square one, so look for that instead.)
The regex could be made less fugly if your sed supports -E or -r, but unfortunately, this non-standard option is not portable. (In brief, it lets you use the ERE regex dialect instead of BRE, where you mind-numbingly have to backslash grouping parentheses etc.)
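For comparison, here is a rough sketch of the same two substitutions written in ERE, assuming a sed that accepts -E. The function name fix_braces and the sample line are made up for illustration; they are not from the question.

```shell
# fix_braces is a hypothetical helper name; it reads JSON-ish text on stdin.
# Assumes the sed at hand supports -E (ERE).
fix_braces() {
  sed -E \
    -e 's/\{(([0-9.]+|false|true|null|"[^"]*") *[,}])/[\1/g' \
    -e 's/([,[] *([0-9.]+|false|true|null|"[^"]*"))\}/\1]/g'
}

# Made-up sample: one real object wrapping one mis-bracketed array.
sample='{"name": "x", "grid": {"C1", "D1", 5}, "ok": true}'
fixed=$(printf '%s\n' "$sample" | fix_braces)
printf '%s\n' "$fixed"
```

The outer braces survive because the first value after the opening { is followed by a colon, and the last value before the closing } is preceded by a colon rather than a comma or square bracket.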
Unfortunately, it requires the curly to be on the same line as the contents of the array. Also, like any regex solution, it's not easily able to distinguish between (what looks like JSON inside) a quoted string and actual JSON.
I suppose the same approach could be extended to examine lines which start or end with a lone curly brace, but I'd switch to Awk or Perl for that. In fact, Perl's "slurp mode" perl -0777 could probably handle the entire input file in one go with minor modifications to the regexes.
perl -0777 -pe '
s/\{(\s*(?:[0-9.]+|false|true|null|"[^"]*")\s*[,}])/[$1/g;
s/([,[]\s*(?:[0-9.]+|false|true|null|"[^"]*")\s*)\}/$1]/g' file.json
This removes any reliance on newlines for analyzing the file, since we read all of it into memory, and rely on \s to match any whitespace, including newlines.
If you want to modify the file in-place, Perl also supports the -i option, like some versions of sed.
Upvotes: -1
Reputation: 36828
I propose the following GNU AWK solution. Let the content of file.json be
{"hello": 1,
"grid": {"C1", "D1", "E1", "C2", "D2", "E2", "F2", "B3", "C3", "D3", "E3", "F3", "B4", "C4", "D4", "E4", "F4", "C5", "D5", "E5", "F5", "C6", "D6", "E6"},
"something": "else"}
awk 'BEGIN{FPAT=".";OFS=""}/grid/&&match($0,/\{[^}]*\}/){$RSTART="[";$(RSTART+RLENGTH-1)="]"}{print}' file.json
gives output
{"hello": 1,
"grid": ["C1", "D1", "E1", "C2", "D2", "E2", "F2", "B3", "C3", "D3", "E3", "F3", "B4", "C4", "D4", "E4", "F4", "C5", "D5", "E5", "F5", "C6", "D6", "E6"],
"something": "else"}
Explanation: first I tell GNU AWK that a field is any single character (.) and that the output field separator (OFS) is the empty string (without that there would be unwanted spaces in the output). Then, for each line that has grid in it and contains a literal { followed by zero or more (*) characters that are not (^) } and then a literal }, I replace the first character of the match ($RSTART) with [ and the last character of the match ($(RSTART+RLENGTH-1)) with ]. Every line, altered or not, is then printed. Note that I use the match function rather than just a regular expression because I then use RSTART and RLENGTH, which are set by that function. Note also that the return value of match is used as part of the condition, so if a line contains grid but no { ... } pair, that line remains unchanged.
(tested in gawk 4.2.1)
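Since the question mentions several arrays per file, each under a different key, here is a rough sketch of the same match/RSTART/RLENGTH idea without the grid condition and without FPAT, so it should not need GNU AWK. It assumes every broken array sits on one line and that no string inside an array contains a colon or a brace; the function name fix_line_arrays is invented for the example.

```shell
# fix_line_arrays is a hypothetical helper; it reads text on stdin.
# Assumption: a {...} span with no colon and no nested brace is really an array.
fix_line_arrays() {
  awk '{
    out = ""; rest = $0
    # Repeatedly take the leftmost qualifying {...} span in the remaining text.
    while (match(rest, /\{[^{}:]+\}/)) {
      out = out substr(rest, 1, RSTART - 1) "[" substr(rest, RSTART + 1, RLENGTH - 2) "]"
      rest = substr(rest, RSTART + RLENGTH)
    }
    print out rest
  }'
}

# Made-up sample with two mis-bracketed arrays under different keys.
line='{"hello": 1, "grid": {"C1", "D1"}, "other": {"A", "B"}, "x": "y"}'
converted=$(printf '%s\n' "$line" | fix_line_arrays)
printf '%s\n' "$converted"
```

An empty {} is left alone because the pattern requires at least one character between the braces, which sidesteps the question of whether it was meant as an object or an array.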
Upvotes: 1