Reputation: 79
I want to extract contents from large JSON files that appear to editors as one line (so I can't operate on a line basis), e.g.
{"license": 2, "file_name": "COCO_test2014_000000523573.jpg", "coco_url": "http://mscoco.org/images/523573", "height": 500, "width": 423, "date_captured": "2013-11-14 12:21:59", "id": 523573}, {"license . . .
For example, is there a way (sed, grep, ...?) I can search for the word 000000523573 and print the 100 characters preceding and 200 characters succeeding each occurrence of the word?
Upvotes: 0
Views: 83
Reputation: 52112
As demonstrated in ghoti's answer, jq is definitely your best bet.
As for your exact question ("search for the word 000000523573 and print the 100 characters preceding and 200 characters succeeding"): you could use grep -o as follows:
grep -Eo '.{100}000000523573.{200}' infile
This has a few drawbacks:
- If 000000523573 occurs within the first 100 characters of the file or within the last 200 characters, it will be ignored.
- Overlapping matches are not printed separately (a limitation of grep -o).
These can be alleviated somewhat by loosening the requirements to "print up to 100/200 characters before/after occurrences":
grep -Eo '.{0,100}000000523573.{0,200}' infile
But, again, the proper approach is to use jq. See also this question about command line JSON parsing.
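To make the "up to N characters" variant easier to try, here is a minimal sketch that wraps it in a shell function. The function name `context`, its argument order, and the sample input are all made up for illustration, and it assumes the search term contains no regex metacharacters.

```shell
# Hypothetical helper: print up to $2 characters before and up to $3 characters
# after each match of $1 in file $4.
# Assumes $1 contains no regex metacharacters.
context() {
    grep -Eo ".{0,$2}$1.{0,$3}" "$4"
}

# Sample one-line input, made up for illustration.
sample=$(mktemp)
printf '%s\n' '{"id": 1, "file_name": "COCO_test2014_000000523573.jpg", "height": 500}' > "$sample"

# Ask for 10 characters of leading context and 6 of trailing context.
result=$(context 000000523573 10 6 "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

With this sample, the call prints `_test2014_000000523573.jpg",`. For the sizes in the question you would call something like `context 000000523573 100 200 infile`.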
Upvotes: 0
Reputation: 46826
jq is the tool you want to use to parse JSON natively. If it's a structured format, don't treat it like random text.
$ jq . < input.json
{
"license": 2,
"file_name": "COCO_test2014_000000523573.jpg",
"coco_url": "http://mscoco.org/images/523573",
"height": 500,
"width": 423,
"date_captured": "2013-11-14 12:21:59",
"id": 523573
}
$ jq .height < input.json
500
To search for a particular JSON record that contains a particular string in the file_name field, you might do something like this:
jq 'select(.file_name|contains("000000523573"))' < input.json
The notation here is ... longer to explain than makes sense for a single SO answer. Do have a look at the JQ query structure if you're interested in using this tool.
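If the records are wrapped in a JSON array (an assumption here; the question only shows a truncated excerpt), a sketch along these lines iterates over the elements first and then applies the same filter. The sample data and the variable names are invented for illustration.

```shell
# Sample input, made up for illustration: two records wrapped in a JSON array.
input=$(mktemp)
printf '%s\n' '[{"file_name": "COCO_test2014_000000523573.jpg", "id": 523573}, {"file_name": "COCO_test2014_000000523574.jpg", "id": 523574}]' > "$input"

# .[] iterates over the array elements; select keeps the records whose
# file_name contains the substring. -c prints each match on a single line.
matches=$(jq -c '.[] | select(.file_name | contains("000000523573"))' "$input")
printf '%s\n' "$matches"
rm -f "$input"
```

Only the first record is printed here, in jq's compact form. Dropping `-c` gives the pretty-printed layout shown earlier in this answer.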
Upvotes: 2
Reputation: 12036
data.txt:
{"license": 2, "file_name": "COCO_test2014_000000523573.jpg", "coco_url": "http://mscoco.org/images/523573", "height": 500, "width": 423, "date_captured": "2013-11-14 12:21:59", "id": 523573}, {"license": 2, "file_name": "COCO_test2014_000000523574.jpg", "coco_url": "http://mscoco.org/images/523574", "height": 500, "width": 423, "date_captured": "2013-11-14 12:21:59", "id": 523574}
command (assumes GNU sed, since \s and \n in the replacement are GNU extensions):
sed 's/},\s{/}\n{/g' data.txt | grep "000000523573"
output:
{"license": 2, "file_name": "COCO_test2014_000000523573.jpg", "coco_url": "http://mscoco.org/images/523573", "height": 500, "width": 423, "date_captured": "2013-11-14 12:21:59", "id": 523573}
Upvotes: 0