Ben

Reputation: 79

How to extract contents of a large text file that appears to editors as only one line

I want to extract contents from large JSON files that appear to editors as one line (so I can't operate on a line basis), e.g.

{"license": 2, "file_name": "COCO_test2014_000000523573.jpg", "coco_url": "http://mscoco.org/images/523573", "height": 500, "width": 423, "date_captured": "2013-11-14 12:21:59", "id": 523573}, {"license . . .

For example, is there a way (sed, grep, ...?) I can search for the word 000000523573 and print the 100 characters preceding and 200 characters succeeding occurrences of the word?

Upvotes: 0

Views: 83

Answers (3)

Benjamin W.

Reputation: 52112

As demonstrated in ghoti's answer, jq is definitely your best bet.

As for your exact question ("search for the word 000000523573 and print the 100 characters preceding and 200 characters succeeding"): you could use grep -o as follows:

grep -Eo '.{100}000000523573.{200}' infile

This has a few drawbacks:

  • If 000000523573 occurs fewer than 100 characters after the start of the line (here, the whole file) or fewer than 200 characters before its end, that occurrence is not matched at all.
  • If a second occurrence starts fewer than 300 characters after the end of the previous one, it is skipped, because grep -o only reports non-overlapping matches.

These can be alleviated somewhat by loosening the requirements to "print up to 100/200 characters before/after occurrences":

grep -Eo '.{0,100}000000523573.{0,200}' infile
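Even in this form, overlapping windows are cut short, because grep -o never reports the same character twice. If that matters, one option is to ask grep only for the byte offsets of the matches and cut the context out separately. The following is a minimal sketch rather than a polished tool: the function name context is made up for illustration, it assumes a grep that supports -o and -b (GNU and BSD grep do) as well as head -c, and it counts bytes rather than characters, which only coincide for ASCII data.

```shell
# Hypothetical helper: print up to `before` bytes preceding and `after` bytes
# following every occurrence of a fixed string, even if the windows overlap.
context() {
  pattern=$1 before=$2 after=$3 file=$4
  # grep -ob prints "byte_offset:match" for each occurrence (0-based offsets);
  # -F treats the pattern as a fixed string, not a regular expression.
  grep -obF -- "$pattern" "$file" | while IFS=: read -r offset rest; do
    # Clamp the window start at the beginning of the file.
    start=$(( offset > before ? offset - before : 0 ))
    length=$(( offset - start + ${#pattern} + after ))
    # tail -c +N starts output at byte N (1-based); head -c limits the length.
    tail -c "+$(( start + 1 ))" "$file" | head -c "$length"
    echo
  done
}
```

It would be called as context 000000523573 100 200 infile. The file is re-read once per match, which is simple but may be slow if there are very many occurrences.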

But, again, the proper approach is to use jq. See also this question about command line JSON parsing.

Upvotes: 0

ghoti

Reputation: 46826

jq is the tool you want to use to parse JSON natively. If it's a structured format, don't treat it like random text.

$ jq . < input.json
{
  "license": 2,
  "file_name": "COCO_test2014_000000523573.jpg",
  "coco_url": "http://mscoco.org/images/523573",
  "height": 500,
  "width": 423,
  "date_captured": "2013-11-14 12:21:59",
  "id": 523573
}
$ jq .height < input.json
500

To select the JSON object whose file_name field contains a particular string, you might do something like this:

jq 'select(.file_name|contains("000000523573"))' < input.json

The notation here takes longer to explain than makes sense for a single SO answer. Do have a look at the jq query structure if you're interested in using this tool.
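As a rough orientation: in a jq filter, | feeds the output of one step into the next, .file_name reads a field, and select(...) keeps only the inputs for which the condition is true. If the objects are wrapped in an enclosing structure, it has to be iterated over first. The sketch below assumes, purely for illustration, that the records sit in an array under a top-level key named images; that key and the function name find_image are assumptions, not something shown in the question.

```shell
# Hypothetical wrapper: print every record whose file_name contains a substring.
# Assumes the file looks like {"images": [ {...}, {...} ]}.
find_image() {
  needle=$1 file=$2
  # --arg passes the shell value into jq as the string variable $needle;
  # -c prints each matching object compactly on a single line.
  jq -c --arg needle "$needle" \
    '.images[] | select(.file_name | contains($needle))' "$file"
}
```

It would be called as find_image 000000523573 input.json. Passing the search string with --arg avoids splicing shell variables into the filter text.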

Upvotes: 2

Flash Thunder

Reputation: 12036

data.txt:

{"license": 2, "file_name": "COCO_test2014_000000523573.jpg", "coco_url": "http://mscoco.org/images/523573", "height": 500, "width": 423, "date_captured": "2013-11-14 12:21:59", "id": 523573}, {"license": 2, "file_name": "COCO_test2014_000000523574.jpg", "coco_url": "http://mscoco.org/images/523574", "height": 500, "width": 423, "date_captured": "2013-11-14 12:21:59", "id": 523574}

command:

sed 's/},\s{/}\n{/g' data.txt | grep "000000523573"

output:

{"license": 2, "file_name": "COCO_test2014_000000523573.jpg", "coco_url": "http://mscoco.org/images/523573", "height": 500, "width": 423, "date_captured": "2013-11-14 12:21:59", "id": 523573}
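Two caveats apply to the command above: \s in the pattern and \n in the replacement are GNU sed extensions that POSIX does not guarantee, and splitting on a textual pattern goes wrong if "}, {" can occur inside a string value or between nested objects. Assuming neither of those occurs in the data, a more portable variant might look like the sketch below; the function name split_records is made up for illustration.

```shell
# Hypothetical helper: put each top-level object on its own line.
split_records() {
  # A backslash followed by a literal newline is the POSIX way to put a
  # newline into the replacement; " *" tolerates any number of spaces.
  sed 's/}, *{/}\
{/g' "$1"
}
```

It would be used as split_records data.txt | grep "000000523573", after which the usual line-based tools work again.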

Upvotes: 0
