Batch predictions Vertex AI

How do I create a JSONL file that contains the list of files in a Google Cloud Storage bucket for batch prediction in Vertex AI? Here is what I've tried so far.

  1. Get the list of files from the bucket and write it to a text file: gsutil ls gs://bucket/dir > list.txt
  2. Convert list.txt to list.jsonl following the Vertex AI docs:
{"content": "gs://sourcebucket/datasets/images/source_image1.jpg", "mimeType": "image/jpeg"}
{"content": "gs://sourcebucket/datasets/images/source_image2.jpg", "mimeType": "image/jpeg"}

After creating the batch prediction, I got this error: cannot be parsed as JSONL. How do I correct the format of this JSONL file? Also, is there any way to directly export the list of files in the bucket to a JSONL file?

Upvotes: 2

Views: 791

Answers (1)

RinsedAndRepeated

Reputation: 21

Here is some Python code you can run to create a working JSON Lines file from the list. (It's not totally clear in the Google ML documentation, so for people new to this process: in the Google Vertex AI command shell you use Unix commands to create the list from the contents of the folder in the first place. If "ls" and "cat" are new to you, find yourself a Unix geek.) If you are new to running Python scripts on Windows/macOS/Linux/YourFlavorOfWeirdness, there are all kinds of internet tutorials on what to do. First, save this code snippet as "googleparse.py".

Assuming an input file named "googlelist.txt" and an output file named "googleparse.jsonl", enter the following at your command prompt.

% python3 googleparse.py -o googleparse.jsonl googlelist.txt

#
# googleparse.py by Cyberchuck2000:
#
# Parse a list of images from the Google Cloud and format
# into the Google parse format
#
import argparse
import json

parser = argparse.ArgumentParser(description='Produce JSONL files for Google Parse')
parser.add_argument('inputfilename')
parser.add_argument('-o', dest='outputfilename', default='googleparse.jsonl')

# argparse exits with a usage message if the input file name is missing,
# so no separate "no args" branch is needed
args = parser.parse_args()
print('The file name is {}, output is {}'.format(args.inputfilename, args.outputfilename))

with open(args.inputfilename) as inputf:
    lines = inputf.readlines()

with open(args.outputfilename, 'w') as writef:
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines so no empty records are written
        # json.dumps writes double-quoted strings and escapes special
        # characters; JSON does not allow single-quoted strings
        record = {"content": line, "mimeType": "image/jpeg"}
        writef.write(json.dumps(record) + "\n")

print("**DONE**")
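
If you still get "cannot be parsed as JSONL", it can help to check the file locally before uploading it. Below is a minimal sketch, not anything from the Vertex AI docs: uris_to_jsonl and find_bad_lines are hypothetical helper names I made up, and the only assumption about the format is the {"content": ..., "mimeType": ...} shape shown in the question. Skipping lines that end in "/" or ":" is also an assumption about what "gsutil ls" prints for subfolders, so adjust that if your listing looks different.

```python
import json


def uris_to_jsonl(uris, mime_type="image/jpeg"):
    """Build JSONL text with one {"content", "mimeType"} object per URI.

    Hypothetical helper: the record shape follows the example in the question.
    """
    out = []
    for uri in uris:
        uri = uri.strip()
        # Assumption: blank lines and folder-style entries are not files.
        if not uri or uri.endswith("/") or uri.endswith(":"):
            continue
        out.append(json.dumps({"content": uri, "mimeType": mime_type}))
    return "\n".join(out) + ("\n" if out else "")


def find_bad_lines(jsonl_text):
    """Return (line_number, reason) for every line that is not a valid record."""
    problems = []
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, str(exc)))
            continue
        if not isinstance(record, dict) or not {"content", "mimeType"} <= record.keys():
            problems.append((number, "missing content or mimeType"))
    return problems


if __name__ == "__main__":
    import sys

    # Reads URIs from stdin and writes JSONL to stdout.
    sys.stdout.write(uris_to_jsonl(sys.stdin))
```

Saved under a name of your choice (say, tojsonl.py), this also covers the "directly export" part of the question, since you can pipe the listing straight in: gsutil ls gs://bucket/dir | python3 tojsonl.py > list.jsonl. Running find_bad_lines on an existing file's text gives you the line numbers that would fail to parse.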

Upvotes: 1
