Reputation: 21
How do I create a JSONL file that contains a list of files in a Google Cloud bucket for batch prediction in Vertex AI?
What I've tried so far:
gsutil ls gs://bucket/dir > list.txt
Converted list.txt to list.jsonl following the Vertex AI docs:
{"content": "gs://sourcebucket/datasets/images/source_image1.jpg", "mimeType": "image/jpeg"}
{"content": "gs://sourcebucket/datasets/images/source_image2.jpg", "mimeType": "image/jpeg"}
After creating the batch prediction, I got this error: cannot be parsed as JSONL.
How do I correct the format of this JSONL file?
Also, is there any way to directly export the list of files in the bucket to JSONL format?
Upvotes: 2
Views: 791
Reputation: 21
Here is some Python code you can run to create a working JSON Lines file from the list. (It's not totally clear in the Google ML documentation, so for people new to this process: in the Google Vertex AI command shell you use Unix commands to create the list from the contents of the folder in the first place. If "ls" and "cat" are new to you, find yourself a Unix geek.) If you are new to running Python scripts on Windows/macOS/Linux/YourFlavorOfWeirdness, there are all kinds of internet tutorials on what to do.
First, save the code snippet below as "googleparse.py".
Assuming an input file named "googlelist.txt" and an output file named "googleparse.jsonl", enter the following at your command prompt:
% python3 googleparse.py -o googleparse.jsonl googlelist.txt
#
# googleparse.py by Cyberchuck2000:
#
# Parse a list of images from the Google Cloud and format
# into the Google parse format
#
import argparse
import json

parser = argparse.ArgumentParser(description='Produce JSONL files for Google Parse')
parser.add_argument('inputfilename')
parser.add_argument('-o', dest='outputfilename', default='googleparse.jsonl')
args = parser.parse_args()
print('The file name is {}, output is {}'.format(args.inputfilename, args.outputfilename))

# Read the listing: one gs:// URI per line
with open(args.inputfilename) as inputf:
    lines = inputf.readlines()

with open(args.outputfilename, 'w') as writef:
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the listing
        # json.dumps emits double-quoted strings; single-quoted values are not valid JSON
        record = {"content": line, "mimeType": "image/jpeg"}
        writef.write(json.dumps(record) + "\n")

print("**DONE**")
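Before uploading, it may help to check locally that every line of the output parses as JSON, since that seems to be what the "cannot be parsed as JSONL" error is complaining about. The sketch below is only an illustration: the function name find_invalid_jsonl_lines is made up for this example, and the check for a "content" key is an assumption based on the sample lines in the question, not a full statement of what Vertex AI validates.

```python
import json


def find_invalid_jsonl_lines(text):
    """Return (line_number, reason) pairs for lines that fail the check."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        try:
            # Each line must be a complete JSON document on its own.
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, exc.msg))
            continue
        # Assumption: each record is an object with a "content" key,
        # as in the sample lines from the question.
        if not isinstance(record, dict) or "content" not in record:
            problems.append((number, "missing 'content' key"))
    return problems
```

To use it, read the generated file and print whatever comes back, for example find_invalid_jsonl_lines(open("googleparse.jsonl").read()). An empty list means every line parsed.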
Upvotes: 1
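On the question's second point (exporting the bucket listing straight to JSONL), one possible approach is to skip the intermediate text file and build the records in Python. This is a rough sketch, not an official method: listing_to_jsonl and export_bucket_listing are hypothetical names, the second function assumes the google-cloud-storage package is installed and credentials are configured, and falling back to "image/jpeg" for unknown extensions simply mirrors the hard-coded value in the script above.

```python
import json
import mimetypes


def listing_to_jsonl(uris, default_mime="image/jpeg"):
    """Turn an iterable of gs:// URIs into JSONL text, one record per line."""
    lines = []
    for uri in uris:
        uri = uri.strip()
        # Skip blank lines and "directory" entries such as gs://bucket/dir/
        if not uri or uri.endswith("/"):
            continue
        # Guess the MIME type from the file extension; fall back to the default.
        mime_type, _ = mimetypes.guess_type(uri)
        record = {"content": uri, "mimeType": mime_type or default_mime}
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n" if lines else ""


def export_bucket_listing(bucket_name, prefix, out_path):
    """List objects under a prefix and write them out as JSONL."""
    # Assumption: google-cloud-storage is installed and authenticated.
    from google.cloud import storage

    client = storage.Client()
    uris = (
        f"gs://{bucket_name}/{blob.name}"
        for blob in client.list_blobs(bucket_name, prefix=prefix)
    )
    with open(out_path, "w") as out:
        out.write(listing_to_jsonl(uris))
```

listing_to_jsonl also accepts the lines of a gsutil ls listing directly, so listing_to_jsonl(open("list.txt")) should cover the list.txt route from the question without touching the storage client.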