Reputation: 69
I am working on parsing a web page: view-source:https://massive.ucsd.edu/ProteoSAFe/datasets.jsp. I want to parse the .jsp page and extract the JSON object embedded in it.
I am using Jsoup to fetch the page:
Document doc = Jsoup.connect("https://massive.ucsd.edu/ProteoSAFe/datasets.jsp").maxBodySize(0).get();
Then I use a Java Pattern to extract the JSON as a string:
Pattern p = Pattern.compile(String.format("\"%s\":\\s*(.*),", "dataset","\"%s\":\\s*(.*),", "datasetNum","\"%s\":\\s*(.*),", "title","\"%s\":\\s*(.*),", "user","\"%s\":\\s*(.*),", "site","\"%s\":\\s*(.*),", "flowname","\"%s\":\\s*(.*),", "createdMillis","\"%s\":\\s*(.*),", "created","\"%s\":\\s*(.*),", "fileCount","\"%s\":\\s*(.*),", "fileSizeKB","\"%s\":\\s*(.*),", "psms","\"%s\":\\s*(.*),", "peptides","\"%s\":\\s*(.*),", "variants","\"%s\":\\s*(.*),", "proteins","\"%s\":\\s*(.*),", "species","\"%s\":\\s*(.*),", "instrument","\"%s\":\\s*(.*),", "modification","\"%s\":\\s*(.*),", "pi","\"%s\":\\s*(.*),", "complete","\"%s\":\\s*(.*),", "status","\"%s\":\\s*(.*),", "private","\"%s\":\\s*(.*),", "hash","\"%s\":\\s*(.*),", "px","\"%s\":\\s*(.*),", "task","\"%s\":\\s*(.*),", "id"));
Matcher m = p.matcher(script.html());
While doing so I get an error: the last line is not parsed correctly. The match is cut off at the end, so I get the error
'A JSONObject text must end with '}' at character 577'.
Can anyone suggest a better way to parse this page and get the data?
Upvotes: 0
Views: 528
Reputation: 191738
While it seems like a bad idea to parse any HTML with regex, this works for me:
Pattern.compile("(?s)var datasets = (\\[.*?\\]);")
(Tested via Python, since that's all I have available).
Note that the captured text is a JSON array, so it should be parsed as a JSONArray, not a JSONObject.
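For what it's worth, here is a minimal Java sketch of that pattern in use. It runs against a small made-up HTML string rather than the live page, so the sample values, the class name DatasetsExtractor and the helper extractDatasetsJson are all hypothetical. I am assuming the page assigns the data as `var datasets = [...];` inside a script tag, which is what the regex above relies on.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DatasetsExtractor {
    // (?s) lets '.' match newlines; the lazy '.*?' stops at the first "];"
    private static final Pattern DATASETS =
            Pattern.compile("(?s)var datasets = (\\[.*?\\]);");

    // Returns the raw JSON array text, or null if the assignment is not found.
    static String extractDatasetsJson(String html) {
        Matcher m = DATASETS.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the page source (not real data).
        String html = "<script>\n"
                + "var datasets = [\n"
                + "{\"dataset\":\"EXAMPLE1\",\"title\":\"First\"},\n"
                + "{\"dataset\":\"EXAMPLE2\",\"title\":\"Second\"}\n"
                + "];\n"
                + "</script>";
        System.out.println(extractDatasetsJson(html));
    }
}
```

With the real page you would pass in the fetched source (for example doc.html() from Jsoup) and hand the returned string to new JSONArray(...) from org.json. One caveat: because the match is lazy, it would stop early if any field value happened to contain the text "];".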
Upvotes: 1