How to filter out non-json documents in MarkLogic?

I have a lot of data loaded in my database, and some of the documents are not JSON files but binary files. Correct data looks like this: "/foo/bar/1.json", but the incorrect data has the form "/foo/bar/*". Is there a mechanism in MarkLogic using JavaScript where I can filter out this junk data and delete it? PS: I'm unable to extract files with mlcp that have a "?" in the URI, and that may be why I get this error when I try to reload the data. Is there any way to fix that extract along with this?

Upvotes: 1

Answers (1)

Mads Hansen


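If all of the unwanted document URIs contain a ? and are in that directory, then you could use cts.uriMatch(). Keep in mind that ? is itself a single-character wildcard in the match pattern, so a pattern such as '/foo/bar/*?*' would also match the good URIs. Match the directory and test for a literal ? before deleting:

declareUpdate();
for (const uri of cts.uriMatch('/foo/bar/*')) {
  // "?" in a wildcard pattern matches any single character,
  // so check for a literal "?" in the URI here instead
  if (fn.contains(uri, '?')) {
    xdmp.documentDelete(uri);
  }
}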

Alternatively, if you are looking to find the binary() documents, you can apply the format-binary option to a cts.search() with a cts.directoryQuery() and then delete them.

declareUpdate();
for (const doc of cts.search(cts.directoryQuery("/foo/bar/"), ['format-binary']) ) {
  xdmp.documentDelete(fn.baseUri(doc));
}

They are probably being persisted as binary because the URI no longer ends with a recognizable file extension when it ends with a question mark and querystring parameters, e.g. 1.json?foo=bar instead of 1.json.
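To illustrate the point about the missing extension, here is a minimal sketch in plain JavaScript. The helper names naiveExtension and stripQueryString are hypothetical and not part of any MarkLogic API, and the idea that the format is picked from the text after the last dot is an assumption made for illustration only. The helpers work on plain JavaScript strings.

```javascript
// Hypothetical helpers for illustration; not part of any MarkLogic API.

// Naive "extension": everything after the last "." in the last path segment.
function naiveExtension(uri) {
  const lastSegment = uri.slice(uri.lastIndexOf('/') + 1);
  const dot = lastSegment.lastIndexOf('.');
  return dot === -1 ? '' : lastSegment.slice(dot + 1);
}

// Drop everything from the first "?" onward to recover the intended URI.
function stripQueryString(uri) {
  const q = uri.indexOf('?');
  return q === -1 ? uri : uri.slice(0, q);
}

const loaded = '/foo/bar/1.json?foo=bar';
console.log(naiveExtension(loaded));                   // "json?foo=bar", not "json"
console.log(stripQueryString(loaded));                 // "/foo/bar/1.json"
console.log(naiveExtension(stripQueryString(loaded))); // "json"
```

If the querystring is stripped from the URI before the documents are loaded, the URI ends in .json again, which is what the correct data in the question looks like.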

It is difficult to diagnose and troubleshoot without seeing what your MLCP job configs are and knowing more about what you are doing to load the data.

Upvotes: 0
