user1681189

Reputation: 11

GSA - Get a subset of the index

I need to get a list of all the documents in a GSA (GSA 7) index/collection that contain one or more specific links. I have a list of URLs and need to find any docs that contain them (in the document body, not in metadata). There are some 700,000 docs fed from UCM, all full-text indexed. The number of docs containing the links is too large to retrieve through a regular search. Is there an OOTB way to do this? If not, what would be the way to go? I was thinking of creating a separate collection, but the filtering criteria only work on URLs, not on the contents of the files.

Thanks in advance, Z

Upvotes: 1

Views: 135

Answers (1)

BigMikeW

Reputation: 831

Using Entity Recognition, you can tag each document containing the URL pattern(s) you are interested in with a specific piece of metadata. You can then use this generated metadata tag to filter the results down to just those documents. Unfortunately, you are still reliant on running a search to find them, and you would need to wait for the GSA to re-crawl all your content after creating the ER rule before you could look for these documents.

Alternatively, if you are feeding them from a connector, you could add a Document Filter that checks the contents of each file being fed and logs the URL of the current document somewhere (e.g. a file, a database or a web service) if it contains the pattern you are looking for. This would still require a re-crawl, but at least you would not need to run a search to find matches; you can just consult your log.
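To make the check-and-log idea concrete, here is a minimal sketch of just the matching and logging step, written in Python for brevity. It is not the connector's Document Filter API; the function names, the tab-separated log format and the log path are all assumptions made for illustration, and the real filter would need to be wired into whatever hook your connector provides.

```python
import re
from typing import Iterable, List


def build_link_pattern(urls: Iterable[str]) -> "re.Pattern[str]":
    """Compile one case-insensitive regex matching any of the target URLs."""
    # Escape each URL so characters such as '.' and '?' match literally.
    # Longest first, so a URL is preferred over a shorter URL that prefixes it.
    escaped = sorted((re.escape(u) for u in urls), key=len, reverse=True)
    if not escaped:
        raise ValueError("at least one target URL is required")
    return re.compile("|".join(escaped), re.IGNORECASE)


def find_links(body: str, pattern: "re.Pattern[str]") -> List[str]:
    """Return the distinct target links found in a document body (lowercased, sorted)."""
    return sorted({m.group(0).lower() for m in pattern.finditer(body)})


def log_if_match(doc_url: str, body: str, pattern: "re.Pattern[str]", log_path: str) -> bool:
    """Append 'doc_url<TAB>link1,link2' to the log if the body contains a target link."""
    hits = find_links(body, pattern)
    if hits:
        # Hypothetical log format: one line per matching document.
        with open(log_path, "a", encoding="utf-8") as log:
            log.write(doc_url + "\t" + ",".join(hits) + "\n")
    return bool(hits)
```

The filter would call `log_if_match` once per fed document with the extracted text. Compiling the URL list into a single pattern up front keeps the per-document cost to one pass over the body, which matters with around 700,000 documents.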

Upvotes: 1
