Daniel Mills

Reputation: 1

AWS Kendra - Create Web Crawler Data Source Java SDK 2 (Groovy)

I am trying to create a new Kendra Web Crawler V2 data source on an existing Kendra index in a Groovy script using the AWS Java V2 SDK (v2.20.162). However, the response from "kendraClient.createDataSource" is always just "Something went wrong", with no further details in the exception. The issue seems to be the template configuration, but I am struggling to spot my error, and the documentation on this from AWS seems a bit sparse.

This tech stack is a requirement of the organisation I am working for. I cannot use the web console or CloudFormation.

To create the data source, I am using the following code (simplified and with dummy values for this example):

import software.amazon.awssdk.core.document.Document
import software.amazon.awssdk.services.kendra.KendraClient
import software.amazon.awssdk.services.kendra.model.DataSourceType
import software.amazon.awssdk.services.kendra.model.Tag

import java.nio.file.Files
import java.nio.file.Paths

static void main(String[] args) {
    KendraClient kendraClient = getKendraClient()
    String indexId = getIndexId(kendraClient)
    String roleArn = getRoleArn(kendraClient)
    List<Tag> tags = getTags()
    String documentJsonStr = Files.readString(Paths.get(getClass().getClassLoader().getResource("json/template-kendra-web-crawler.json").toURI()))
    Document document = Document.fromString(documentJsonStr)

    kendraClient.createDataSource { ds -> ds
        .indexId(indexId)
        .type(DataSourceType.TEMPLATE)
        .name("test-wc-v2-datasource")
        .description("Testing the web crawler v2 datasource")
        .roleArn(roleArn)
        .languageCode("en")
        .tags(tags)
        .schedule("cron(0 18 ? * MON-FRI *)")
        .vpcConfiguration { vpc -> vpc
            .subnetIds("subnet-12345", "subnet-67890")
            .securityGroupIds("sg-12345")
        }
        .configuration { config -> config
            .templateConfiguration { tmplConfig -> tmplConfig
                .template(document)
            }
        }
    }
}

The contents of the JSON file used to create the body of the template is as follows:

{
  "connectionConfiguration": {
    "repositoryEndpointMetadata": {
      "authentication": "NoAuthentication",
      "siteMapUrls": [
        "https://www.my-domain.com/sitemap.xml"
      ]
    }
  },
  "repositoryConfigurations": {
    "webPage": {
      "fieldMappings": [
        {
          "dataSourceFieldName": "sourceUrl",
          "indexFieldName": "_source_uri",
          "indexFieldType": "STRING"
        },
        {
          "dataSourceFieldName": "category",
          "indexFieldName": "_category",
          "indexFieldType": "STRING"
        }
      ]
    },
    "attachment": {
      "fieldMappings": [
        {
          "dataSourceFieldName": "sourceUrl",
          "indexFieldName": "_source_uri",
          "indexFieldType": "STRING"
        },
        {
          "dataSourceFieldName": "category",
          "indexFieldName": "_category",
          "indexFieldType": "STRING"
        }
      ]
    }
  },
  "syncMode": "FULL_CRAWL",
  "additionalProperties": {
    "exclusionFileIndexPatterns": [],
    "exclusionURLCrawlPatterns": [],
    "exclusionURLIndexPatterns": [],
    "inclusionFileIndexPatterns": [],
    "inclusionURLCrawlPatterns": [
      ".*/gb-en/.*",
      ".*/de-en/.*"
    ],
    "inclusionURLIndexPatterns": [],
    "proxy": {},
    "rateLimit": "300",
    "crawlAllDomain": false,
    "crawlAttachments": true,
    "crawlDepth": "5",
    "maxFileSize": "50",
    "crawlSubDomain": true,
    "maxLinksPerUrl": "1000",
    "honorRobots": false
  },
  "type": "WEBCRAWLERV2"
}

I have tried simpler examples without the VPC configuration, listing only the properties required by the JSON schema, using default values, etc., but the response from kendraClient.createDataSource is always an exception with a 500 status and the "helpful" message "Something went wrong".

I suspect the way I am parsing the templateConfiguration.template document is wrong - but I was not able to find another way. Any help would be greatly appreciated!

Upvotes: 0

Views: 321

Answers (1)

Teja K

Reputation: 1

You can refer to this CLI documentation for the Configuration structure: Reference for Configuration structure

Additional References: WebCrawler Schema
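
One thing that may be worth checking against that schema is how the template reaches the SDK. Document.fromString wraps its argument in a single string node and does not parse JSON, so the code in the question would send the whole file as one string value rather than a structured object. Whether that is what triggers the 500 is not confirmed here. A minimal sketch of building a structured Document instead, assuming Groovy's JsonSlurper is available and using a hypothetical helper named toDocument (not part of the SDK):

```groovy
@Grab(group = 'software.amazon.awssdk', module = 'sdk-core', version = '2.20.162')
import software.amazon.awssdk.core.document.Document
import groovy.json.JsonSlurper

// Hypothetical helper (not an SDK method): recursively maps values
// produced by JsonSlurper onto the matching Document node types.
Document toDocument(Object value) {
    if (value == null) return Document.fromNull()
    if (value instanceof Map) {
        Map<String, Document> converted = [:]
        value.each { k, v -> converted[k.toString()] = toDocument(v) }
        return Document.fromMap(converted)
    }
    if (value instanceof List) return Document.fromList(value.collect { toDocument(it) })
    if (value instanceof Boolean) return Document.fromBoolean(value)
    // Go through BigDecimal so a single fromNumber overload is always selected.
    if (value instanceof Number) return Document.fromNumber(new BigDecimal(value.toString()))
    return Document.fromString(value.toString())
}

// Dummy JSON standing in for the contents of the template file.
String json = '{"type": "WEBCRAWLERV2", "syncMode": "FULL_CRAWL", "additionalProperties": {"crawlAllDomain": false}}'

// fromString yields one opaque string node; the helper yields a map node.
assert Document.fromString(json).isString()
assert toDocument(new JsonSlurper().parseText(json)).isMap()
```

The Document returned by toDocument could then be passed to .template(...) in place of the one created with Document.fromString.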

If you still face issues, you can file a ticket with the Kendra team, including relevant details such as the indexId and datasourceId, so that they can look into the error for you.
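
To gather details for such a ticket, the exception thrown by createDataSource may carry more than its message. A minimal sketch, assuming the failure surfaces as an AwsServiceException (the base class of the SDK's service exceptions); describeFailure is a hypothetical helper, and the exception below is built by hand with made-up values purely as a stand-in for a real failure:

```groovy
@Grab(group = 'software.amazon.awssdk', module = 'aws-core', version = '2.20.162')
import software.amazon.awssdk.awscore.exception.AwsErrorDetails
import software.amazon.awssdk.awscore.exception.AwsServiceException

// Hypothetical helper: collects the fields a support ticket is likely to need.
String describeFailure(AwsServiceException e) {
    "status=${e.statusCode()} requestId=${e.requestId()} " +
        "code=${e.awsErrorDetails()?.errorCode()} message=${e.awsErrorDetails()?.errorMessage()}"
}

// Stand-in for an exception caught around kendraClient.createDataSource { ... };
// the request ID and error code here are invented for illustration.
AwsServiceException sample = AwsServiceException.builder()
    .statusCode(500)
    .requestId('example-request-id')
    .awsErrorDetails(AwsErrorDetails.builder()
        .errorCode('ExampleErrorCode')
        .errorMessage('Something went wrong')
        .build())
    .build()

println describeFailure(sample)
```

In the real script, the same helper would be called from a catch (AwsServiceException e) block wrapped around the createDataSource call.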

Upvotes: 0
