marcin2x4

Reputation: 1449

Get tables from AWS Glue using boto3

I need to harvest table and column names from the AWS Glue crawler metadata catalog. I used boto3, but I keep getting exactly 100 tables even though there are more. Setting NextToken doesn't help. Please help if possible.

The desired result is a list as follows:

lst = [table_one.col_one, table_one.col_two, table_two.col_one....table_n.col_n]

def harvest_aws_crawler():
    glue = boto3.client('glue', region_name='')
    response = glue.get_tables(DatabaseName='', NextToken='')

    #response syntax:
    #https://boto3.amazonaws.com/v1/documentation/api/1.9.42/reference/services/glue.html#Glue.Client.get_tables
    crawler_list_tables = []

    for tables in response['TableList']:
        while response.get('NextToken') is not None:
            crawler_list_tables.append(tables['Name'])
            break
    print(len(crawler_list_tables))

harvest_aws_crawler()

UPDATED code; I still need to get tablename+columnname:

def harvest_aws_crawler():
    glue = boto3.client('glue', region_name='')
    next_token = ""

    #response syntax:
    #https://boto3.amazonaws.com/v1/documentation/api/1.9.42/reference/services/glue.html#Glue.Client.get_tables
    response = glue.get_tables(DatabaseName='', NextToken=next_token)

    tables_from_crawler = []
    while True:
        table_list = response['TableList']
        for table_dict in table_list:
            table_name = table_dict['Name']

            #append table_name+column_name
            for columns in table_name['StorageDescriptor']['Columns']:
                tables_from_crawler.append(table_name + '.' + columns['Name'])

            #tables_from_crawler.append(table_name)
        next_token = response.get('NextToken')
        if next_token is None:
            break
    print(tables_from_crawler)

harvest_aws_crawler()

Upvotes: 2

Views: 13085

Answers (4)

Tao

Reputation: 41

You should use MaxResults:

response = glue.get_tables(DatabaseName='', NextToken='', MaxResults=number_greater_than_your_actual_table_count)

Upvotes: 0

Yaren Sabır

Reputation: 1

tables = list(dynamodb_resource.tables.all()) worked for me. If I need only the names in my script, I additionally use table_name = tables[x].name

Upvotes: 0

marcin2x4

Reputation: 1449

Adding a sub-loop did the trick to get the table+column result.

#harvest aws crawler metadata
next_token = ""
client = boto3.client('glue', region_name='us-east-1')
crawler_tables = []

while True:
    response = client.get_tables(DatabaseName='', NextToken=next_token)
    for tables in response['TableList']:
        for columns in tables['StorageDescriptor']['Columns']:
            crawler_tables.append(tables['Name'] + '.' + columns['Name'])
    next_token = response.get('NextToken')
    if next_token is None:
        break
print(crawler_tables)
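The flattening step inside the loop can also be pulled into a small pure function, which makes it easy to test against a canned TableList without touching AWS. This is a sketch, not part of the original answer: the helper name flatten_table_columns and the sample data are my own, but the dict shape follows the GetTables response syntax linked above.

```python
def flatten_table_columns(table_list):
    """Turn a Glue GetTables 'TableList' into ['table.column', ...] strings."""
    result = []
    for table in table_list:
        # Some table types (e.g. views) may lack a StorageDescriptor, so use .get().
        storage = table.get('StorageDescriptor') or {}
        for column in storage.get('Columns', []):
            result.append(table['Name'] + '.' + column['Name'])
    return result

# Canned fragment shaped like a GetTables response (sample data, not real output).
sample = [
    {'Name': 'table_one',
     'StorageDescriptor': {'Columns': [{'Name': 'col_one'}, {'Name': 'col_two'}]}},
    {'Name': 'table_two',
     'StorageDescriptor': {'Columns': [{'Name': 'col_one'}]}},
]

print(flatten_table_columns(sample))
# ['table_one.col_one', 'table_one.col_two', 'table_two.col_one']
```

In the pagination loop you would then call flatten_table_columns(response['TableList']) on every page and extend the accumulator with the result.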

Upvotes: 2

Kulasangar

Reputation: 9464

You can try the below approach by using the paginator option:

import boto3

glue_client = boto3.client("glue")

def get_tables_for_database(database):
    starting_token = None
    next_page = True
    tables = []
    while next_page:
        paginator = glue_client.get_paginator(operation_name="get_tables")
        response_iterator = paginator.paginate(
            DatabaseName=database,
            PaginationConfig={"PageSize": 100, "StartingToken": starting_token},
        )
        for elem in response_iterator:
            tables += [
                {
                    "name": table["Name"],
                }
                for table in elem["TableList"]
            ]
            try:
                starting_token = elem["NextToken"]
            except KeyError:
                next_page = False
    return tables

and then invoke the method to list the tables for a given database:

for table in get_tables_for_database(database):
    print(f"Table: {table['name']}")

If you want to list the tables for every database out there in Glue, you may have to do an additional for loop in order to retrieve the databases first, and then extract the tables using the above snippet as your inner loop for each database.

Upvotes: 2
