arbind Sah

Reputation: 1

How to integrate DynamoDB with PySpark or Databricks?

I am trying to connect Amazon DynamoDB with PySpark, following this documentation, but the output contains no data from my table. This is my code to connect DynamoDB with PySpark:

import os

from pyspark.sql import SparkSession

os.environ["AWS_ACCESS_KEY"] = "myAccessKey"
os.environ["AWS_SECRET_ACCESS_KEY"] = "mySecretKey"

spark = SparkSession.builder \
    .appName("SQL Server to PySpark") \
    .config("spark.jars.packages", "com.audienceproject:spark-dynamodb_2.12:1.1.2") \
    .getOrCreate()

# Load a DataFrame from a Dynamo table. Only incurs the cost of a single scan for schema inference.
df = spark.read.option("tableName", "test_db") \
    .format("dynamodb") \
    .load()

df.show()

This is the output I get:
24/10/07 12:05:19 WARN Utils: Your hostname, arvind-ThinkPad-T480s resolves to a loopback address:; using instead (on interface wlp61s0)
24/10/07 12:05:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/home/arvind/Desktop/pyspark/spark-3.5.0-bin-hadoop3/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/arvind/.ivy2/cache
The jars for the packages stored in: /home/arvind/.ivy2/jars
com.audienceproject#spark-dynamodb_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-5377cf76-fb30-4e81-9f86-88e9a7474777;1.0
        confs: [default]
        found com.audienceproject#spark-dynamodb_2.12;1.1.2 in central
        found com.amazonaws#aws-java-sdk-sts;1.11.678 in central
        found com.amazonaws#aws-java-sdk-core;1.11.678 in central
        found commons-logging#commons-logging;1.1.3 in central
        found org.apache.httpcomponents#httpclient;4.5.9 in central
        found org.apache.httpcomponents#httpcore;4.4.11 in central
        found commons-codec#commons-codec;1.11 in central
        found;1.0.2 in central
        found com.fasterxml.jackson.core#jackson-databind; in central
        found com.fasterxml.jackson.core#jackson-annotations;2.6.0 in central
        found com.fasterxml.jackson.core#jackson-core;2.6.7 in central
        found com.fasterxml.jackson.dataformat#jackson-dataformat-cbor;2.6.7 in central
        found joda-time#joda-time;2.8.1 in central
        found com.amazonaws#jmespath-java;1.11.678 in central
        found com.amazonaws#aws-java-sdk-dynamodb;1.11.678 in central
        found com.amazonaws#aws-java-sdk-s3;1.11.678 in central
        found com.amazonaws#aws-java-sdk-kms;1.11.678 in central
        found org.slf4j#slf4j-api;1.7.25 in central
:: resolution report :: resolve 602ms :: artifacts dl 28ms
        :: modules in use:
        com.amazonaws#aws-java-sdk-core;1.11.678 from central in [default]
        com.amazonaws#aws-java-sdk-dynamodb;1.11.678 from central in [default]
        com.amazonaws#aws-java-sdk-kms;1.11.678 from central in [default]
        com.amazonaws#aws-java-sdk-s3;1.11.678 from central in [default]
        com.amazonaws#aws-java-sdk-sts;1.11.678 from central in [default]
        com.amazonaws#jmespath-java;1.11.678 from central in [default]
        com.audienceproject#spark-dynamodb_2.12;1.1.2 from central in [default]
        com.fasterxml.jackson.core#jackson-annotations;2.6.0 from central in [default]
        com.fasterxml.jackson.core#jackson-core;2.6.7 from central in [default]
        com.fasterxml.jackson.core#jackson-databind; from central in [default]
        com.fasterxml.jackson.dataformat#jackson-dataformat-cbor;2.6.7 from central in [default]
        commons-codec#commons-codec;1.11 from central in [default]
        commons-logging#commons-logging;1.1.3 from central in [default]
        joda-time#joda-time;2.8.1 from central in [default]
        org.apache.httpcomponents#httpclient;4.5.9 from central in [default]
        org.apache.httpcomponents#httpcore;4.4.11 from central in [default]
        org.slf4j#slf4j-api;1.7.25 from central in [default]
        ;1.0.2 from central in [default]
        :: evicted modules:
        commons-logging#commons-logging;1.2 by [commons-logging#commons-logging;1.1.3] in [default]
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        |      default     |   19  |   0   |   0   |   1   ||   18  |   0   |
:: retrieving :: org.apache.spark#spark-submit-parent-5377cf76-fb30-4e81-9f86-88e9a7474777
        confs: [default]
        0 artifacts copied, 18 already retrieved (0kB/16ms)
24/10/07 12:05:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/10/07 12:05:31 WARN V2ScanPartitioningAndOrdering: Spark ignores the partitioning OutputPartitioning. Please use KeyGroupedPartitioning for better performance


| |



Why am I getting this? Is it a driver issue? Also, if there is another way to connect, please explain it.

Upvotes: 0

Views: 137

Answers (1)

Prasad Nadiger

Reputation: 353

Try adding the S3 access key and S3 secret key to the Spark session config:

import os

from pyspark.sql import SparkSession

# Configure the Spark session; the keys are read from the environment
# variables set in the question.
spark = SparkSession.builder \
    .appName("SQL Server to PySpark") \
    .config("spark.jars.packages", "com.audienceproject:spark-dynamodb_2.12:1.1.2") \
    .config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY"]) \
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"]) \
    .getOrCreate()
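
It may also be worth ruling out a region mismatch. The sketch below assumes the connector accepts a "region" reader option next to "tableName" (that is how I recall its README describing it, so please verify against the version you use). The helper names dynamodb_read_options and load_dynamodb_table are made up for this example.

```python
def dynamodb_read_options(table_name, region=None):
    """Build the reader options for the spark-dynamodb connector.

    "tableName" is the option used in the question; "region" is an
    assumption about the connector and should be checked in its README.
    """
    options = {"tableName": table_name}
    if region is not None:
        options["region"] = region
    return options


def load_dynamodb_table(spark, table_name, region=None):
    """Read a DynamoDB table into a DataFrame using an existing SparkSession."""
    return (
        spark.read.options(**dynamodb_read_options(table_name, region))
        .format("dynamodb")
        .load()
    )
```

With the session from the snippet above, something like load_dynamodb_table(spark, "test_db", region="eu-west-1").show() would read the table from that region; "eu-west-1" is only a placeholder.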

Also check that the AWS IAM role attached to the service you are running has all the required access.
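
One way to check the access outside Spark is a quick boto3 call with the same credentials. This is only a sketch: it assumes boto3 is installed, the region "us-east-1" is a placeholder, and check_table_access is a name made up for this example. If the scan returns no items here either, the problem is likely with the table, region, or permissions rather than the Spark driver.

```python
def check_table_access(client, table_name):
    """Summarize what the given DynamoDB client can see for one table."""
    # describe_table needs dynamodb:DescribeTable permission.
    description = client.describe_table(TableName=table_name)["Table"]
    # A one-item scan needs dynamodb:Scan permission.
    sample = client.scan(TableName=table_name, Limit=1)
    return {
        "status": description["TableStatus"],
        "approx_item_count": description.get("ItemCount", 0),
        "sample_items": sample.get("Items", []),
    }


def main():
    import boto3  # assumed to be installed (pip install boto3)

    # Placeholder region: replace with the region your table lives in.
    client = boto3.client("dynamodb", region_name="us-east-1")
    print(check_table_access(client, "test_db"))
```

Call main() from a shell where AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set; an AccessDeniedException or ResourceNotFoundException raised at this point indicates a permissions or table/region problem.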

Upvotes: -1
