Adding unique ID for each row in scala dataframe for multiple insertions

Question

I am trying to add a unique ID to each row of my Scala dataframe so that I can insert the dataframe from a Databricks notebook into a SQL DB.

val df2 = df1.withColumn("unique_ID",monotonicallyIncreasingId)

This works for the first ingestion into the SQL DB, but when I try to ingest new data I get a duplicate key error: "The duplicate key value is..XXXX"

How can I generate a unique key for every SQL ingestion? Thanks.

stefanobaghino · Accepted Answer

Rather than adding the identifier yourself (which I imagine fails because monotonicallyIncreasingId starts from 0 on every run, even if those values are already stored in the database you're saving to), you can probably add an auto-incrementing identifier column to the schema of the table you're saving to. Every RDBMS has its own way of doing this; this page shows how to do it on a selection of SQL database implementations. For example, in MySQL you would add the AUTO_INCREMENT qualifier to a column:

CREATE TABLE Persons (
    Personid int NOT NULL AUTO_INCREMENT,
    LastName varchar(255) NOT NULL,
    FirstName varchar(255),
    Age int,
    PRIMARY KEY (Personid)
);

When saving the dataframe, you would not need to specify the auto-incrementing identifier (i.e. in the example above, your dataframe ought to contain only LastName, FirstName and Age).
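
As a rough sketch of what that save could look like from the notebook, assuming the Persons table above already exists and that the dataframe has LastName, FirstName and Age columns: the object name, method name, JDBC URL and credentials below are placeholders I made up for illustration, not something taken from the question.

import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}

object PersonsWriter {
  // Appends rows to the target table without sending an ID column,
  // so the database assigns Personid through AUTO_INCREMENT.
  def appendWithoutId(df: DataFrame, jdbcUrl: String, props: Properties): Unit = {
    // Keep only the columns the database does not generate itself.
    val toInsert = df.select("LastName", "FirstName", "Age")

    toInsert.write
      .mode(SaveMode.Append) // add to the existing rows instead of overwriting
      .jdbc(jdbcUrl, "Persons", props)
  }
}

// Hypothetical connection details; replace with your own.
val props = new Properties()
props.setProperty("user", "<user>")
props.setProperty("password", "<password>")

PersonsWriter.appendWithoutId(df1, "jdbc:mysql://<host>:3306/<database>", props)

Because the identifier is generated on the database side, repeated ingestions should not collide with keys from earlier runs.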
