Valter Silva

Reputation: 16656

How to programmatically transfer a lot of data between tables?

I have two tables: in the first one I have 14 million rows and in the second one 1.5 million.

So I wonder how I could transfer this data into new tables to normalize it. And how do I convert one type to another? For example, I have a field called 'year', but its type is VARCHAR and I want it to be an INT instead. How do I do that?

I thought about doing this with JDBC in a while loop from Java, but I don't think that would be efficient.

-- 1.5 million rows
CREATE TABLE dbo.directorsmovies
    (
    movieid    INT NULL,
    directorid INT NULL,
    dname      VARCHAR (500) NULL,
    addition   VARCHAR (1000) NULL
    )

-- 14 million rows
CREATE TABLE dbo.movies
    (
    movieid      VARCHAR (20) NULL,
    title        VARCHAR (400) NULL,
    mvyear       VARCHAR (100) NULL,
    actorid      VARCHAR (20) NULL,
    actorname    VARCHAR (250) NULL,
    sex          CHAR (1) NULL,
    as_character VARCHAR (1500) NULL,
    languages    VARCHAR (1500) NULL,
    genres       VARCHAR (100) NULL
    )

And these are my new tables:

DROP TABLE actor
CREATE TABLE actor (
    id INT PRIMARY KEY IDENTITY,
    name VARCHAR(200) NOT NULL, 
    sex VARCHAR(1) NOT NULL
)

DROP TABLE actor_character
CREATE TABLE actor_character(
    id INT PRIMARY KEY IDENTITY,
    character VARCHAR(100)
)

DROP TABLE director
CREATE TABLE director(
    id INT PRIMARY KEY IDENTITY,
    name VARCHAR(200) NOT NULL,
    addition VARCHAR(150)
)


DROP TABLE movie
CREATE TABLE movie(
    id INT PRIMARY KEY IDENTITY,
    title VARCHAR(200) NOT NULL,
    year INT
)


DROP TABLE language
CREATE TABLE language(
    id INT PRIMARY KEY IDENTITY,
    language VARCHAR (100) NOT NULL
)

DROP TABLE genre
CREATE TABLE genre(
    id INT PRIMARY KEY IDENTITY,
    genre VARCHAR(100) NOT NULL
)

DROP TABLE director_movie
CREATE TABLE director_movie(
    idDirector INT,
    idMovie INT,
    CONSTRAINT fk_director_movie_1 FOREIGN KEY (idDirector) REFERENCES director(id),
    CONSTRAINT fk_director_movie_2 FOREIGN KEY (idMovie) REFERENCES movie(id),
    CONSTRAINT pk_director_movie PRIMARY KEY(idDirector,idMovie)
)

DROP TABLE genre_movie
CREATE TABLE genre_movie(
    idGenre INT,
    idMovie INT,
    CONSTRAINT fk_genre_movie_1 FOREIGN KEY (idMovie) REFERENCES movie(id),
    CONSTRAINT fk_genre_movie_2 FOREIGN KEY (idGenre) REFERENCES genre(id),
    CONSTRAINT pk_genre_movie PRIMARY KEY (idMovie, idGenre)
)

DROP TABLE language_movie
CREATE TABLE language_movie(
    idLanguage INT,
    idMovie INT,
    CONSTRAINT fk_language_movie_1 FOREIGN KEY (idLanguage) REFERENCES language(id),
    CONSTRAINT fk_language_movie_2 FOREIGN KEY (idMovie) REFERENCES movie(id),
    CONSTRAINT pk_language_movie PRIMARY KEY (idLanguage, idMovie)  
)

DROP TABLE movie_actor
CREATE TABLE movie_actor(
    idMovie INT,
    idActor INT,
    CONSTRAINT fk_movie_actor_1 FOREIGN KEY (idMovie) REFERENCES movie(id),
    CONSTRAINT fk_movie_actor_2 FOREIGN KEY (idActor) REFERENCES actor(id),
    CONSTRAINT pk_movie_actor PRIMARY KEY (idMovie,idActor)
)

UPDATE: I'm using SQL Server 2008. Sorry guys, I forgot to mention that these are different databases:

The non-normalized one is called disciplinedb and my normalized one is called imdb.

Best regards, Valter Henrique.

Upvotes: 3

Views: 1151

Answers (3)

Ben Thul

Reputation: 32667

I just recently did this for ~150 GB of data. I used a pair of MERGE statements for each table. The first MERGE statement said "if it's not in the destination table, copy it there", and the second said "if it's in the destination table, delete it from the source". I put both in a WHILE loop and only processed 10,000 rows per operation. Keeping it on the server (and not transferring it through a client) is going to be a huge boon for performance. Give it a shot!
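A minimal sketch of that loop in T-SQL, assuming both databases live on the same instance; the table and column names (disciplinedb.dbo.src, imdb.dbo.dst, id, payload) are hypothetical placeholders, not from the post:

DECLARE @batch INT = 10000;
DECLARE @copied INT = 1;

WHILE @copied > 0
BEGIN
    -- 1) Copy up to @batch rows that are not yet in the destination.
    MERGE imdb.dbo.dst AS d
    USING (SELECT TOP (@batch) s.id, s.payload
           FROM disciplinedb.dbo.src AS s
           WHERE NOT EXISTS (SELECT 1 FROM imdb.dbo.dst AS d2
                             WHERE d2.id = s.id)) AS s
        ON d.id = s.id
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (id, payload) VALUES (s.id, s.payload);

    SET @copied = @@ROWCOUNT;

    -- 2) Delete rows from the source once they exist in the destination.
    MERGE TOP (@batch) disciplinedb.dbo.src AS s
    USING imdb.dbo.dst AS d
        ON s.id = d.id
    WHEN MATCHED THEN DELETE;
END

Batching keeps each transaction (and the log growth) small, which matters at this row count.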

Upvotes: 1

Laas

Reputation: 6058

Though my tables were dwarfs compared to yours, I once got past this kind of problem with stored procedures. Below is a simplified (and untested) essence of my MySQL script, but something similar should work on all major SQL databases.

First you should just add a new integer year column (int_year in the example) and then iterate over all rows using the procedure that follows.
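Adding the column might look like this (a sketch; table1 is a hypothetical table name, matching the one used in the procedure):

ALTER TABLE table1 ADD COLUMN int_year INT NULL;

The conversion procedure itself: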

DROP PROCEDURE IF EXISTS move_data;

DELIMITER $$
CREATE PROCEDURE move_data()
BEGIN
  DECLARE done INT DEFAULT 0;
  DECLARE orig_id INT DEFAULT 0;
  DECLARE orig_year VARCHAR(100) DEFAULT '';
  -- Cursor over every row that needs conversion.
  DECLARE cur1 CURSOR FOR SELECT id, year FROM table1;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

  OPEN cur1;

  -- Prepare the UPDATE once; MySQL coerces the string value into the INT column.
  PREPARE stmt FROM 'UPDATE table1 SET int_year = ? WHERE id = ?';

  read_loop: LOOP
    FETCH cur1 INTO orig_id, orig_year;
    IF done THEN
      LEAVE read_loop;
    END IF;

    -- EXECUTE ... USING only accepts session variables.
    SET @year = orig_year;
    SET @id = orig_id;

    EXECUTE stmt USING @year, @id;
  END LOOP;
  CLOSE cur1;

  DEALLOCATE PREPARE stmt;
END $$
DELIMITER ;

And to start the procedure, just CALL move_data().

The above SQL has two major ideas to speed it up:

  1. Use a CURSOR to iterate over a large table
  2. Use a PREPARED statement to quickly execute a pre-known command

PS: In my case this sped things up from ages to seconds, though for you it may still take a considerable amount of time. So it would probably be best to execute it from the command line, not from a web interface (e.g. phpMyAdmin).

Upvotes: 1

wallyk

Reputation: 57764

If both tables are in the same database, then the most efficient transfer is to do it all within the database, preferably by sending a SQL statement to be executed there.

Any movement of data from the database server to somewhere else and then back to the server is to be avoided unless there is a reason it can only be transformed off-server. If the destination is a different server, then this is much less of an issue.
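For the setup in the question (two databases on what appears to be the same SQL Server instance), a cross-database INSERT ... SELECT keeps the whole transfer on the server and does the VARCHAR-to-INT year conversion in one pass. A minimal sketch for the movie table only; the digits-only guard against bad year values is an assumption about the data, and DISTINCT is needed because disciplinedb.dbo.movies holds one row per actor per movie:

INSERT INTO imdb.dbo.movie (title, year)
SELECT DISTINCT
       m.title,
       CASE WHEN m.mvyear NOT LIKE '%[^0-9]%'   -- digits only
             AND LEN(m.mvyear) BETWEEN 1 AND 4
            THEN CAST(m.mvyear AS INT)          -- VARCHAR -> INT
            ELSE NULL
       END
FROM disciplinedb.dbo.movies AS m;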

Upvotes: 2
