Reputation: 4110

Remove duplicates in a large MySQL table

I have a question about MySQL. I have a table with 7,479,194 records, and some of them are duplicated. I would like to do this:

insert into new_table 
  select * 
    from old_table 
group by old_table.a, old_table.b

so that the duplicated entries are removed... but the problem is that this is a large amount of data. The table is MyISAM.

Here is some example data. I would like to group it by city and short_ccode:

id          city      post_code        short_ccode
----------------------------------------------------
4732875     Celje     3502             si
4733306     Celje     3502             si
4734250     Celje     3502             si

I suppose I have to modify my.ini to give the GROUP BY statement more memory... which settings are responsible for that?

I have a machine with 3 GB of RAM and a 2 GHz processor.

My ini file:

# aaaMySQL Server Instance Configuration File
# ----------------------------------------------------------------------
# Generated by the MySQL Server Instance Configuration Wizard
#
#
# Installation Instructions
# ----------------------------------------------------------------------
#
# On Linux you can copy this file to /etc/my.cnf to set global options,
# mysql-data-dir/my.cnf to set server-specific options
# (@localstatedir@ for this installation) or to
# ~/.my.cnf to set user-specific options.
#
# On Windows you should keep this file in the installation directory 
# of your server (e.g. C:\Program Files\MySQL\MySQL Server 4.1). To
# make sure the server reads the config file use the startup option 
# "--defaults-file". 
#
# To run run the server from the command line, execute this in a 
# command line shell, e.g.
# mysqld --defaults-file="C:\Program Files\MySQL\MySQL Server 4.1\my.ini"
#
# To install the server as a Windows service manually, execute this in a 
# command line shell, e.g.
# mysqld --install MySQL41 --defaults-file="C:\Program Files\MySQL\MySQL Server 4.1\my.ini"
#
# And then execute this in a command line shell to start the server, e.g.
# net start MySQL41
#
#
# Guildlines for editing this file
# ----------------------------------------------------------------------
#
# In this file, you can use all long options that the program supports.
# If you want to know the options a program supports, start the program
# with the "--help" option.
#
# More detailed information about the individual options can also be
# found in the manual.
#
#
# CLIENT SECTION
# ----------------------------------------------------------------------
#
# The following options will be read by MySQL client applications.
# Note that only client applications shipped by MySQL are guaranteed
# to read this section. If you want your own MySQL client program to
# honor these values, you need to specify it as an option during the
# MySQL client library initialization.
#
[client]

port=3306


# SERVER SECTION
# ----------------------------------------------------------------------
#
# The following options will be read by the MySQL Server. Make sure that
# you have installed the server correctly (see above) so it reads this 
# file.
#
[wampmysqld]

# The TCP/IP Port the MySQL Server will listen on
port=3306


#Path to installation directory. All paths are usually resolved relative to this.
basedir=d:/wamp/bin/mysql/mysql5.0.45

#log file
log-error=d:/wamp/logs/mysql.log

#Path to the database root
datadir=d:/wamp/bin/mysql/mysql5.0.45/data

# The default character set that will be used when a new schema or table is
# created and no character set is defined
default-character-set=utf8

# The default storage engine that will be used when create new tables when
default-storage-engine=MyISAM

# The maximum amount of concurrent sessions the MySQL server will
# allow. One of these connections will be reserved for a user with
# SUPER privileges to allow the administrator to login even if the
# connection limit has been reached.
max_connections=1000

# Query cache is used to cache SELECT results and later return them
# without actual executing the same query once again. Having the query
# cache enabled may result in significant speed improvements, if your
# have a lot of identical queries and rarely changing tables. See the
# "Qcache_lowmem_prunes" status variable to check if the current value
# is high enough for your load.
# Note: In case your tables change very often or if your queries are
# textually different every time, the query cache may result in a
# slowdown instead of a performance improvement.
query_cache_size=16M

# The number of open tables for all threads. Increasing this value
# increases the number of file descriptors that mysqld requires.
# Therefore you have to make sure to set the amount of open files
# allowed to at least 4096 in the variable "open-files-limit" in
# section [mysqld_safe]
table_cache=500

# Maximum size for internal (in-memory) temporary tables. If a table
# grows larger than this value, it is automatically converted to disk
# based table This limitation is for a single table. There can be many
# of them.
tmp_table_size=32M


# How many threads we should keep in a cache for reuse. When a client
# disconnects, the client's threads are put in the cache if there aren't
# more than thread_cache_size threads from before.  This greatly reduces
# the amount of thread creations needed if you have a lot of new
# connections. (Normally this doesn't give a notable performance
# improvement if you have a good thread implementation.)
thread_cache_size=12

#*** MyISAM Specific options

# The maximum size of the temporary file MySQL is allowed to use while
# recreating the index (during REPAIR, ALTER TABLE or LOAD DATA INFILE.
# If the file-size would be bigger than this, the index will be created
# through the key cache (which is slower).
myisam_max_sort_file_size=100G

# If the temporary file used for fast index creation would be bigger
# than using the key cache by the amount specified here, then prefer the
# key cache method.  This is mainly used to force long character keys in
# large tables to use the slower key cache method to create the index.
myisam_max_extra_sort_file_size=100G

# If the temporary file used for fast index creation would be bigger
# than using the key cache by the amount specified here, then prefer the
# key cache method.  This is mainly used to force long character keys in
# large tables to use the slower key cache method to create the index.
myisam_sort_buffer_size=32M

# Size of the Key Buffer, used to cache index blocks for MyISAM tables.
# Do not set it larger than 30% of your available memory, as some memory
# is also required by the OS to cache rows. Even if you're not using
# MyISAM tables, you should still set it to 8-64M as it will also be
# used for internal temporary disk tables.
key_buffer_size=64M

# Size of the buffer used for doing full table scans of MyISAM tables.
# Allocated per thread, if a full scan is needed.
read_buffer_size=2M
read_rnd_buffer_size=8MK

# This buffer is allocated when MySQL needs to rebuild the index in
# REPAIR, OPTIMZE, ALTER table statements as well as in LOAD DATA INFILE
# into an empty table. It is allocated per thread so be careful with
# large settings.
sort_buffer_size=256M


#*** INNODB Specific options ***


# Use this option if you have a MySQL server with InnoDB support enabled
# but you do not plan to use it. This will save memory and disk space
# and speed up some things.
#skip-innodb

# Additional memory pool that is used by InnoDB to store metadata
# information.  If InnoDB requires more memory for this purpose it will
# start to allocate it from the OS.  As this is fast enough on most
# recent operating systems, you normally do not need to change this
# value. SHOW INNODB STATUS will display the current amount used.
innodb_additional_mem_pool_size=20M

# If set to 1, InnoDB will flush (fsync) the transaction logs to the
# disk at each commit, which offers full ACID behavior. If you are
# willing to compromise this safety, and you are running small
# transactions, you may set this to 0 or 2 to reduce disk I/O to the
# logs. Value 0 means that the log is only written to the log file and
# the log file flushed to disk approximately once per second. Value 2
# means the log is written to the log file at each commit, but the log
# file is only flushed to disk approximately once per second.
innodb_flush_log_at_trx_commit=1

# The size of the buffer InnoDB uses for buffering log data. As soon as
# it is full, InnoDB will have to flush it to disk. As it is flushed
# once per second anyway, it does not make sense to have it very large
# (even with long transactions).
innodb_log_buffer_size=8M

# InnoDB, unlike MyISAM, uses a buffer pool to cache both indexes and
# row data. The bigger you set this the less disk I/O is needed to
# access data in tables. On a dedicated database server you may set this
# parameter up to 80% of the machine physical memory size. Do not set it
# too large, though, because competition of the physical memory may
# cause paging in the operating system.  Note that on 32bit systems you
# might be limited to 2-3.5G of user level memory per process, so do not
# set it too high.
innodb_buffer_pool_size=512M

# Size of each log file in a log group. You should set the combined size
# of log files to about 25%-100% of your buffer pool size to avoid
# unneeded buffer pool flush activity on log file overwrite. However,
# note that a larger logfile size will increase the time needed for the
# recovery process.
innodb_log_file_size=10M

# Number of threads allowed inside the InnoDB kernel. The optimal value
# depends highly on the application, hardware as well as the OS
# scheduler properties. A too high value may lead to thread thrashing.
innodb_thread_concurrency=8



[mysqld]
port=3306

Upvotes: 3

Answers (6)

Vladimir Kroz

Reputation: 5367

In my experience, once a table grows to millions of records or more, the most effective way to handle duplicates is to: 1) export the data to text files, 2) sort the file, 3) remove the duplicates in the file, 4) load the data back into the database.

As the data grows, this approach eventually becomes faster than any SQL query you might come up with.
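
Steps 2 and 3 could be sketched roughly as follows in Python. This assumes the table was exported as tab-separated text with the columns in the order id, city, post_code, short_ccode. The function name dedupe_lines and the default key positions (city and short_ccode) are illustrative assumptions, not part of the original answer. It sorts in memory, so a file too large for RAM would need an external sort instead.

```python
import itertools


def dedupe_lines(lines, key_columns=(1, 3), sep="\t"):
    """Sort rows by the key columns and keep the first row of each group.

    key_columns holds 0-based positions; (1, 3) is an assumed layout
    meaning city and short_ccode.
    """
    def key(line):
        fields = line.rstrip("\n").split(sep)
        return tuple(fields[i] for i in key_columns)

    # sorted() is stable, so rows with equal keys keep their input order.
    ordered = sorted(lines, key=key)
    # groupby yields one group per distinct key; keep only its first row.
    return [next(group) for _, group in itertools.groupby(ordered, key=key)]
```

Passing the lines of the exported file to dedupe_lines and writing the result to a new file would give the input for step 4.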

Upvotes: 0

DrColossos

Reputation: 12998

MySQL has INSERT IGNORE. From the docs:

[...] however, when INSERT IGNORE is used, the insert operation fails silently for the row containing the unmatched value, but any rows that are matched are inserted.

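So you could use your query from above by just adding IGNORE. Note that this only skips duplicates if new_table has a UNIQUE index (or primary key) on the columns that must be unique.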
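
The effect can be tried out with a small Python sketch. It uses the standard sqlite3 module purely as a runnable stand-in: SQLite spells the statement INSERT OR IGNORE, whereas on MySQL it would be INSERT IGNORE. The function name copy_ignoring_duplicates, the column types, and the choice of (city, short_ccode) as the unique key are assumptions for illustration.

```python
import sqlite3


def copy_ignoring_duplicates(conn):
    """Copy old_table into new_table, skipping rows that violate the unique key."""
    conn.execute(
        "CREATE TABLE new_table ("
        " id INTEGER, city TEXT, post_code TEXT, short_ccode TEXT,"
        " UNIQUE (city, short_ccode))"  # assumed key; duplicates are detected here
    )
    # SQLite syntax; the MySQL equivalent would be INSERT IGNORE INTO new_table ...
    conn.execute(
        "INSERT OR IGNORE INTO new_table SELECT * FROM old_table ORDER BY id"
    )
    return conn.execute(
        "SELECT id, city, post_code, short_ccode FROM new_table"
    ).fetchall()
```

Without the UNIQUE constraint on new_table, every row would be copied, since nothing would mark a row as a duplicate.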

Upvotes: 1

nikc.org

Reputation: 16962

A bit dirty maybe, but it has done the trick for me the few times that I've needed it: Remove duplicate entries in MySQL.

Basically, you simply create a unique index consisting of all the columns that you want to be unique in the table.

As always with this kind of procedure, make a backup before proceeding.

Upvotes: 0

baklarz2048

Reputation: 10938

You don't need to group data. Try this:

 DELETE FROM old_table
  USING old_table, old_table AS vtable
  WHERE old_table.id > vtable.id
    AND old_table.city = vtable.city
    AND old_table.post_code = vtable.post_code
    AND old_table.short_ccode = vtable.short_ccode

I can't comment on posts because of my points ... Run REPAIR TABLE old_table; next, show the output of:

EXPLAIN SELECT old_table.id
   FROM old_table, old_table AS vtable
  WHERE old_table.id > vtable.id
    AND old_table.city = vtable.city
    AND old_table.post_code = vtable.post_code
    AND old_table.short_ccode = vtable.short_ccode

Also show: os~> ulimit -a; mysql> SHOW VARIABLES LIKE 'open_files_limit';

Next: remove all OS restrictions from the mysql process.

ulimit -n 1024 etc.

Upvotes: 0

Beel

Reputation: 470

To avoid the memory issue, avoid the big SELECT by using a small external program that follows the logic below. First, back up your database. Then:

loop {
    # find a record
    x = sql: SELECT * FROM table1 LIMIT 1;
    if (x is null)
    then
        exit  # no more data in table1
    fi
    sql: INSERT x INTO table2;

    # find the value of the field that should NOT be duplicated
    a = parse(x for table1.a)
    # delete all such entries from table1
    sql: DELETE FROM table1 WHERE a = '$a';
}
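
A minimal Python sketch of that loop is shown here. It uses the standard sqlite3 module only so that it runs without a server; against MySQL the same statements would go through a MySQL driver instead. The two-column layout (a, b) of table1 and table2 and the function name move_unique_rows are assumptions for illustration.

```python
import sqlite3


def move_unique_rows(conn):
    """Move one row per distinct value of column a from table1 to table2."""
    moved = 0
    while True:
        # Fetch any remaining row; None means table1 is empty.
        row = conn.execute("SELECT a, b FROM table1 LIMIT 1").fetchone()
        if row is None:
            break
        conn.execute("INSERT INTO table2 (a, b) VALUES (?, ?)", row)
        # Delete every row sharing this value of a, including the one just copied.
        conn.execute("DELETE FROM table1 WHERE a = ?", (row[0],))
        conn.commit()
        moved += 1
    return moved
```

Each iteration touches only the rows for one value of a, so no large result set is held in memory. Rows where a is NULL would need separate handling, because the comparison a = NULL never matches and the loop would not make progress on them.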

Upvotes: 0

OMG Ponies

Reputation: 332591

This will populate NEW_TABLE with unique values, where the id value is the lowest id of each group:

INSERT INTO NEW_TABLE
  SELECT MIN(ot.id),
         ot.city,
         ot.post_code,
         ot.short_ccode
    FROM OLD_TABLE ot
GROUP BY ot.city, ot.post_code, ot.short_ccode

If you want the highest id value per group:

INSERT INTO NEW_TABLE
  SELECT MAX(ot.id),
         ot.city,
         ot.post_code,
         ot.short_ccode
    FROM OLD_TABLE ot
GROUP BY ot.city, ot.post_code, ot.short_ccode

Upvotes: 6
