pt-table-sync,有效地同步MySQL表数据
在大量的使用案例证明,该工具已经成熟了(作者是这么承诺的),而且经过了良好的测试。但使用不当依旧会有问题。
相关文章:
博客: http://www.drdobbs.com/
PostgreSQL表内容对照工具: https://www.cri.ensmp.fr/people/coelho/pg_comparator/
远程MySQL表对照: http://www.perlmonks.org/?node_id=381053
远程MySQL表对照技术解释: http://cri.ensmp.fr/classement/doc/A-375.pdf
MySQL数据比较算法: http://tinyurl.com/mysql-data-diff-algorithm
webyog中的SQLyog Job Agent(SJA,表同步工具): https://www.webyog.com/
SJA和pt-table-sync的比较: http://tinyurl.com/maatkit-vs-sqlyog
使用前要注意的问题及风险。由于pt-table-sync会修改数据,所以使用的使用要小心。注意以下几点:
- 完整阅读了工具手册,你要清楚自己在干嘛。
- 熟知工具的BUGS,手册里都有BUG的描述。
- 一定要在非生产环境进行了测试。
- 备份生产数据,并且验证备份数据。
- 一定要先运行”–dry-run” 和 “–print”来测试同步。
命令语法
pt-table-sync [OPTIONS] DSN [DSN]
synchronizing a server that is a replication slave with the “–replicate” or “–sync-to-master”
methods, it always makes the changes on the replication master, never the replication slave directly.
This is in general the only safe way to bring a replica back in sync with its master; changes to the
replica are usually the source of the problems in the first place. However, the changes it makes on
the master should be no-op changes that set the data to their current values, and actually affect only
the replica.
–sync-to-master slave1
suspected to be incorrect:
make changes directly on master2, which will then flow through replication and change master1’s data:
pt-table-sync –execute h=master1,D=db,t=tbl master2
DESCRIPTION
table structures, indexes, or any other schema objects. The following describes one-way
synchronization. “BIDIRECTIONAL SYNCING” is described later.
should understand three things: the purpose of “–replicate”, finding differences, and specifying
hosts. These three concepts are closely related and determine how the tool will run. The following is
the abbreviated logic:
The DSN is a slave. Connect to its master and sync.
of differences, and fix.
find records of differences, and fix.
filter with –databases etc, and sync each table to the master.
DSN to the first.
“–replicate” which causes pt-table-sync to automatically find differences efficiently with one of
several algorithms (see “ALGORITHMS”). Alternatively, the value of “–replicate”, if specified, causes
pt-table-sync to use the differences already found by having previously ran pt-table-checksum with its
own “–replicate” option. Strictly speaking, you don’t need to use “–replicate” because pt-table-sync
can find differences, but many people use “–replicate” if, for example, they checksum regularly using
pt-table-checksum then fix differences as needed with pt-table-sync. If you’re unsure, read each
tool’s documentation carefully and decide for yourself, or consult with an expert.
two ways: with “–sync-to-master” or without. Specifying “–sync-to-master” makes pt-table-sync expect
one and only slave DSN on the command line. The tool will automatically discover the slave’s master
and sync it so that its data is the same as its master. This is accomplished by making changes on the
master which then flow through replication and update the slave to resolve its differences. Be careful
though: although this option specifies and syncs a single slave, if there are other slaves on the same
master, they will receive via replication the changes intended for the slave that you’re trying to
sync.
source host. There is only ever one source host. If you do not also specify “–replicate”, then you
must specify at least one other DSN as the destination host. There can be one or more destination
hosts. Source and destination hosts must be independent; they cannot be in the same replication
topology. pt-table-sync will die with an error if it detects that a destination host is a slave
because changes are written directly to destination hosts (and it’s not safe to write directly to
slaves). Or, if you specify “–replicate” (but not “–sync-to-master”) then pt-table-sync expects one
and only one master DSN on the command line. The tool will automatically discover all the master’s
slaves and sync them to the master. This is the only way to sync several (all) slaves at once (because
“–sync-to-master” only specifies one slave).
“–sync-to-master”) provides default values for other DSNs, whether those other DSNs are specified on
the command line or auto-discovered by the tool. So in this example,
to see how pt-table-sync will interpret the DSNs given on the command line.
LIMITATIONS
“–replicate” option. Therefore it will set “binlog_format=STATEMENT” on the master for its
session if required. To do this user must have “SUPER” privilege.
OUTPUT
There is one row per table. Each server is printed separately. For example,
# DELETE REPLACE INSERT UPDATE ALGORITHM START END EXIT DATABASE.TABLE
# 0 0 3 0 Chunk 13:00:00 13:00:17 2 test.test1
(see “ALGORITHMS”). The sync operation for this table started at 13:00:00 and ended 17 seconds later
(times taken from “NOW()” on the source host). Because differences were found, its “EXIT STATUS” was
2.
synchronize the table if “–execute” is also specified.
etc., then specify “–print” once and “–verbose” twice. Be careful though: this can print a lot of
SQL statements.
differences without violating some unique key. For example, suppose there’s a primary key on column a
and a unique key on column b. Then there is no way to sync these two tables with straightforward
UPDATE statements:
| a | b | | a | b |
+—+—+ +—+—+
| 1 | 2 | | 1 | 1 |
| 2 | 1 | | 2 | 2 |
+—+—+ +—+—+
the first index violation, so you don’t have to worry about it.
inherently tricky, and it’s easy to make mistakes. You need to be sure you’re using the tool correctly
for master-master replication. See the “SYNOPSIS” for the overview of the correct usage.
definitions because these might cause unintended changes on the child tables. See
“–[no]check-child-tables”.
can synchronize data in tables lacking a primary key or unique index, it might be best to synchronize
that data by another means.
REPLICATION SAFETY
all sorts of issues to think about, such as other processes changing data, trying to change data on the
slave, whether the destination and source are a master-master pair, and much more.
replication to the slave like any other changes. However, this works only if it’s possible to REPLACE
into the table on the master. REPLACE works only if there’s a unique index on the table (otherwise it
just acts like an ordinary INSERT).
sync a slave to its master. This will generally do the right thing. When there is no unique key on
the table, there is no choice but to change the data on the slave, and pt-table-sync will detect that
you’re trying to do so. It will complain and die unless you specify “–no-check-slave” (see
“–[no]check-slave”).
data on the destination server. Therefore, you need to specify “–no-bin-log” for safety (see
“–[no]bin-log”). If you don’t, the changes you make on the destination server will replicate back to
the source server and change the data there!
don’t change the data on the destination server. You will also need to specify “–no-check-slave” to
keep pt-table-sync from complaining that it is changing data on a slave.
ALGORITHMS
The tool automatically chooses the best algorithm for each table based on indexes, column types, and
the algorithm preferences specified by “–algorithms”. The following algorithms are available, listed
in their default order of preference:
column’s range of values into chunks of approximately “–chunk-size” rows. Syncs a chunk at a time
by checksumming the entire chunk. If the chunk differs on the source and destination, checksums
each chunk’s rows individually to find the rows that differ.
right size.
consumption. If a chunk’s rows must be examined, only the primary key columns and a checksum are
sent over the network, not the entire row. If a row is found to be different, the entire row will
be fetched, but not before.
the same character. In that case, the tool will exit and suggest picking a different algorithm.
backtracking algorithm (see pt-archiver for more on this algorithm). It is very similar to
“Chunk”, but instead of pre-calculating the boundaries of each piece of the table based on index
cardinality, it uses “LIMIT” to define each nibble’s upper limit, and the previous nibble’s upper
limit to define the lower limit.
the next query checksums the entire nibble. If the nibble differs between the source and
destination, it examines the nibble row-by-row, just as “Chunk” does.
columns, and if they’re the same, compares the COUNT(*) column’s value to determine how many rows
to insert or delete into the destination. Works on tables with no primary key or unique index.
less efficient than the other algorithms, but works when there is no suitable index for them to
use.
versions of this tool), DrillDown (what I originally called top-down), and GroupByPrefix (similar
to how SqlYOG Job Agent works). Each algorithm has strengths and weaknesses. If you’d like to
implement your favorite technique for finding differences between two sources of data on possibly
different servers, I’m willing to help. The algorithms adhere to a simple interface that makes it
pretty easy to write your own.
BIDIRECTIONAL SYNCING
strict limitations:
* does not work in any way with replication
* requires that the table(s) are chunkable with the Chunk algorithm
* is not N-way, only bidirectional between two servers at a time
* does not handle DELETE changes
the other servers (viz. r1 and r2 are not slaves to c1). r1 and r2 are remote servers. Rows in table
foo are updated and inserted on all three servers and we want to synchronize all the changes between
all the servers. Table foo has columns:
ts timestamp auto updated
name varchar
(id) values. In general, newer rows, as determined by the ts column, take precedence when a same but
differing row is found during the bidirectional sync. “Same but differing” means that two rows have
the same primary key (id) value but different values for some other column, like the name column in
this example. Same but differing conflicts are resolved by a “conflict”. A conflict compares some
column of the competing rows to determine a “winner”. The winning row becomes the source and its
values are used to update the other row.
should be familiar with: chunk column (“–chunk-column”), comparison column(s) (“–columns”), and
conflict column (“–conflict-column”). The chunk column is only used to chunk the table; e.g. “WHERE
id >= 5 AND id < 10″. Chunks are checksummed and when chunk checksums reveal a difference, the tool
selects the rows in that chunk and checksums the “–columns” for each row. If a column checksum
differs, the rows have one or more conflicting column values. In a traditional unidirectional sync,
the conflict is a moot point because it can be resolved simply by updating the entire destination row
with the source row’s values. In a bidirectional sync, however, the “–conflict-column” (in accordance
with other “–conflict-*” options list below) is compared to determine which row is “correct” or
“authoritative”; this row becomes the “source”.
and r1, then syncs c1 and r2 including any changes from r1. At this point c1 and r2 are completely in
sync, but r1 is missing any changes from r2 because c1 didn’t have these changes when it and r1 were
synced. So a second run is needed which syncs the servers in the same order, but this time when c1 and
r1 are synced r1 gets r2’s changes.
each subsequent DSN in turn. So the tool in this example would be ran twice like:
You must specify other options that tell pt-table-sync how to resolve conflicts for same but differing
rows. These options are:
* –conflict-comparison
* –conflict-value
* –conflict-threshold
* –conflict-error”> (optional)
saying on which host the statement would be executed if you used “–execute”.
“right” server. Since either server can become the source or destination it’s confusing to think of
them as “src” and “dst”. Therefore, they’re generically referred to as left and right. It’s easy to
remember this because the first DSN is always to the left of the other server DSNs on the command line.
EXIT STATUS
finishes and exits.
0 Success.
1 Internal error.
2 At least one table differed on the destination.
3 Combination of 1 and 2.
OPTIONS
details.
the order that they’re given. The first algorithm that can sync the table is used. See
“ALGORITHMS”.
execute the queries and place them in a temporary table internally before sending the results back
to pt-table-sync. The advantage of this strategy is that pt-table-sync can fetch rows as desired
without using a lot of memory inside the Perl process, while releasing locks on the MySQL table (to
reduce contention with other queries). The disadvantage is that it uses more memory on the MySQL
server instead.
table and then fetching it all into Perl’s memory is probably a silly thing to do. This option is
most useful for the GroupBy and Stream algorithms, which may fetch a lot of data from the server.
until the tool fetches them. This allows the tool to use less memory but may keep the rows locked
on the server longer.
which causes MySQL to send all selected rows to the tool at once. This may result in the results
“cursor” being held open for a shorter time on the server, but if the tables are large, it could
take a long time anyway, and use all your memory.
mysql_enable_utf8 option to DBD::mysql, and runs SET NAMES UTF8 after connecting to MySQL. Any
other value sets binmode on STDOUT without the utf8 layer, and runs SET NAMES after connecting to
MySQL.
“–sync-to-master” is specified, the tool may sync tables using “REPLACE” statements. If a table
being synced has child tables with “ON DELETE CASCADE”, “ON UPDATE CASCADE”, or “ON UPDATE SET
NULL”, the tool prints an error and skips the table because “REPLACE” becomes “DELETE” then
“INSERT”, so the “DELETE” will cascade to the child table and delete its rows. In the worst case,
this can delete all rows in child tables!
tables, also specify “–no-foreign-key-checks” so MySQL will not cascade any operations from the
parent to child tables.
“–sync-to-master” is specified. “–print” does not check child tables.
CASCADE”, or “ON UPDATE SET NULL” foreign key constraint. There could be other affected child
tables.
sometimes you have to; “–replace” won’t work unless there’s a unique index, for example, so you
can’t make changes on the master in that scenario. By default pt-table-sync will complain if you
try to change data on a slave. Specify “–no-check-slave” to disable this check. Use it at your
own risk.
triggers will not be checked.
number of rows, or a data size. Data sizes are specified with a suffix of k=kibibytes,
M=mebibytes, G=gibibytes. Data sizes are converted to a number of rows by dividing by the average
row length.
command line.
to “–conflict-comparison”, “–conflict-value” and “–conflict-threshold” to determine which row
has the correct data and becomes the source. The column can be any type for which there is an
appropriate “–conflict-comparison” (this is almost all types except, for example, blobs).
Possible comparisons are one of these MAGIC_comparisons:
newest Newest temporal –conflict-column value
oldest Oldest temporal –conflict-column value
greatest Greatest numerical “–conflict-column value
least Least numerical –conflict-column value
equals –conflict-column value equal to –conflict-value
matches –conflict-column value matching Perl regex pattern
of error. Possible values are:
* die: Die, stop syncing, and print a warning to STDERR
between the two “–conflict-column” values is less than this amount. For example, if two
“–conflict-column” have timestamp values “2009-12-01 12:00:00” and “2009-12-01 12:05:00” the
difference is 5 minutes. If “–conflict-threshold” is set to “5m” the conflict will be resolved,
but if “–conflict-threshold” is set to “6m” the conflict will fail to resolve because the
difference is not greater than or equal to 6 minutes. In this latter case, “–conflict-error” will
report the failure.
or different server. This is not yet possible. “–databases” will not do it, and you can’t do it
with the D part of the DSN either because in the absence of a table name it assumes the whole
server should be synced and the D part controls only the connection’s default database.
you’ll see from actually running the tool, but there will be zeros for rows affected. This is
because the tool actually executes, but stops before it compares any data and just returns zeros.
The zeros do not mean there are no changes to be made.
created to resolve table differences. Therefore, the tables will be changed! And unless you also
specify “–verbose”, the changes will be made silently. If this is not what you want, see
“–print” or “–dry-run”.
options, and exit.
be rounded to the specified number of digits after the decimal point, with the ROUND() function in
MySQL. This can help avoid checksum mismatches due to different floating-point representations of
the same values on different MySQL versions and hardware. The default is no rounding; the values
are converted to strings by the CONCAT() function, and MySQL chooses the string representation. If
you specify a value of 2, for example, then the values 1.008 and 1.009 will be rounded to 1.01, and
will checksum as equal.
“FNV_64” user-defined function, “pt-table-sync” will detect it and prefer to use it, because it is
much faster than the built-ins. You can also use MURMUR_HASH if you’ve installed that user-defined
function. Both of these are distributed with Maatkit. See pt-table-checksum for more information
and benchmarks.
with “–print” and executed by “–execute”), binary columns are wrapped in HEX() so the binary data
does not produce an invalid SQL statement. You can disable this option but you probably shouldn’t.
tables, all columns in that row will be synced, regardless. (It is not currently possible to
exclude columns from the sync process itself, only from the comparison.)
using the index chosen by the sync algorithm or specified by “–chunk-index”. This is usually a
good thing, but in rare cases the index may not be the best for the query so you can suppress the
index hint by specifying “–no-index-hint” and let MySQL choose the index.
that “pt-table-sync” uses to select and compare rows.
The possible values are as follows:
===== =======================================================
0 Never lock tables.
1 Lock and unlock one time per sync cycle (as implemented
level of locking available. For example, the Chunk
algorithm will lock each chunk of C<N> rows, and then
unlock them if they are the same on the source and the
destination, before moving on to the next chunk.
3 Lock and unlock once for every server (DSN) synced, with
theory locking the table on the master should prevent any changes from taking place. (You are not
changing data on your slave, right?) If “–wait” is given, the master (source) is locked and then
the tool waits for the slave to catch up to the master before continuing.
implemented by beginning and committing transactions. The exception is if “–lock” is 3.
“–[no]transaction”.
ALTER TABLE, once the tables are reasonably in sync with each other (which you may choose to
accomplish via any number of means, including dump and reload or even something like pt-archiver).
It requires exactly two DSNs and assumes they are on the same server, so it does no waiting for
replication or the like. Tables are locked with LOCK TABLES.
backslash: “exam\,ple”
contains is different than the current PID. However, if the PID file exists and the PID it
contains is no longer running, the tool will overwrite the PID file with the current PID. The PID
file is removed automatically when the tool exits.
safe. These queries are valid SQL and you can run them yourself if you want to sync the tables
manually.
processlist SHOW PROCESSLIST
hosts SHOW SLAVE HOSTS
none Do not find slaves
method is required if the server uses a non-standard port (not 3306). Usually pt-table-sync does
the right thing and finds the slaves, but you may give a preferred method and it will be used
first. If it doesn’t find any slaves, the other methods will be tried.
table is exactly the same as the argument of the same name to pt-table-checksum. That is, it
contains records of which tables (and ranges of values) differ between the master and slave.
“pt-table-checksum” will sync that table, with the appropriate “WHERE” clause, to its master.
slave.
connect to the master as usual to sync.
is unable to find any slaves via “SHOW PROCESSLIST”, it will inspect “SHOW SLAVE HOSTS” instead.
You must configure each slave’s “report-host”, “report-port” and other options for this to work
right. After finding slaves, it will inspect the specified table on each slave to find data that
needs to be synced, and sync it.
slave as well. Any table that shows differences there will NOT be synced on the slave(s). For
example, suppose your replication is set up as A->B, B->C, B->D. Suppose you use this argument and
specify server B. The tool will examine server B’s copy of the table. If it looks like server B’s
data in table “test.tbl1” is different from server A’s copy, the tool will not sync that table on
servers C and D.
user with less privileges on the slaves but that user must exist on all slaves.
password for the user must be the same on all slaves.
“–set-vars wait_timeout=500” overrides the defaultvalue of 10000.
master, and treat the master as the source and the slave as the destination. Causes changes to be
made on the master. Sets “–wait” to 60 by default, sets “–lock” to 1 by default, and disables
“–[no]transaction” by default. See also “–replicate”, which changes this option’s behavior.
times out, the default behavior is to abort. This option makes the tool keep going anyway.
Warning: if you are trying to get a consistent comparison between the two servers, you probably
don’t want to keep going after a timeout.
enabled by default, but since “–lock” is disabled by default, it has no effect.
transactional locking (via “LOCK IN SHARE MODE” and “FOR UPDATE”, you must specify “–transaction”
explicitly.
whether to use transactions or table locks. It currently uses transactions on InnoDB tables, and
table locks on all others.
for InnoDB tables) and locking is controlled by “–lock”.
READ” and transactions are started “WITH CONSISTENT SNAPSHOT”.
5.0.
which vary in their handling of trailing spaces. MySQL 5.0 and later all retain trailing spaces in
“VARCHAR”, while previous versions would remove them.
the tool checks the version of other programs on the local system in addition to its own version.
For example, it checks the version of every MySQL server it connects to, Perl, and the Perl module
DBD::mysql. Second, it checks for and warns about versions with known problems. For example,
MySQL 5.5.25 had a critical bug and was re-released as 5.5.25a.
should never interfere with the normal operation of the tool.
value is the number of seconds to wait before timing out (see also “–timeout-ok”). Sets “–lock”
to 1 and “–[no]transaction” to 0 by default. If you see an error such as the following,
run with “–help”.
lagging on tables that are not being synced.
“–chunk-size” is specified. The purpose of the zero chunk is to capture a potentially large
number of zero values that would imbalance the size of the first chunk. For example, if a lot of
negative numbers were inserted into an unsigned integer column causing them to be stored as zeros,
then these zero values are captured by the zero chunk instead of the first chunk and all its non-
zero values.
DSN OPTIONS
case-sensitive, so P and p are not the same option. There cannot be whitespace before or after the “=”
and if the value contains whitespace it must be quoted. DSN options are comma-separated. See the
percona-toolkit manpage for full details.
backslash: “exam\,ple”
ENVIRONMENT
capture all output to a file, run the tool like:
SYSTEM REQUIREMENTS
version of Perl.
BUGS