第15章 InnoDB 存储引擎

目录

15.1 InnoDB介绍
15.1.1 InnoDB 表的好处
15.1.2 InnoDB 表最佳实践
15.1.3 验证InnoDB是默认存储引擎
15.1.4 InnoDB做基准测试
15.2 InnoDB 和ACID
15.3 InnoDB 多版本
15.4 InnoDB 架构
15.4.1 缓冲池
15.4.2 Change Buffer
15.4.3 自适应哈希索引
15.4.4 重做日志缓冲池
15.4.5 系统表空间
15.4.6 Doublewrite缓冲池
15.4.7 Undo 日志
15.4.8 File-Per-Table表空间
15.4.9 普通表空间
15.4.10 Undo 表空间
15.4.11 临时表空间
15.4.12 重做日志
15.5 InnoDB 锁和事务
15.5.1 InnoDB 锁
15.5.2 InnoDB 事务模式
15.5.3 InnoDB不同SQL语句设置锁
15.5.4 幻行
15.5.5 InnoDB中的死锁
15.6 InnoDB配置
15.6.1 InnoDB 启动配置
15.6.2 配置InnoDB为只读操作
15.6.3 InnoDB 缓冲池配置
15.6.4 配置InnoDB Change Buffering
15.6.5 配置InnoDB并发线程
15.6.6 配置InnoDB后台 I/O 线程数
15.6.7 在Linux上使用异步I/O
15.6.8 配置InnoDB Master线程 I/O 速度
15.6.9 配置自旋锁轮询
15.6.10 配置InnoDB清除调度
15.6.11 配置InnoDB优化器统计
15.6.12 配置索引页合并阈值
15.7 InnoDB 表空间
15.7.1 调整InnoDB系统表空间大小
15.7.2 改变InnoDB 重做日志文件大小和个数
15.7.3 系统表空间使用裸磁盘
15.7.4 InnoDB File-Per-Table 表空间
15.7.5 在数据目录外创建File-Per-Table 表空间
15.7.6 复制File-Per-Table 表空间到其他实例
15.7.7 配置 Undo 表空间
15.7.8 截断 Undo 表空间
15.7.9 InnoDB 普通表空间
15.7.10 InnoDB 表空间加密
15.8 InnoDB 表和索引
15.8.1 InnoDB 表
15.8.2 InnoDB 索引
15.9 InnoDB 表和页压缩
15.9.1 InnoDB 表压缩
15.9.2 InnoDB 页压缩
15.10 InnoDB 行存储和行格式
15.10.1 InnoDB 行存储概述
15.10.2 指定表的行格式
15.10.3 DYNAMIC 和 COMPRESSED 格式的行
15.10.4 COMPACT 和 REDUNDANT 格式的行
15.11 InnoDB 磁盘I/O和文件空间管理
15.11.1 InnoDB磁盘 I/O
15.11.2 文件空间管理
15.11.3 InnoDB 检查点
15.11.4 表的磁盘碎片整理
15.11.5 使用截断表回收磁盘空间
15.12 InnoDB 和在线 DDL
15.12.1 在线 DDL 概述
15.12.2 在线 DDL 的性能、并发性和空间需求
15.12.3 在线DDL SQL 语法
15.12.4 使用在线 DDL简化DDL语句
15.12.5 在线DDL实现的细节
15.12.6 在线 DDL 和故障恢复
15.12.7 分区表的在线DDL
15.12.8 在线 DDL 的限制
15.13 InnoDB 启动选项和系统变量
15.14 InnoDB INFORMATION_SCHEMA 表
15.14.1 InnoDB INFORMATION_SCHEMA 表关于压缩
15.14.2 InnoDB INFORMATION_SCHEMA 事务和锁定信息
15.14.3 InnoDB INFORMATION_SCHEMA 系统表
15.14.4 InnoDB INFORMATION_SCHEMA 全文索引表
15.14.5 InnoDB INFORMATION_SCHEMA 缓冲池表
15.14.6 InnoDB INFORMATION_SCHEMA 度量表
15.14.7 InnoDB INFORMATION_SCHEMA 临时表信息
15.14.8 从INFORMATION_SCHEMA.FILES检索InnoDB表空间元数据
15.15 InnoDB与MySQL Performance Schema集成
15.15.1 使用Performance Schema给InnoDB表监控ALTER TABLE进度
15.15.2 使用Performance Schema监控InnoDB 互斥等待
15.16 InnoDB 监控器
15.16.1 InnoDB 监控器类型
15.16.2 启用 InnoDB 监控器
15.16.3 InnoDB 标准监控器和锁监控输出
15.17 InnoDB 备份和恢复
15.17.1 InnoDB 备份
15.17.2 InnoDB 恢复
15.18 InnoDB 和 MySQL 主从复制
15.19 InnoDB 分布式缓存插件
15.19.1 InnoDB分布式缓存插件的优点
15.19.2 InnoDB 分布式缓存的架构
15.19.3 设置InnoDB 分布式缓存插件
15.19.4 InnoDB 分布式缓存并行get和范围查询
15.19.5 InnoDB分布式缓存插件的安全因素
15.19.6 InnoDB分布式缓存插件的应用程序
15.19.7 InnoDB 分布式缓存插件和主从复制
15.19.8 InnoDB分布式缓存插件内部构件
15.19.9 InnoDB 分布式缓存插件故障处理
15.20 InnoDB 故障处理
15.20.1 InnoDB I/O 问题故障处理
15.20.2 强制InnoDB恢复
15.20.3 InnoDB 数据字典操作故障处理
15.20.4 InnoDB 错误处理

15.1 InnoDB介绍

InnoDB 是一种通用的存储引擎,兼顾了高可靠性和高性能。在MySQL 8.0中,InnoDB是默认的存储引擎。除非您配置了其他的默认存储引擎,否则执行不带ENGINE=子句的CREATE TABLE 语句时,创建的就是InnoDB 表。

InnoDB的主要优势

InnoDB的主要优势包括:

表 15.1 InnoDB 存储引擎特性

特性                      支持情况
存储限制                  64TB
事务型                    Yes
锁粒度                    行级
MVCC                      Yes
地理空间数据类型支持      Yes
地理空间索引支持          Yes[a]
B-树索引                  Yes
T-树索引                  No
哈希索引                  No[b]
全文索引                  Yes[c]
聚集索引                  Yes
数据缓存                  Yes
索引缓存                  Yes
数据压缩                  Yes[d]
数据加密                  Yes[e]
数据库集群支持            No
复制支持                  Yes[f]
外键支持                  Yes
备份/时间点恢复           Yes[g]
查询缓存支持              Yes
更新数据字典的统计数据    Yes

[a] MySQL 5.7.5 及之后版本的InnoDB支持地理空间索引。

[b] InnoDB在内部利用哈希索引来实现自适应哈希索引功能。

[c] MySQL 5.6.4 及之后版本的InnoDB支持全文索引。

[d] 只有使用 Barracuda 文件格式的InnoDB表才能被压缩。

[e] 在服务器中实现(通过加密功能)。MySQL 5.7及之后的版本,支持静态数据的表空间加密。

[f] 在服务器中实现,而不是存储引擎。

[g] 在服务器中实现,而不是存储引擎。


要将InnoDB的功能特性与其他存储引擎进行比较,可以参阅 第16章:可选择的存储引擎中的存储引擎特性表格。

InnoDB 增强和新特性

有关MySQL 8.0 中InnoDB 的增强和新特性,请参考:

其他的InnoDB 信息和资源

15.1.1 使用InnoDB表的好处

如果您正在使用 MyISAM 表,但并非出于技术原因必须如此,那么您会发现 InnoDB 表具有以下好处:

  • 如果您的服务器因为硬件或软件的问题而崩溃,那么无论当时数据库中正在发生什么,重新启动数据库之后,您都不需要做任何特别的事情。 InnoDB 故障恢复 会自动完成崩溃前已提交的任何更改,并撤销当时正在处理但未提交的任何更改。您只要重新启动并继续之前的工作即可。

  • InnoDB 存储引擎维护自己的 缓冲池 。随着数据被访问,会缓存表和索引数据到内存中。 频繁使用的数据是直接在内存中处理。此缓存适用于多种类型的信息并加速处理。在专用数据库服务器上,常常将高达80%的物理内存分配给InnoDB缓冲池。

  • 如果您将相关的数据分离到不同的表中,您可以建立外键来强制执行引用完整性。更新或删除某行数据时,其他表中相关的数据会自动被更新或删除。如果试图向子表插入数据,而主表中没有相应的数据,这些坏数据会被自动拒绝。

  • 如果数据在磁盘或内存中损坏, checksum 机制会在您使用这些数据之前,向您发出警告。

  • 当为每个表设计适当的 主键时,涉及这些列的操作将自动被优化。 引用主键列的 WHERE 子句、ORDER BY子句、 GROUP BY 子句和join 操作都会非常快。

  • 一个叫做change buffering的自动机制,会优化insert、update和delete操作。 InnoDB 不仅允许对同一个表并发地读和写,还将变更的数据缓存起来,以简化磁盘I/O。

  • 性能优势并不局限于对大表运行长查询。当反复访问一个表中相同的行时, 自适应哈希索引 会让这些查找变得更快,就好像它们来自一个哈希表。

  • 您可以压缩表和相关索引。

  • 您可以创建和删除索引,而且对性能和可用性的影响要小得多。

  • 截断file-per-table 表空间非常快,并且可以将释放的空间交还给操作系统重用,而不是像系统表空间那样只能由 InnoDB重用。

  • 使用DYNAMIC行格式时,包含 BLOB 和长文本字段的表的数据存储布局更高效。

  • 您可以通过查询 INFORMATION_SCHEMA 表来监控存储引擎的内部运作情况。

  • 您可以通过查询Performance Schema 表来监控存储引擎的性能细节。

  • 您可以自由地混合使用InnoDB 表和其他存储引擎的表,甚至是在同一条语句中。例如:您可以使用一个 join操作,在单个查询中合并 InnoDB表和 MEMORY 表的数据。

  • 在处理大数据量时,InnoDB的设计能最大限度地发挥CPU效率和性能。

  • InnoDB 表可以处理大量的数据,即使在操作系统将文件大小限制为2GB的情况下也是如此。

有关可以运用到您的应用程序中的InnoDB专有调优技术,请参阅 8.5 节, “优化InnoDB表”

15.1.2 InnoDB 表最佳实践

本节介绍使用InnoDB 表时的最佳实践。

  • 在每个表使用最频繁的单列或多列上指定一个主键,或者,如果没有明显的主键,就使用一个 自动增长值。

  • 使用join在多个表上基于相同的ID值提取数据。为了提升join的性能,将join列定义为外键,并在每个表中为这些列使用相同的数据类型。添加外键可以确保被引用的列建有索引,从而提升性能。 外键还会把delete或update传播到所有受影响的表,并且,如果父表中没有相应的ID,就会阻止向子表插入这些数据。

  • 关闭autocommit。每秒提交数百次会限制性能(受存储设备写入速度的限制)。

  • 将相关的DML 操作组合到事务中,用START TRANSACTION 和 COMMIT 语句划分界限。您既不应过于频繁地提交,也不应在执行大批量的 INSERT、UPDATE 或 DELETE 语句时,运行数小时而不提交。

  • 不要使用LOCK TABLES 语句。 InnoDB 可以在不牺牲可靠性和高性能的情况下,同时处理多个会话对同一个表的读写。 要获得对一组行的独占写权限, 可以使用 SELECT ... FOR UPDATE 语法只锁住想要更新的行。

  • 启用innodb_file_per_table 选项,以便将各个表的数据和索引存放在单独的文件中,而不是放在一个大的 系统表空间中。 这个设置也是使用其他特性(如表压缩和快速截断)的前提。

    从MySQL 5.6.6起,innodb_file_per_table 选项是默认开启的。

  • 评估您的数据和访问模式,看是否能从在 CREATE TABLE语句中使用InnoDB表压缩特性(ROW_FORMAT=COMPRESSED)中受益。您可以在不牺牲读/写能力的情况下压缩InnoDB表。

  • 如果担心CREATE TABLE语句中指定的ENGINE=子句出现问题,可以在运行服务器时使用 --sql_mode=NO_ENGINE_SUBSTITUTION选项,阻止表被创建成其他的存储引擎。
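下面的示例综合运用了上述几条实践(表名 customers 和列定义只是假设的示意,并非本手册中的实际对象):关闭 autocommit,把相关的 DML 组合到一个事务中,并使用 SELECT ... FOR UPDATE 锁定要更新的行:

```sql
-- 仅作演示的示例表(假设的表名和列名)
CREATE TABLE customers (
    id INT AUTO_INCREMENT PRIMARY KEY,
    balance DECIMAL(10,2) NOT NULL
) ENGINE=InnoDB;

SET autocommit = 0;
START TRANSACTION;
-- 锁定将要更新的行,获得独占写权限
SELECT balance FROM customers WHERE id = 1 FOR UPDATE;
UPDATE customers SET balance = balance - 100 WHERE id = 1;
UPDATE customers SET balance = balance + 100 WHERE id = 2;
COMMIT;
```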

15.1.3 验证InnoDB是默认存储引擎

执行 SHOW ENGINES命令查看不同存储引擎的信息,在InnoDB行看到 DEFAULT即表示 InnoDB 是默认的存储引擎。 除此之外,还可以查询INFORMATION_SCHEMA中的 ENGINES 表。
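例如,可以执行下面任意一条语句来确认(输出会随版本略有不同):

```sql
SHOW ENGINES;

-- 或者查询 INFORMATION_SCHEMA
SELECT ENGINE, SUPPORT
FROM INFORMATION_SCHEMA.ENGINES
WHERE ENGINE = 'InnoDB';
```

如果 InnoDB 是默认存储引擎,SUPPORT 列会显示 DEFAULT。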

15.1.4 InnoDB基准测试

如果 InnoDB 不是您的默认存储引擎,可以在命令行上使用--default-storage-engine=InnoDB选项,或在配置文件my.cnf的[mysqld]部分添加 default-storage-engine=innodb,然后重启服务器,以此确认您的数据库服务器或应用程序在 InnoDB下能够正常工作。

修改默认存储引擎只会影响修改后新创建的表,修改前已创建的表仍使用原来的存储引擎。修改后,运行所有的应用程序,逐一确认它们都工作正常,并执行程序的功能来确认所有的数据加载、编辑和查询都可以正常工作。如果某个表依赖 MyISAM 特有的某些特性,您会收到错误信息;此时可以在CREATE TABLE语句中添加ENGINE=MyISAM子句来避开那些错误。

如果您还没有下定决心使用InnoDB存储引擎,只是想先看看InnoDB表是如何工作的,可以对每个表执行命令 ALTER TABLE table_name ENGINE=InnoDB;。或者,为了不影响原表,只是用测试查询和其他语句试运行一下,可以像这样创建一个表的副本:

CREATE TABLE InnoDB_Table (...) ENGINE=InnoDB AS SELECT * FROM MyISAM_Table;

To get a true idea of the performance with a full application under a realistic workload, install the latest MySQL server and run benchmarks.

Test the full application lifecycle, from installation, through heavy usage, and server restart. Kill the server process while the database is busy to simulate a power failure, and verify that the data is recovered successfully when you restart the server.

测试任何复制配置,尤其是在主从复制中使用了不同的MySQL版本。

15.2 InnoDB和 ACID 模式

ACID模式是一组数据库设计原则,它强调对业务数据和关键任务型应用程序非常重要的可靠性。MySQL包含像InnoDB存储引擎这样与ACID模式紧密结合的组件,使数据不会被破坏,结果也不会因软件崩溃或硬件故障等异常情况而失真。当您依赖遵从ACID的特性时,就不必自己重新发明一致性检查和故障恢复机制。而如果您有额外的软件保护、超级可靠的硬件,或者应用能够容忍少量数据丢失或不一致,则可以调整MySQL的设置,用一部分 ACID可靠性来换取更高的性能和吞吐量。

下面几节将按ACID模式的各个类别,讨论MySQL的特性(特别是InnoDB存储引擎)与它们的关系:

  • A: 原子性.

  • C: 一致性.

  • I: 隔离性.

  • D: 持久性.

原子性

ACID模式的 原子性 方面主要涉及 InnoDB事务。相关的MySQL特性包括:

  • 设置自动提交。

  • COMMIT 语句。

  • ROLLBACK 语句。

  • 来自INFORMATION_SCHEMA表的运行数据。

一致性

ACID模式的 一致性 方面主要涉及InnoDB内部保护数据免受崩溃影响的处理过程。相关的MySQL特性包括:

隔离性

ACID模式的 隔离性 方面主要涉及InnoDB事务,尤其是适用于每个事务的 隔离级别。相关的MySQL特性包括:

  • 设置自动提交。

  • SET ISOLATION LEVEL 语句。

  • InnoDB 锁定的底层细节。在性能调优中,可以通过 INFORMATION_SCHEMA 表看到这些细节。

持久性

ACID模式的 持久性 方面,涉及MySQL软件特性与具体硬件配置的相互作用。 由于影响的因素很多(如:CPU、网络和存储设备的能力),这方面最复杂,也最难提供具体的指导方针。 (这些指导方针也可能就是“购买新硬件”。) 相关的MySQL特性包括:

  • InnoDB doublewrite buffer,可以使用 innodb_doublewrite 配置选项关闭或启用。

  • 配置选项 innodb_flush_log_at_trx_commit

  • 配置选项 sync_binlog

  • 配置选项 innodb_file_per_table

  • 存储设备中的写缓冲,如:磁盘、SSD、或RAID阵列。

  • 存储设备中带电池保护的缓存。

  • 运行MySQL的操作系统,尤其是它对fsync()系统调用的支持。

  • 不间断电源(UPS),保护所有运行MySQL服务器和存储MySQL数据的计算机与存储设备的电力供应。

  • 备份策略。如:备份的频率和类型,以及备份的保留期限。

  • 对于分布式或托管数据的应用程序,MySQL服务器硬件所在数据中心的特性,以及数据中心之间的网络连接。

15.3 InnoDB 多版本

InnoDB 是一个多版本(MVCC)存储引擎: 它为变更的行保存旧版本的信息,以支持并发和回滚等事务特性。这些信息存储在表空间中一个叫做 回滚段 的数据结构里(与Oracle中的同名数据结构类似)。 InnoDB使用回滚段中的信息来执行事务回滚所需的undo操作, 同样使用这些信息来构建行的较早版本,以进行一致性读。

在内部,InnoDB为数据库中存储的每一行添加了三个字段。一个6字节的 DB_TRX_ID字段,表示最后一次插入或更新该行的事务的事务标识符;此外,删除在内部被当作更新处理,即在行中用一个特殊的位标记它已被删除。 每行还有一个7字节的 DB_ROLL_PTR 字段,叫做回滚指针。 回滚指针指向写在回滚段中的一条undo日志记录。如果行被更新过,这条undo日志记录就包含重建该行更新前内容所需的信息。 还有一个6字节的DB_ROW_ID字段,包含随新行插入而单调递增的行ID。如果 InnoDB 自动生成聚集索引,该索引就包含行ID值。否则, DB_ROW_ID 列不会出现在任何索引中。

回滚段中的undo日志分为insert和update两种undo日志。insert undo日志只在事务回滚时需要,一旦事务提交就可以马上丢弃。update undo日志还会用于一致性读,但只有当不再存在任何已被 InnoDB 分配了快照、且在一致性读中可能需要update undo日志中的信息来构建数据行早期版本的事务时,才可以丢弃它们。

请定期提交事务,包括那些只执行一致性读的事务。否则, InnoDB无法丢弃update undo日志中的数据,回滚段可能会变得越来越大,直至占满表空间。

回滚段中一条undo日志记录的物理大小,通常小于相应插入或更新的行。您可以使用这些信息来计算回滚段所需的空间。

在InnoDB多版本机制下, 使用SQL语句执行delete操作时,行不会立刻被物理地从数据库中移除。只有当 InnoDB 可以丢弃为该删除操作写入的update undo日志记录时,才会物理地移除相应的行及其索引记录。 这个移除操作叫做purge,它非常快,耗时通常与执行删除的SQL语句在同一量级。

如果以大致相同的速度向表中小批量地插入和删除行,那么 purge 线程可能开始滞后,表会因为这些“dead”行而越来越大,使所有操作都受制于磁盘而变得非常慢。这种情况下,可以通过调整 innodb_max_purge_lag系统变量来限制新行操作的速度,并为purge 线程分配更多的资源。参阅 15.13 节, “InnoDB 启动选项和系统变量”获取更多信息。
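例如,下面的语句为 purge 滞后设置一个上限(数值 1000000 仅为示意,合适的取值取决于具体负载):

```sql
-- 查看当前设置
SHOW VARIABLES LIKE 'innodb_max_purge_lag%';

-- 当 purge 滞后超过该值时,开始延迟新的行操作(示例值)
SET GLOBAL innodb_max_purge_lag = 1000000;
```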

多版本和辅助索引

InnoDB多版本并发控制(MVCC)对辅助索引的处理与聚集索引不同。聚集索引记录是原地更新的,其隐藏的系统列指向undo日志条目,通过这些条目可以重建记录的早期版本。 与聚集索引记录不同,辅助索引记录既不包含隐藏的系统列,也不会原地更新。

When a secondary index column is updated, old secondary index records are delete-marked, new records are inserted, and delete-marked records are eventually purged. When a secondary index record is delete-marked or the secondary index page is updated by a newer transaction, InnoDB looks up the database record in the clustered index. In the clustered index, the record's DB_TRX_ID is checked, and the correct version of the record is retrieved from the undo log if the record was modified after the reading transaction was initiated.

If a secondary index record is marked for deletion or the secondary index page is updated by a newer transaction, the covering index technique is not used. Instead of returning values from the index structure, InnoDB looks up the record in the clustered index.

However, if the index condition pushdown (ICP) optimization is enabled, and parts of the WHERE condition can be evaluated using only fields from the index, the MySQL server still pushes this part of the WHERE condition down to the storage engine where it is evaluated using the index. If no matching records are found, the clustered index lookup is avoided. If matching records are found, even among delete-marked records, InnoDB looks up the record in the clustered index.

15.4 InnoDB 架构

本节将介绍InnoDB存储引擎体系结构的主要组件。

15.4.1 缓冲池

缓冲池是主内存中的一个区域,当访问表和索引数据时,InnoDB会把这些数据缓存到这个区域。缓冲池允许频繁使用的数据直接在内存中处理,从而加快处理速度。在数据库专用的服务器上,常常将高达80%的物理内存分配给InnoDB缓冲池。

为了提高大批量读取操作的效率,缓冲池被划分为可以容纳多行数据的页。为了提高缓存管理的效率,缓冲池被实现为页的链表;很少使用的数据会通过LRU算法的一个变体从缓存中老化移出。

15.4.2 Change Buffer

The change buffer is a special data structure that caches changes to secondary index pages when affected pages are not in the buffer pool. The buffered changes, which may result from INSERT, UPDATE, or DELETE operations (DML), are merged later when the pages are loaded into the buffer pool by other read operations.

Unlike clustered indexes, secondary indexes are usually non-unique, and inserts into secondary indexes happen in a relatively random order. Similarly, deletes and updates may affect secondary index pages that are not adjacently located in an index tree. Merging cached changes at a later time, when affected pages are read into the buffer pool by other operations, avoids substantial random access I/O that would be required to read-in secondary index pages from disk.

Periodically, the purge operation that runs when the system is mostly idle, or during a slow shutdown, writes the updated index pages to disk. The purge operation can write disk blocks for a series of index values more efficiently than if each value were written to disk immediately.

Change buffer merging may take several hours when there are numerous secondary indexes to update and many affected rows. During this time, disk I/O is increased, which can cause a significant slowdown for disk-bound queries. Change buffer merging may also continue to occur after a transaction is committed. In fact, change buffer merging may continue to occur after a server shutdown and restart (see 15.20.2 节, “Forcing InnoDB Recovery” for more information).

In memory, the change buffer occupies part of the InnoDB buffer pool. On disk, the change buffer is part of the system tablespace, so that index changes remain buffered across database restarts.

The type of data cached in the change buffer is governed by the innodb_change_buffering configuration option. For more information, see 15.6.4 节, “Configuring InnoDB Change Buffering”. You can also configure the maximum change buffer size. For more information, see 15.6.4.1, “Configuring the Change Buffer Maximum Size”.
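As a sketch, change buffering can be adjusted at runtime; the values shown below are just two of the permitted settings:

```sql
-- Check the current setting
SHOW VARIABLES LIKE 'innodb_change_buffering';

-- Buffer only insert operations, for example
SET GLOBAL innodb_change_buffering = 'inserts';

-- Restore the default, which buffers all operations
SET GLOBAL innodb_change_buffering = 'all';
```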

Change buffering is not supported for a secondary index if the index contains a descending index column or if the primary key includes a descending index column.

Monitoring the Change Buffer

The following options are available for change buffer monitoring:

  • InnoDB Standard Monitor output includes status information for the change buffer. To view monitor data, issue the SHOW ENGINE INNODB STATUS command.

    mysql> SHOW ENGINE INNODB STATUS\G

    Change buffer status information is located under the INSERT BUFFER AND ADAPTIVE HASH INDEX heading and appears similar to the following:

    -------------------------------------
    INSERT BUFFER AND ADAPTIVE HASH INDEX
    -------------------------------------
    Ibuf: size 1, free list len 0, seg size 2, 0 merges
    merged operations:
     insert 0, delete mark 0, delete 0
    discarded operations:
     insert 0, delete mark 0, delete 0
    Hash table size 4425293, used cells 32, node heap has 1 buffer(s)
    13577.57 hash searches/s, 202.47 non-hash searches/s

    For more information, see 15.16.3 节, “InnoDB Standard Monitor and Lock Monitor Output”.

  • The INFORMATION_SCHEMA.INNODB_METRICS table provides most of the data points found in InnoDB Standard Monitor output, plus other data points. To view change buffer metrics and a description of each, issue the following query:

    mysql> SELECT NAME, COMMENT FROM INFORMATION_SCHEMA.INNODB_METRICS WHERE NAME LIKE '%ibuf%'\G

    For INNODB_METRICS table usage information, see 15.14.6 节, “InnoDB INFORMATION_SCHEMA Metrics Table”.

  • The INFORMATION_SCHEMA.INNODB_BUFFER_PAGE table provides metadata about each page in the buffer pool, including change buffer index and change buffer bitmap pages. Change buffer pages are identified by PAGE_TYPE. IBUF_INDEX is the page type for change buffer index pages, and IBUF_BITMAP is the page type for change buffer bitmap pages.

    Warning

    Querying the INNODB_BUFFER_PAGE table can introduce significant performance overhead. To avoid impacting performance, reproduce the issue you want to investigate on a test instance and run your queries on the test instance.

    For example, you can query the INNODB_BUFFER_PAGE table to determine the approximate number of IBUF_INDEX and IBUF_BITMAP pages as a percentage of total buffer pool pages.

    SELECT
    (SELECT COUNT(*) FROM INFORMATION_SCHEMA.INNODB_BUFFER_PAGE
    WHERE PAGE_TYPE LIKE 'IBUF%'
    ) AS change_buffer_pages,
    (
    SELECT COUNT(*)
    FROM INFORMATION_SCHEMA.INNODB_BUFFER_PAGE
    ) AS total_pages,
    (
    SELECT ((change_buffer_pages/total_pages)*100)
    ) AS change_buffer_page_percentage;
    +---------------------+-------------+-------------------------------+
    | change_buffer_pages | total_pages | change_buffer_page_percentage |
    +---------------------+-------------+-------------------------------+
    |                  25 |        8192 |                        0.3052 |
    +---------------------+-------------+-------------------------------+

    For information about other data provided by the INNODB_BUFFER_PAGE table, see Section 24.31.1, “The INFORMATION_SCHEMA INNODB_BUFFER_PAGE Table”. For related usage information, see 15.14.5 节, “InnoDB INFORMATION_SCHEMA Buffer Pool Tables”.

  • Performance Schema provides change buffer mutex wait instrumentation for advanced performance monitoring. To view change buffer instrumentation, issue the following query:

    mysql> SELECT * FROM performance_schema.setup_instruments
    WHERE NAME LIKE '%wait/synch/mutex/innodb/ibuf%';
    +-------------------------------------------------------+---------+-------+
    | NAME                                                  | ENABLED | TIMED |
    +-------------------------------------------------------+---------+-------+
    | wait/synch/mutex/innodb/ibuf_bitmap_mutex             | YES     | YES   |
    | wait/synch/mutex/innodb/ibuf_mutex                    | YES     | YES   |
    | wait/synch/mutex/innodb/ibuf_pessimistic_insert_mutex | YES     | YES   |
    +-------------------------------------------------------+---------+-------+

    For information about monitoring InnoDB mutex waits, see 15.15.2 节, “Monitoring InnoDB Mutex Waits Using Performance Schema”.

15.4.3 自适应哈希索引

The adaptive hash index (AHI) lets InnoDB perform more like an in-memory database on systems with appropriate combinations of workload and ample memory for the buffer pool, without sacrificing any transactional features or reliability. This feature is enabled by the innodb_adaptive_hash_index option, or turned off by --skip-innodb_adaptive_hash_index at server startup.

Based on the observed pattern of searches, MySQL builds a hash index using a prefix of the index key. The prefix of the key can be any length, and it may be that only some of the values in the B-tree appear in the hash index. Hash indexes are built on demand for those pages of the index that are often accessed.

If a table fits almost entirely in main memory, a hash index can speed up queries by enabling direct lookup of any element, turning the index value into a sort of pointer. InnoDB has a mechanism that monitors index searches. If InnoDB notices that queries could benefit from building a hash index, it does so automatically.

With some workloads, the speedup from hash index lookups greatly outweighs the extra work to monitor index lookups and maintain the hash index structure. Sometimes, the read/write lock that guards access to the adaptive hash index can become a source of contention under heavy workloads, such as multiple concurrent joins. Queries with LIKE operators and % wildcards also tend not to benefit from the AHI. For workloads where the adaptive hash index is not needed, turning it off reduces unnecessary performance overhead. Because it is difficult to predict in advance whether this feature is appropriate for a particular system, consider running benchmarks with it both enabled and disabled, using a realistic workload. The architectural changes in MySQL 5.6 and higher make more workloads suitable for disabling the adaptive hash index than in earlier releases, although it is still enabled by default.

The adaptive hash index search system is partitioned. Each index is bound to a specific partition, and each partition is protected by a separate latch. Partitioning is controlled by the innodb_adaptive_hash_index_parts configuration option. The innodb_adaptive_hash_index_parts option is set to 8 by default. The maximum setting is 512.
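For example, the adaptive hash index itself can be toggled at runtime, while the number of partitions can only be set at startup (the partition count below is illustrative):

```sql
-- innodb_adaptive_hash_index is dynamic and can be changed at runtime:
SET GLOBAL innodb_adaptive_hash_index = OFF;

-- innodb_adaptive_hash_index_parts is not dynamic; it must be set at
-- startup, e.g. in the [mysqld] section of the option file:
--   innodb_adaptive_hash_index_parts = 16
```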

The hash index is always built based on an existing B-tree index on the table. InnoDB can build a hash index on a prefix of any length of the key defined for the B-tree, depending on the pattern of searches that InnoDB observes for the B-tree index. A hash index can be partial, covering only those pages of the index that are often accessed.

You can monitor the use of the adaptive hash index and the contention for its use in the SEMAPHORES section of the output of the SHOW ENGINE INNODB STATUS command. If you see many threads waiting on an RW-latch created in btr0sea.c, then it might be useful to disable adaptive hash indexing.

更多有关哈希索引的性能特征,参阅 8.3.8 节, “B-Tree和哈希索引的对比”.

15.4.4 重做日志缓冲池

重做日志缓冲池是内存中的一个区域,用于保存将要写到 重做日志的数据。重做日志缓冲池的大小由 innodb_log_buffer_size配置选项定义。重做日志缓冲池的内容会定期刷新到磁盘上的日志文件中。较大的重做日志缓冲池可以允许大事务运行,而无需在事务提交前将重做日志数据写到磁盘上。因此,如果您的事务会update、insert 或 delete 很多行,增大重做日志缓冲池可以节省磁盘I/O。

innodb_flush_log_at_trx_commit 选项控制重做日志缓冲区中的内容如何写入日志文件。 innodb_flush_log_at_timeout 选项控制重做日志刷新频率。
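例如,可以在配置文件中组合这几个选项(下面的数值仅为示意,需要根据实际负载调整):

```ini
[mysqld]
# 较大的日志缓冲区可减少大事务提交前的磁盘 I/O(示例值)
innodb_log_buffer_size = 64M
# 1 为默认值,完全符合 ACID;0 或 2 可以换取更高的吞吐量
innodb_flush_log_at_trx_commit = 1
# 当 innodb_flush_log_at_trx_commit 为 0 或 2 时,每 N 秒刷新一次日志
innodb_flush_log_at_timeout = 1
```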

15.4.5 系统表空间

InnoDB系统表空间包含InnoDB数据字典(InnoDB相关对象的元数据),同时也是doublewrite 缓冲池、change buffer和undo日志的存储区域。系统表空间还包含任何在系统表空间中创建的用户表的表数据和索引数据。系统表空间被称为共享表空间,因为它被多个表共享。

系统表空间由一个或多个数据文件表示。默认情况下,会在MySQL数据目录下创建一个名为 ibdata1的系统数据文件。数据文件的大小和数量由启动选项 innodb_data_file_path控制。

相关信息,参阅  15.6.1 节, “InnoDB启动配置”,和  15.7.1 节, “调整InnoDB 系统表空间大小”

15.4.6 Doublewrite Buffer

doublewrite 缓冲池是位于系统表空间中的一个存储区域。从InnoDB缓冲池刷出的页在写入数据文件中正确的位置之前,会先被写到这个存储区域。只有在完成对doublewrite缓冲池的刷页和写页之后,InnoDB才会把页写到其正确的位置。如果操作系统、存储子系统或mysqld进程在页写入的过程中崩溃, InnoDB 可以在故障恢复期间从doublewrite 缓冲池中找到该页的完好副本。

尽管数据总是被写两次,doublewrite缓冲池并不需要两倍的I/O开销或两倍的I/O操作。数据是作为一个大的顺序块写入doublewrite缓冲池本身的,只需对操作系统调用一次 fsync()。

大部分情况下,doublewrite缓冲池默认是启用的。要禁用它,可将 innodb_doublewrite设置为 0。
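例如,可以在配置文件中禁用doublewrite(一般只建议在底层存储本身能保证原子写时这样做):

```ini
[mysqld]
# 0 表示禁用 doublewrite 缓冲,1(默认)表示启用
innodb_doublewrite = 0
```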

如果系统表空间文件(ibdata文件)位于支持原子写的Fusion-io设备上,doublewrite缓冲会被自动禁用,并对所有数据文件使用Fusion-io原子写。由于doublewrite缓冲设置是全局的,位于非Fusion-io硬件上的数据文件的doublewrite缓冲也会被禁用。该特性仅支持Fusion-io硬件,并且仅在Linux上的Fusion-io NVMFS中启用。 为了充分利用这一特性,建议将 innodb_flush_method 设置为 O_DIRECT。

15.4.7 Undo 日志

undo日志是与单个读写事务相关联的undo日志记录的集合。一条undo日志记录包含有关如何撤销事务对聚集索引记录所做最新变更的信息。如果另一个事务需要查看原始数据 (作为一致性读操作的一部分),就会从undo日志记录中取得未修改的数据。undo日志存在于 undo日志段中,undo日志段包含在 回滚段中,而回滚段驻留在 系统表空间、 临时表空间和 undo 表空间中。更多有关undo表空间的信息,参阅 15.7.7 节, “配置 Undo 表空间”。关于多版本的信息,参阅 15.3 节, “InnoDB 多版本”

系统表空间、临时表空间和每个undo表空间分别最多支持128个回滚段。配置选项 innodb_rollback_segments 定义回滚段的数量。每个回滚段最多支持1023个并发的数据修改事务。

15.4.8 file-per-table表空间

file-per-table表空间意味着每个表都有自己的表空间:表被创建在自己的数据文件中,而不是在系统表空间中。 只有启用了innodb_file_per_table选项,表才会创建在file-per-table表空间中;否则, InnoDB 表都创建在系统表空间中。 每个.ibd数据文件代表一个file-per-table表空间,默认创建在相应的数据库目录下。

file-per-table 表空间的数据文件支持DYNAMIC和 COMPRESSED行格式,而这些行格式又支持可变长数据的off-page存储和表压缩等特性。更多关于这些特性以及file-per-table表空间其他好处的信息,参阅 15.7.4 节, “InnoDB File-Per-Table 表空间”

15.4.9 普通表空间

普通表空间是使用 CREATE TABLESPACE 语法创建的共享 InnoDB表空间。普通表空间可以在MySQL数据目录之外创建,能够容纳多个表,并且支持所有行格式的表。

使用 CREATE TABLE tbl_name ... TABLESPACE [=] tablespace_nameALTER TABLE tbl_name TABLESPACE [=] tablespace_name语法将表添加到普通表空间中。
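下面的语句演示了上述语法(表空间名ts1、文件名和表定义均为假设的示例):

```sql
-- 在数据目录中创建一个普通表空间
CREATE TABLESPACE ts1
    ADD DATAFILE 'ts1.ibd'
    ENGINE=InnoDB;

-- 在该表空间中创建新表
CREATE TABLE t1 (c1 INT PRIMARY KEY) TABLESPACE ts1;

-- 将已有的表移入该表空间
ALTER TABLE t2 TABLESPACE ts1;
```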

更多信息,请参阅15.7.9 节, “InnoDB 普通表空间”

15.4.10 Undo 表空间

一个undo表空间是由一个或多个含有 undo日志的文件组成。 将innodb_undo_tablespaces配置选项设置为非零值就可以创建undo表空间。 更多信息,请参阅15.7.7 节, “配置Undo 表空间”
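例如,可以在配置文件中设置该选项(该选项通常需要在实例初始化时设置;数值仅为示意):

```ini
[mysqld]
# 创建 2 个独立的 undo 表空间
innodb_undo_tablespaces = 2
```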

15.4.11 临时表空间

临时表空间用于非压缩的 InnoDB 临时表及相关对象。 innodb_temp_data_file_path配置选项为临时表空间数据文件定义相对路径、名称、大小和属性。如果没有设置 innodb_temp_data_file_path,默认会在数据目录下创建一个名为 ibtmp1、初始大小为12MB的自动扩展数据文件。 每次服务器启动时,临时表空间都会被重新创建,并动态生成一个表空间ID,从而避免与现有的表空间ID发生冲突。临时表空间不能驻留在裸设备上。如果临时表空间无法创建,服务器将无法启动。
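例如,可以在配置文件中指定临时表空间数据文件的名称、初始大小和自动扩展属性(文件名与大小仅为示例):

```ini
[mysqld]
# 名为 ibtmp1、初始 12MB、自动扩展、上限 5GB 的临时表空间数据文件
innodb_temp_data_file_path = ibtmp1:12M:autoextend:max:5G
```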

在正常关闭或异常中止的初始化时,临时表空间会被移除;而发生崩溃时,临时表空间不会被移除。这种情况下,管理员可以手动移除临时表空间,或者使用相同的配置重启服务器,由服务器移除并重建临时表空间。

15.4.11.1 临时表undo日志

临时表undo日志用于临时表及相关对象。这种类型的undo日志不会被记入重做日志,因为临时表在故障恢复期间不会被恢复,也就不需要重做日志。不过,在服务器运行时,临时表undo日志用于回滚。这种特殊的非重做undo日志通过避免对临时表及相关对象进行重做日志I/O来提升性能。临时表undo日志驻留在临时表空间中。

配置选项innodb_rollback_segments 定义临时表空间使用的回滚段数量。

15.4.12 重做日志

重做日志是一种基于磁盘的数据结构,用于在故障恢复期间纠正由不完整事务写入的数据。正常运行期间,重做日志对由SQL语句或低级别API调用产生的、修改表数据的请求进行编码。 对于在服务器意外关闭之前没有完成数据文件更新的修改,会在初始化期间、接受新连接之前自动重放。 更多关于重做日志在故障恢复中所起作用的信息,请参阅15.17.2 节, “InnoDB 恢复”

默认情况下,重做日志在磁盘上物理表现为一组名为ib_logfile0和 ib_logfile1的文件。 MySQL以循环的方式写入重做日志文件。重做日志中的数据以受影响的记录进行编码,这些数据被统称为重做(redo)。数据经过重做日志的过程,用一个不断增长的LSN值来表示。

相关信息,请参阅:

15.4.12.1 重做日志刷新的组提交

InnoDB 像其他任何遵守 ACID的数据库引擎一样,在事务提交之前刷新事务的 重做日志。 InnoDB 使用组提交(group commit)功能,将多个刷新请求组合在一起,以避免每次提交都进行一次刷新。在进行组提交时, InnoDB只对日志文件执行一次写操作,就能为大约同时提交的多个用户事务完成提交动作,从而显著提高吞吐量。

更多有关 COMMIT性能和其它事务性操作的信息,请参阅 8.5.2 节, “优化InnoDB事务管理”

15.5 InnoDB 锁定和事务模式

本节将讨论几个与InnoDB锁定相关的主题,以及您应该熟悉的InnoDB事务模型。

15.5.1 InnoDB 锁定

本节描述 InnoDB使用的锁的类型。

共享锁和排他锁

InnoDB实现标准的行级锁,其中的两种类型是 共享 (S) 锁排他 (X) 锁

如果事务T1r行,拥有一个共享锁 (S),那么,对来自某些不同T2事务对r行的锁的请求被处理如下:

  • T2请求一个 S 锁会被立刻授予。结果是,T1T2 同时在r上拥有一个S锁。

  • T2请求一个 X锁,则不会被立刻授予。

如果事务T1r行持有一个 (X) 锁,那么来自某些不同事务的 T2不管请求什么类型的锁,都不会被立刻授予,而是必须要等待事务T1释放r上的锁后,才会被授予。

意向锁

InnoDB 支持多粒度锁定,允许行级锁和表级锁共存。 为了使多个粒度级别的锁定变得可行,InnoDB使用了意向锁。意向锁是InnoDB中的表级锁,它表明一个事务稍后需要对表中的行使用哪种类型的锁(共享或排他)。InnoDB中使用两种类型的意向锁(假定事务T已经请求了在表t上指定类型的锁):

  • 意向共享 (IS): 事务 T 打算在表t的行上设置 S 锁。

  • 意向排他 (IX): 事务 T 打算在表t上的那些行设置 X 锁。

例如: SELECT ... FOR SHARE 设置一个 IS 锁,而 SELECT ... FOR UPDATE 设置一个 IX 锁。

意向锁定的协议如下:

  • 一个事务在获得表t中某行的 S锁之前,必须先获得表t上的 IS 锁或更强的锁。

  • 一个事务在获得表t中某行的 X锁之前,必须先获得表t上的 IX 锁。

这些锁的规则可以通过下面的矩阵图来总结。

       X     IX    S     IS
X     冲突  冲突  冲突  冲突
IX    冲突  共存  冲突  共存
S     冲突  冲突  共存  共存
IS    冲突  共存  共存  共存

如果事务请求的锁与现有的锁兼容,锁就会被授予给请求的事务;如果与现有的锁冲突,则不会被授予,该事务要等待冲突的现有锁被释放。如果锁请求与现有的锁冲突,并且因为会导致死锁而无法被授予,则会出现错误。

因此,除了全表请求(例如:LOCK TABLES ... WRITE)之外,意向锁不会阻塞任何操作。意向锁的主要目的是表明某个事务正在锁定表中的行,或者准备锁定表中的行。

在 SHOW ENGINE INNODB STATUS 和 InnoDB 监控器 的输出中,意向锁的事务数据与下面类似:

TABLE LOCK table `test`.`t` trx id 10080 lock mode IX

Record 锁

record 锁是索引记录上的锁。例如, SELECT c1 FROM t WHERE c1 = 10 FOR UPDATE; 阻止其它任何事务对 t.c110的行进行插入、更新或删除。

Record 锁总是锁定索引记录,即使表没有定义任何索引。在这种情况下, InnoDB 会创建一个隐藏的聚集索引,并使用该索引进行record 锁定。参阅 15.8.2.1, “聚集索引和辅助索引”

在 SHOW ENGINE INNODB STATUS 和 InnoDB monitor 的输出中,record 锁的事务数据与下面类似:

RECORD LOCKS space id 58 page no 3 n bits 72 index `PRIMARY` of table `test`.`t` 
trx id 10078 lock_mode X locks rec but not gap
Record lock, heap no 2 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 4; hex 8000000a; asc     ;;
 1: len 6; hex 00000000274f; asc     'O;;
 2: len 7; hex b60000019d0110; asc        ;;

Gap 锁

gap 锁是加在索引记录之间的间隙上的锁,或者是加在第一条索引记录之前、最后一条索引记录之后的间隙上的锁。 例如: SELECT c1 FROM t WHERE c1 BETWEEN 10 and 20 FOR UPDATE; 会阻止其他事务向 t.c1列插入值15,无论该列中是否已经存在这个值,因为这个范围内所有现有值之间的间隙都被锁定了。

一个gap 可能跨越单个索引值、多个索引值,甚至为空。

Gap 锁是性能和一致性之间权衡的一部分,并且只在某些事务隔离级别中使用,并非全部。

对于使用唯一索引按唯一搜索条件锁定行的语句,不需要gap锁。(这不包括搜索条件只包含多列唯一索引中部分列的情况;那种情况下仍会发生gap锁定。) 例如: 如果 id 列上有一个唯一索引,下面的语句只对id值为100的行使用一个index-record锁,其他会话是否在其之前的间隙中插入行并不重要:

SELECT * FROM child WHERE id = 100;

如果 id 列没有索引,或者索引不是唯一索引,这条语句就会锁定之前的间隙。

同样值得注意的是,不同的事务可以在一个gap中持有不同的锁。例如,事务A可以在一个gap中持有一个共享的gap锁(gap S-锁),而事务B在相同的gap持有一个排他gap锁(gap x-锁)。之所以允许存在冲突的gap锁,是因为如果一个记录被从一个索引中清除, 那么必须合并不同事务在记录上持有的gap锁。

InnoDB 中的 gap 锁是“纯抑制性”的,这意味着它们的唯一目的是阻止其他事务向间隙中插入数据。Gap 锁可以共存:一个事务持有的gap锁并不会阻止另一个事务在同一个间隙上持有gap锁。共享gap锁和排他gap锁之间没有区别,它们彼此不冲突,执行的功能也相同。

将事务隔离级别设置为 READ COMMITTED便可显式禁用gap锁。在这种情况下,gap锁定在搜索和索引扫描中被禁用,仅用于外键约束检查和重复键检查。

使用 READ COMMITTED隔离级别还有其他影响:MySQL在评估完 WHERE条件后,会释放不匹配行上的Record锁。对于UPDATE 语句,InnoDB会执行一次 半一致性读,把最新的已提交版本返回给MySQL,以便MySQL判断该行是否匹配UPDATE中的 WHERE条件。

Next-Key 锁

A next-key lock is a combination of a record lock on the index record and a gap lock on the gap before the index record.

InnoDB performs row-level locking in such a way that when it searches or scans a table index, it sets shared or exclusive locks on the index records it encounters. Thus, the row-level locks are actually index-record locks. A next-key lock on an index record also affects the gap before that index record. That is, a next-key lock is an index-record lock plus a gap lock on the gap preceding the index record. If one session has a shared or exclusive lock on record R in an index, another session cannot insert a new index record in the gap immediately before R in the index order.

Suppose that an index contains the values 10, 11, 13, and 20. The possible next-key locks for this index cover the following intervals, where a round bracket denotes exclusion of the interval endpoint and a square bracket denotes inclusion of the endpoint:

(negative infinity, 10]
(10, 11]
(11, 13]
(13, 20]
(20, positive infinity)

For the last interval, the next-key lock locks the gap above the largest value in the index and the supremum pseudo-record having a value higher than any value actually in the index. The supremum is not a real index record, so, in effect, this next-key lock locks only the gap following the largest index value.

By default, InnoDB operates in REPEATABLE READ transaction isolation level. In this case, InnoDB uses next-key locks for searches and index scans, which prevents phantom rows (see 15.5.4 节, “Phantom Rows”).

Transaction data for a next-key lock appears similar to the following in SHOW ENGINE INNODB STATUS and InnoDB monitor output:

RECORD LOCKS space id 58 page no 3 n bits 72 index `PRIMARY` of table `test`.`t` 
trx id 10080 lock_mode X
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

Record lock, heap no 2 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 4; hex 8000000a; asc     ;;
 1: len 6; hex 00000000274f; asc     'O;;
 2: len 7; hex b60000019d0110; asc        ;;

插入意向锁

An insert intention lock is a type of gap lock set by INSERT operations prior to row insertion. This lock signals the intent to insert in such a way that multiple transactions inserting into the same index gap need not wait for each other if they are not inserting at the same position within the gap. Suppose that there are index records with values of 4 and 7. Separate transactions that attempt to insert values of 5 and 6, respectively, each lock the gap between 4 and 7 with insert intention locks prior to obtaining the exclusive lock on the inserted row, but do not block each other because the rows are nonconflicting.

The following example demonstrates a transaction taking an insert intention lock prior to obtaining an exclusive lock on the inserted record. The example involves two clients, A and B.

Client A creates a table containing two index records (90 and 102) and then starts a transaction that places an exclusive lock on index records with an ID greater than 100. The exclusive lock includes a gap lock before record 102:

mysql> CREATE TABLE child (id int(11) NOT NULL, PRIMARY KEY(id)) ENGINE=InnoDB;
mysql> INSERT INTO child (id) values (90),(102);

mysql> START TRANSACTION;
mysql> SELECT * FROM child WHERE id > 100 FOR UPDATE;
+-----+
| id  |
+-----+
| 102 |
+-----+

Client B begins a transaction to insert a record into the gap. The transaction takes an insert intention lock while it waits to obtain an exclusive lock.

mysql> START TRANSACTION;
mysql> INSERT INTO child (id) VALUES (101);

Transaction data for an insert intention lock appears similar to the following in SHOW ENGINE INNODB STATUS and InnoDB monitor output:

RECORD LOCKS space id 31 page no 3 n bits 72 index `PRIMARY` of table `test`.`child`
trx id 8731 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 3 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 4; hex 80000066; asc    f;;
 1: len 6; hex 000000002215; asc     " ;;
 2: len 7; hex 9000000172011c; asc     r  ;;...

AUTO-INC 锁

An AUTO-INC lock is a special table-level lock taken by transactions inserting into tables with AUTO_INCREMENT columns. In the simplest case, if one transaction is inserting values into the table, any other transactions must wait to do their own inserts into that table, so that rows inserted by the first transaction receive consecutive primary key values.

The innodb_autoinc_lock_mode configuration option controls the algorithm used for auto-increment locking. It allows you to choose how to trade off between predictable sequences of auto-increment values and maximum concurrency for insert operations.

更多信息,请参阅 15.8.1.5 节, “InnoDB中AUTO_INCREMENT的处理”

空间索引的谓词锁

InnoDB 支持包含空间列的空间索引(参阅 11.5.8 节, “优化空间分析”)。

对于涉及 SPATIAL索引的操作,next-key 锁定并不能很好地支持 可重复读和 可串行化事务隔离级别,因为多维数据中没有绝对的排序概念,无法确定哪个是 next 键。

为了支持带有空间索引的表的隔离级别,InnoDB 使用谓词锁。SPATIAL索引包含最小边界矩形值 (MBR), 因此,InnoDB通过在查询使用的MBR值上设置谓词锁,来对索引实施一致性读。 其他事务将不能插入或修改匹配该查询条件的行。

15.5.2 InnoDB 事务模式

InnoDB事务模型的目标,是把多版本数据库的最佳特性与传统的两阶段锁定相结合。 InnoDB在行级执行锁定,并且默认情况下以Oracle的风格,将查询作为非锁定的一致性读来执行。 InnoDB中的锁信息以节省空间的方式存储,因此不需要锁升级。通常,允许多个用户锁定InnoDB表中的每一行,或任意随机的行子集,而不会导致InnoDB内存耗尽。

15.5.2.1 事务隔离级别

事务隔离是数据库处理的基础之一。隔离是缩写 ACID中的I。隔离级别是在多个事务同时进行更改和执行查询时,用来微调性能与结果的可靠性、一致性、可重现性之间平衡的设置。

InnoDB 提供了SQL:1992标准所描述的所有四个事务隔离级别: 读未提交、 读已提交、 可重复读、和 可串行化。 InnoDB的默认隔离级别是 可重复读。

用户可以使用SET TRANSACTION语句来改变单个会话或所有后续连接的隔离级别。 要为服务器的所有连接设置默认隔离级别,可以在命令行或配置文件中使用 --transaction-isolation 选项。 关于隔离级别和隔离级别设置语法的详细信息,请参阅 13.3.6 节, “SET TRANSACTION 语法”
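例如(下列语句分别只影响各自注释说明的作用域):

```sql
-- 只改变下一个事务的隔离级别
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- 改变当前会话后续所有事务的隔离级别
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;

-- 改变所有后续新连接的默认隔离级别(需要足够的权限)
SET GLOBAL TRANSACTION ISOLATION LEVEL READ COMMITTED;
```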

InnoDB 对每种事务隔离级别使用不同的锁定策略。对于关键数据的操作,遵守 ACID 很重要,您可以使用默认的可重复读级别来强制执行高度的一致性。或者,在如批量报告这样精确一致性和可重复结果不如最小化锁定开销重要的情况下,您可以使用读已提交甚至读未提交来放宽一致性规则。可串化执行比可重复读更严格的规则,主要用于特殊场合,如 XA 事务,以及并发性和死锁问题的故障排除。

下面的列表描述了 MySQL 如何支持这些事务隔离级别,从最常用的级别到最少使用的级别依次介绍。

  • 可重复读

    这是 InnoDB 的默认隔离级别。同一事务内的一致性读读取的都是第一次读取建立的快照。这就意味着,如果在同一个事务中发出多个普通的(非锁定)SELECT 语句,这些 SELECT 语句彼此之间也是一致的。参阅 15.5.2.3 节, “一致性非锁定读”。

    对于锁定读(带有 FOR UPDATE 或 FOR SHARE 的 SELECT)、UPDATE 和 DELETE 语句,锁定行为取决于语句是使用带有唯一搜索条件的唯一索引,还是使用范围类型的搜索条件。

    • 对于带有唯一搜索条件的唯一索引,InnoDB 只锁定找到的索引记录,而不锁定其之前的 gap。

    • 对于其他搜索条件,InnoDB 会锁定扫描到的索引范围,使用 gap 锁或 next-key 锁来阻止其他会话向这个范围覆盖的 gap 中插入。请参阅 15.5.1 节, “InnoDB 锁定”。

  • 读已提交

    即使在同一事务中,每次一致性读都会设置并读取它自己的新快照。关于一致性读的信息,请参阅 15.5.2.3 节, “一致性非锁定读”。

    对于锁定读(带有 FOR UPDATE 或 FOR SHARE 的 SELECT)、UPDATE 语句和 DELETE 语句,InnoDB 仅锁定索引记录,不锁定记录前的 gap,因此允许紧挨着锁定的记录自由插入新记录。Gap 锁仅用于外键约束检查和重复键检查。

    因为禁用了 gap 锁定,其他会话可以向 gap 中插入新行,所以可能出现幻影问题。关于幻影的信息,请参阅 15.5.4 节, “幻行”。

    如果您使用读已提交,您 必须 使用基于行的形式进行二进制日志记录。

    使用 读已提交 还有其它影响:

    • 对于 UPDATE 或 DELETE 语句,InnoDB 仅对其更新或删除的行持有锁。MySQL 评估 WHERE 条件之后,会释放不匹配行上的 record 锁。这大大降低了死锁发生的可能性,但死锁仍可能发生。

    • 对于 UPDATE 语句,如果行已经被锁住,InnoDB 会执行一次半一致性读,将最新提交的版本返回给 MySQL,这样 MySQL 就能判断该行是否与 UPDATE 的 WHERE 条件相匹配。如果行匹配(必定要被更新),MySQL 会再次读取该行,这次 InnoDB 要么锁定它,要么等待其上已持有的锁。

    思考下面的示例,从这个表开始:

    CREATE TABLE t (a INT NOT NULL, b INT) ENGINE = InnoDB;
    INSERT INTO t VALUES (1,2),(2,3),(3,2),(4,3),(5,2);
    COMMIT;
    

    这种情况下,表没有索引,因此搜索和索引扫描使用隐藏的聚集索引来进行 record 锁定(参阅 15.8.2.1 节, “聚集和辅助索引”)。

    设想,一个客户端使用如下的语句执行一个 UPDATE 操作:

    SET autocommit = 0;
    UPDATE t SET b = 5 WHERE b = 3;
    

    接着,假设另一个客户端在第一个客户端之后执行下面的 UPDATE 语句:

    SET autocommit = 0;
    UPDATE t SET b = 4 WHERE b = 2;
    

    InnoDB 每次执行 UPDATE 时,会先对读取的每一行获取排他锁,然后决定是否修改该行。如果 InnoDB 不修改该行,它会释放锁;否则,InnoDB 会一直持有锁直到事务结束。这对事务处理的影响如下。

    当使用默认的 REPEATABLE READ 隔离级别时,第一个 UPDATE 在它读取的每一行上获取一个 x-锁,并且不释放其中任何一个:

    x-lock(1,2); retain x-lock
    x-lock(2,3); update(2,3) to (2,5); retain x-lock
    x-lock(3,2); retain x-lock
    x-lock(4,3); update(4,3) to (4,5); retain x-lock
    x-lock(5,2); retain x-lock
    

    第二个 UPDATE 一旦尝试获取任何锁,就会立即阻塞(因为第一个 UPDATE 已在所有行上持有锁),直到第一个 UPDATE 提交或回滚:

    x-lock(1,2); block and wait for first UPDATE to commit or roll back
    

    相反,如果使用的是 READ COMMITTED,那么第一个 UPDATE 在它读取的每一行上获取 x-锁,并释放其不修改的行上的锁:

    x-lock(1,2); unlock(1,2)
    x-lock(2,3); update(2,3) to (2,5); retain x-lock
    x-lock(3,2); unlock(3,2)
    x-lock(4,3); update(4,3) to (4,5); retain x-lock
    x-lock(5,2); unlock(5,2)
    

    对于第二个 UPDATE,InnoDB 会执行半一致性读,将每行最新提交的版本返回给 MySQL,这样 MySQL 就能判断该行是否与 UPDATE 的 WHERE 条件相匹配:

    x-lock(1,2); update(1,2) to (1,4); retain x-lock
    x-lock(2,3); unlock(2,3)
    x-lock(3,2); update(3,2) to (3,4); retain x-lock
    x-lock(4,3); unlock(4,3)
    x-lock(5,2); update(5,2) to (5,4); retain x-lock
    
  • 读未提交

    SELECT 语句以非锁定方式执行,但是可能会使用行的较早版本。因此,使用这个隔离级别时,读是不一致的,这种读被称为脏读。其他方面,这个隔离级别的工作方式与 READ COMMITTED 相同。

  • 可串化

    这个级别与 REPEATABLE READ 类似,但如果 autocommit 是禁用状态,InnoDB 会隐式地将所有普通的 SELECT 语句转换为 SELECT ... FOR SHARE。如果 autocommit 是开启状态,则每个 SELECT 本身就是一个事务,因而可以确定它是只读的;当以一致性(非锁定)读执行且无需为其他事务阻塞时,它可以被串化。(如果希望在其他事务修改了所选行时强制普通的 SELECT 阻塞,请禁用 autocommit。)
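下面的示意片段展示了可串化级别下普通 SELECT 的这种转换(假设表 t 已存在):

```sql
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SET autocommit = 0;
START TRANSACTION;
-- 在 SERIALIZABLE 且禁用 autocommit 时,下面的普通 SELECT
-- 被隐式转换为 SELECT ... FOR SHARE,会在读取的行上设置共享锁,
-- 从而阻塞其他事务修改这些行,直到本事务结束
SELECT * FROM t WHERE i = 1;
COMMIT;
```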

15.5.2.2 自动提交、提交和回滚

在 InnoDB 中,所有的用户活动都发生在事务之中。如果启用了 autocommit 模式,每个 SQL 语句自身就构成一个事务。默认情况下,MySQL 为每个新连接的会话启用 autocommit,这样,如果 SQL 语句执行后没有返回错误,MySQL 就在该语句之后做一次提交;如果语句返回错误,则根据错误进行提交或回滚,参阅 15.20.4 节, “InnoDB 错误处理”。

启用了 autocommit 的会话也可以执行多语句事务:以显式的 START TRANSACTION 或 BEGIN 语句开始,并以 COMMIT 或 ROLLBACK 语句结束,参阅 13.3.1 节, “START TRANSACTION、COMMIT、和 ROLLBACK语法”。

如果在会话中使用 SET autocommit = 0 禁用了 autocommit 模式,那么会话始终有一个打开的事务。COMMIT 或 ROLLBACK 语句结束当前事务后,会开启一个新的事务。

如果一个会话的 autocommit 是禁用状态,而且最终没有显式的提交事务,那么MySQL会回滚这个事务。

有些语句隐式地结束一个事务,就好像您在执行语句之前已经完成了一个 提交 。 详细的,请参阅13.3.3 节, “引起隐式提交的语句”

A COMMIT 意味着当前事务中所做的更改成为永久的,并且可以在其他会话中看到。而相反的, ROLLBACK 语句就是取消当前事务所做的修改。 COMMITROLLBACK 都会释放 当前事务持有的 InnoDB 锁。

事务分组DML操作

默认情况下,与 MySQL 的连接都启用了 autocommit 模式,这样每执行一条 SQL 语句都会自动提交。如果您有使用其他数据库系统的经验,那么这种操作方式可能会不习惯,因为在那些系统中,标准实践是执行一系列 DML 操作,然后要么全部一起提交,要么全部一起回滚。

为了使用多语句事务,可以用 SET autocommit = 0 关闭自动提交,并在适当的时候以 COMMIT 或 ROLLBACK 结束每个事务。如果要保持自动提交开启,则以 START TRANSACTION 开始每个事务,并以 COMMIT 或 ROLLBACK 结束。下面的示例显示两个事务,第一个被提交,第二个被回滚。

shell> mysql test

mysql> CREATE TABLE customer (a INT, b CHAR (20), INDEX (a));
Query OK, 0 rows affected (0.00 sec)
mysql> -- 开启自动提交,执行一个事务
mysql> START TRANSACTION;
Query OK, 0 rows affected (0.00 sec)
mysql> INSERT INTO customer VALUES (10, 'Heikki');
Query OK, 1 row affected (0.00 sec)
mysql> COMMIT;
Query OK, 0 rows affected (0.00 sec)
mysql> -- 关闭自动提交,执行另外一个事务
mysql> SET autocommit=0;
Query OK, 0 rows affected (0.00 sec)
mysql> INSERT INTO customer VALUES (15, 'John');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO customer VALUES (20, 'Paul');
Query OK, 1 row affected (0.00 sec)
mysql> DELETE FROM customer WHERE b = 'Heikki';
Query OK, 1 row affected (0.00 sec)
mysql> -- 现在撤销最后的2次插入 和1个删除。
mysql> ROLLBACK;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT * FROM customer;
+------+--------+
| a    | b      |
+------+--------+
|   10 | Heikki |
+------+--------+
1 row in set (0.00 sec)
mysql>
客户端语言中的事务

在 PHP、Perl DBI、JDBC、ODBC 或标准 C 调用接口这样的 API 中,您可以像发送 SELECT 或 INSERT 等其他 SQL 语句一样,将事务控制语句发送给 MySQL 服务器。一些 API 还单独提供特殊的事务提交和回滚函数或方法。

15.5.2.3 一致性非锁定读

一致性读意味着 InnoDB 使用多版本技术,向查询提供数据库在某个时间点的快照。查询将看到在该时间点之前已提交的事务所做的更改,而看不到之后提交或未提交的事务所做的更改。例外情况是,查询可以看到同一事务中较早语句所做的更改。这个例外会导致如下异常:如果您更新了表中的某些行,一个 SELECT 会看到被更新行的最新版本,但也可能看到其他行的旧版本。如果其他会话同时更新这个表,这个异常意味着您可能会看到数据库中从未存在过的表状态。

如果事务的 隔离级别可重复读 (默认级别), 同一事务中所有的一致性读都是读取该事务第一次读取生成的快照。您可以通过提交当前事务,然后执行新的查询就可以为查询获得一个新的快照。

读已提交 的隔离级别下,事务集内的每个一致性读都是读取自己最新的快照。

一致性读,是 InnoDBREAD COMMITTEDREPEATABLE READ 隔离级别下,处理 SELECT 语句的默认方式。一致性读不会在访问的表上设置任何锁,因此其他会话可以自由的修改这个表,同时在表上执行一致性读。

假设您的服务器运行在默认的 REPEATABLE READ 隔离级别。当您执行一致性读(就是普通的 SELECT 语句)时,InnoDB 会根据查询看到数据库的时刻,给事务指定一个时间点。如果另一个事务在该时间点之后删除了一行并提交,您不会看到该行已被删除。插入和更新也是类似处理。

注意

数据库状态的快照适用于事务中的 SELECT 语句,而不一定适用于 DML 语句。如果您插入或修改了一些行然后提交事务,另一个在可重复读隔离级别下并发的事务所执行的 DELETE 或 UPDATE 语句,可能会影响到那些刚被提交的行,即使该会话无法查询到它们。如果一个事务更新或删除了另一个事务已提交的行,这些变更对当前事务就变得可见。举个例子,您可能会遇到如下情况:

SELECT COUNT(c1) FROM t1 WHERE c1 = 'xyz';
-- Returns 0: no rows match.
DELETE FROM t1 WHERE c1 = 'xyz';
-- 删除其他事务最近提交的几行记录。

SELECT COUNT(c2) FROM t1 WHERE c2 = 'abc';
-- Returns 0: no rows match.
UPDATE t1 SET c2 = 'cba' WHERE c2 = 'abc';
-- Affects 10 rows: 另外一个事务txn刚刚提交了10行,值为'abc'。
SELECT COUNT(c2) FROM t1 WHERE c2 = 'cba';
-- Returns 10: 现在事务txn可以看到刚刚修改的行。

可以通过提交事务,然后执行新的 SELECT 或 START TRANSACTION WITH CONSISTENT SNAPSHOT 来推进时间点。

这个被称为:多版本并发控制

下面的例子中,会话 A 只有在 B 提交了插入并且 A 自己也提交之后,才能看见 B 插入的行,这样时间点才被推进到 B 提交之后。

             Session A              Session B

           SET autocommit=0;      SET autocommit=0;
time
|          SELECT * FROM t;
|          empty set
|                                 INSERT INTO t VALUES (1, 2);
|
v          SELECT * FROM t;
           empty set
                                  COMMIT;

           SELECT * FROM t;
           empty set

           COMMIT;

           SELECT * FROM t;
           ---------------------
           |    1    |    2    |
           ---------------------

如果您想看到 最新的数据状态,可以使用 读已提交隔离级别,或者使用一个 锁定读:

SELECT * FROM t FOR SHARE;

在读已提交隔离级别下,事务内的每次一致性读都会设置并读取它自己的新快照。而使用 FOR SHARE 时,执行的是锁定读:SELECT 会阻塞,直到包含最新行的事务结束(参阅 15.5.2.4 节, “锁定读”)。

一致性读在某些DDL语句不起作用:

  • DROP TABLE 时,一致性读不起作用,因为 MySQL 不能使用已被删除的表,而且 InnoDB 会销毁该表。

  • ALTER TABLE 时,一致性读不起作用,因为该语句会创建原始表的临时副本,并在临时副本构建完成后删除原始表。当您在事务中继续执行一致性读时,新表中的行是不可见的,因为事务的快照创建时这些行还不存在。这种情况下,事务会返回错误 ER_TABLE_DEF_CHANGED:“表定义已经改变,请重试事务”。

对于 INSERT INTO ... SELECT、UPDATE ... (SELECT) 和 CREATE TABLE ... SELECT 等语句中没有指定 FOR UPDATE 或 FOR SHARE 的 SELECT 部分,读的类型有所不同:

  • 默认情况下,InnoDB 使用更强的锁,并且 SELECT 部分像在读已提交隔离级别中一样执行:即使在同一个事务中,每次一致性读都会设置并读取它自己的新快照。

  • 要在这些场景中使用一致性读,请将事务的隔离级别设置为读未提交、读已提交或可重复读(也就是说,除可串化之外的任何级别)。在这种情况下,从被读取的表中读取行时不会设置锁。
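例如,若希望 INSERT ... SELECT 的 SELECT 部分以一致性读(不加锁)执行,可以像下面这样设置(示意片段,表名 t 和 s 为假设):

```sql
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
START TRANSACTION;
-- 在 READ COMMITTED 下,对 s 的读取作为一致性读执行,
-- 不会在 s 的行上设置共享 next-key 锁
INSERT INTO t SELECT * FROM s WHERE c1 > 100;
COMMIT;
```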

15.5.2.4 锁定读

如果您在同一个事务中先查询数据,然后插入或更新相关的数据,普通的 SELECT 语句没有提供足够的保护:其他事务可以更新或删除您刚刚查询的这些行。InnoDB 提供两种类型的锁定读以提供额外的安全保障:

  • SELECT ... FOR SHARE

    在读取的行上设置共享模式的锁。其他会话可以读取这些行,但在您的事务提交之前不能修改它们。如果这些行中的任何一行被另一个尚未提交的事务更改,您的查询会等待该事务结束,然后使用最新的值。

    注意

    SELECT ... FOR SHARE 是 SELECT ... LOCK IN SHARE MODE 的替代品,但 LOCK IN SHARE MODE 为向后兼容仍然可用。这两个语句是等同的,但 FOR SHARE 还支持 OF table_name、NOWAIT 和 SKIP LOCKED 选项。参阅 NOWAIT 和 SKIP LOCKED 的锁定读并发性。

  • SELECT ... FOR UPDATE

    对于搜索遇到的索引记录,锁定行和任何相关的索引条目,就像您对这些行执行了 UPDATE 语句一样。其他事务会被阻塞,不能更新这些行、执行 SELECT ... FOR SHARE,或在某些事务隔离级别下读取数据。一致性读会忽略 read 视图中存在的记录上设置的任何锁。(旧版本的记录无法被锁定;它们是通过在记录的内存副本上应用 undo 日志来重建的。)

这些子句在处理树状或图状结构的数据时非常有用,无论数据在单个表中还是跨多个表。您可以在图中从一处遍历到另一处,同时保留返回并更改任何这些“指针”值的权利。
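例如,在一个用父指针表示的树形表中,可以像下面这样在遍历时锁定路径上的节点(示意,表名和结构均为假设):

```sql
CREATE TABLE node (
  id INT PRIMARY KEY,
  parent_id INT,        -- 指向父节点的“指针”
  payload VARCHAR(100)
) ENGINE = InnoDB;

START TRANSACTION;
-- 锁定子节点及其父节点,保留稍后回来修改 parent_id 的权利
SELECT * FROM node WHERE id = 42 FOR UPDATE;
SELECT * FROM node
 WHERE id = (SELECT parent_id FROM node WHERE id = 42) FOR UPDATE;
-- ……根据需要修改指针……
UPDATE node SET parent_id = 7 WHERE id = 42;
COMMIT;
```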

当事务提交或回滚时,FOR SHARE 和 FOR UPDATE 查询设置的所有锁都会被释放。

注意

只有当禁用自动提交时(或者以 START TRANSACTION 开始事务,或者将 autocommit 设置为 0),SELECT FOR UPDATE 才会锁定行用于更新。如果自动提交是开启的,那么与条件匹配的行不会被锁定。

锁定读举例

假设您要向 child表插入一行,并确保 child表中的行在它的父表 parent中有相应的行。您的应用程序代码可以在整个操作序列中确保引用完整性。

首先,对 parent 表进行一致性读查询,以确保父行存在。这样就能安全地将行插入 child 表了吗?不能,因为在您执行 SELECT 和 INSERT 之间,其他会话可能恰好将父行删除,而您并不知情。

为了避免这个潜在的问题,在执行 SELECT的时候使用 FOR SHARE:

SELECT * FROM parent WHERE NAME = 'Jones' FOR SHARE;

如果 FOR SHARE 查询返回了父行 'Jones',您就可以安全地向 CHILD 表插入子记录并提交事务。任何试图在 PARENT 表相应行上获取排他锁的事务都需要等待,直到您的事务完成,也就是说,直到所有表中的数据都处于一致状态。

再看另一个例子:CHILD_CODES 表中有一个整数计数器字段,用于给插入到 CHILD 表的每条记录分配唯一标识符。不要使用一致性读或共享模式读来读取计数器的当前值,因为数据库的两个用户可能会看到相同的计数器值;如果两个事务尝试用相同的标识符向 CHILD 表插入行,就会出现重复键错误。

这里,FOR SHARE 不是好的解决方案,因为如果两个用户同时读取计数器,当其中一个试图更新计数器时,至少会有一个陷入死锁。

为了实现读取和递增计数器,首先使用FOR UPDATE对计数器执行锁定读操作,然后增加计数器的值。例如:

SELECT counter_field FROM child_codes FOR UPDATE;
UPDATE child_codes SET counter_field = counter_field + 1;

SELECT ... FOR UPDATE 读取最新的可用数据,并在它读取的每一行上设置排他锁。因此,它设置的锁与带搜索条件的 UPDATE 语句在这些行上设置的锁相同。

前面仅仅是举例描述SELECT ... FOR UPDATE是如何工作。在MySQL中,生成唯一标识符的具体任务,实际上可以通过对表的单一访问来实现:

UPDATE child_codes SET counter_field = LAST_INSERT_ID(counter_field + 1);
SELECT LAST_INSERT_ID();

SELECT 语句仅仅检索标识符信息(特定于当前连接),它不访问任何表。

NOWAIT 和 SKIP LOCKED的锁定读并发性

If a row is locked by a transaction, a SELECT ... FOR UPDATE or SELECT ... FOR SHARE transaction that requests the same locked row must wait until the blocking transaction releases the row lock. This behavior prevents transactions from updating or deleting rows that are queried for updates by other transactions. However, waiting for a row lock to be released is not necessary if you want the query to return immediately when a requested row is locked, or if excluding locked rows from the result set is acceptable.

为了避免等待其他事务释放行锁,可以在 SELECT ... FOR UPDATE 和 SELECT ... FOR SHARE 锁定读语句中使用 NOWAIT 和 SKIP LOCKED 选项。

  • NOWAIT

    使用 NOWAIT 的锁定读从不等待获取行锁。查询立即执行,如果请求的行被锁定,则执行失败并返回错误。

  • SKIP LOCKED

    使用 SKIP LOCKED 的锁定读从不等待获取行锁。查询立即执行,并从结果集中移除被锁定的行。

    注意

    Queries that skip locked rows return an inconsistent view of the data. SKIP LOCKED is therefore not suitable for general transactional work. However, it may be used to avoid lock contention when multiple sessions access the same queue-like table.

NOWAIT 和 SKIP LOCKED 仅适用于行级锁。

在语句中使用 NOWAIT 或 SKIP LOCKED 对于基于语句的复制是不安全的。

The following example demonstrates NOWAIT and SKIP LOCKED. Session 1 starts a transaction that takes a row lock on a single record. Session 2 attempts a locking read on the same record using the NOWAIT option. Because the requested row is locked by Session 1, the locking read returns immediately with an error. In Session 3, the locking read with SKIP LOCKED returns the requested rows except for the row that is locked by Session 1.

# 会话 1:

mysql> CREATE TABLE t (i INT, PRIMARY KEY (i)) ENGINE = InnoDB;

mysql> INSERT INTO t (i) VALUES(1),(2),(3);

mysql> START TRANSACTION;

mysql> SELECT * FROM t WHERE i = 2 FOR UPDATE;
+---+
| i |
+---+
| 2 |
+---+

# 会话 2:

mysql> START TRANSACTION;

mysql> SELECT * FROM t WHERE i = 2 FOR UPDATE NOWAIT;
ERROR 3572 (HY000): Statement aborted because lock(s) could not be acquired immediately and NOWAIT is set.

# 会话 3:

mysql> START TRANSACTION;

mysql> SELECT * FROM t FOR UPDATE SKIP LOCKED;
+---+
| i |
+---+
| 1 |
| 3 |
+---+          

15.5.3 InnoDB中不同的SQL语句设置锁

A locking read, an UPDATE, or a DELETE generally set record locks on every index record that is scanned in the processing of the SQL statement. It does not matter whether there are WHERE conditions in the statement that would exclude the row. InnoDB does not remember the exact WHERE condition, but only knows which index ranges were scanned. The locks are normally next-key locks that also block inserts into the gap immediately before the record. However, gap locking can be disabled explicitly, which causes next-key locking not to be used. For more information, see 15.5.1 节, “InnoDB Locking”. The transaction isolation level also can affect which locks are set; see 15.5.2.1, “Transaction Isolation Levels”.

If a secondary index is used in a search and index record locks to be set are exclusive, InnoDB also retrieves the corresponding clustered index records and sets locks on them.

Differences between shared and exclusive locks are described in 15.5.1 节, “InnoDB Locking”.

If you have no indexes suitable for your statement and MySQL must scan the entire table to process the statement, every row of the table becomes locked, which in turn blocks all inserts by other users to the table. It is important to create good indexes so that your queries do not unnecessarily scan many rows.

For SELECT ... FOR UPDATE or SELECT ... FOR SHARE, locks are acquired for scanned rows, and expected to be released for rows that do not qualify for inclusion in the result set (for example, if they do not meet the criteria given in the WHERE clause). However, in some cases, rows might not be unlocked immediately because the relationship between a result row and its original source is lost during query execution. For example, in a UNION, scanned (and locked) rows from a table might be inserted into a temporary table before evaluation whether they qualify for the result set. In this circumstance, the relationship of the rows in the temporary table to the rows in the original table is lost and the latter rows are not unlocked until the end of query execution.

InnoDB sets specific types of locks as follows.

  • SELECT ... FROM is a consistent read, reading a snapshot of the database and setting no locks unless the transaction isolation level is set to SERIALIZABLE. For SERIALIZABLE level, the search sets shared next-key locks on the index records it encounters. However, only an index record lock is required for statements that lock rows using a unique index to search for a unique row.

  • SELECT ... FROM ... FOR SHARE sets shared next-key locks on all index records the search encounters. However, only an index record lock is required for statements that lock rows using a unique index to search for a unique row.

  • SELECT ... FROM ... FOR UPDATE sets an exclusive next-key lock on every record the search encounters. However, only an index record lock is required for statements that lock rows using a unique index to search for a unique row.

    For index records the search encounters, SELECT ... FROM ... FOR UPDATE blocks other sessions from doing SELECT ... FROM ... FOR SHARE or from reading in certain transaction isolation levels. Consistent reads ignore any locks set on the records that exist in the read view.

  • UPDATE ... WHERE ... sets an exclusive next-key lock on every record the search encounters. However, only an index record lock is required for statements that lock rows using a unique index to search for a unique row.

  • When UPDATE modifies a clustered index record, implicit locks are taken on affected secondary index records. The UPDATE operation also takes shared locks on affected secondary index records when performing duplicate check scans prior to inserting new secondary index records, and when inserting new secondary index records.

  • DELETE FROM ... WHERE ... sets an exclusive next-key lock on every record the search encounters. However, only an index record lock is required for statements that lock rows using a unique index to search for a unique row.

  • INSERT sets an exclusive lock on the inserted row. This lock is an index-record lock, not a next-key lock (that is, there is no gap lock) and does not prevent other sessions from inserting into the gap before the inserted row.

    Prior to inserting the row, a type of gap lock called an insert intention gap lock is set. This lock signals the intent to insert in such a way that multiple transactions inserting into the same index gap need not wait for each other if they are not inserting at the same position within the gap. Suppose that there are index records with values of 4 and 7. Separate transactions that attempt to insert values of 5 and 6 each lock the gap between 4 and 7 with insert intention locks prior to obtaining the exclusive lock on the inserted row, but do not block each other because the rows are nonconflicting.

    If a duplicate-key error occurs, a shared lock on the duplicate index record is set. This use of a shared lock can result in deadlock should there be multiple sessions trying to insert the same row if another session already has an exclusive lock. This can occur if another session deletes the row. Suppose that an InnoDB table t1 has the following structure:

    CREATE TABLE t1 (i INT, PRIMARY KEY (i)) ENGINE = InnoDB;
    

    Now suppose that three sessions perform the following operations in order:

    Session 1:

    START TRANSACTION;
    INSERT INTO t1 VALUES(1);
    

    Session 2:

    START TRANSACTION;
    INSERT INTO t1 VALUES(1);
    

    Session 3:

    START TRANSACTION;
    INSERT INTO t1 VALUES(1);
    

    Session 1:

    ROLLBACK;
    

    The first operation by session 1 acquires an exclusive lock for the row. The operations by sessions 2 and 3 both result in a duplicate-key error and they both request a shared lock for the row. When session 1 rolls back, it releases its exclusive lock on the row and the queued shared lock requests for sessions 2 and 3 are granted. At this point, sessions 2 and 3 deadlock: Neither can acquire an exclusive lock for the row because of the shared lock held by the other.

    A similar situation occurs if the table already contains a row with key value 1 and three sessions perform the following operations in order:

    Session 1:

    START TRANSACTION;
    DELETE FROM t1 WHERE i = 1;
    

    Session 2:

    START TRANSACTION;
    INSERT INTO t1 VALUES(1);
    

    Session 3:

    START TRANSACTION;
    INSERT INTO t1 VALUES(1);
    

    Session 1:

    COMMIT;
    

    The first operation by session 1 acquires an exclusive lock for the row. The operations by sessions 2 and 3 both result in a duplicate-key error and they both request a shared lock for the row. When session 1 commits, it releases its exclusive lock on the row and the queued shared lock requests for sessions 2 and 3 are granted. At this point, sessions 2 and 3 deadlock: Neither can acquire an exclusive lock for the row because of the shared lock held by the other.

  • INSERT ... ON DUPLICATE KEY UPDATE differs from a simple INSERT in that an exclusive lock rather than a shared lock is placed on the row to be updated when a duplicate-key error occurs. An exclusive index-record lock is taken for a duplicate primary key value. An exclusive next-key lock is taken for a duplicate unique key value.

  • REPLACE is done like an INSERT if there is no collision on a unique key. Otherwise, an exclusive next-key lock is placed on the row to be replaced.

  • INSERT INTO T SELECT ... FROM S WHERE ... sets an exclusive index record lock (without a gap lock) on each row inserted into T. If the transaction isolation level is READ COMMITTED, InnoDB does the search on S as a consistent read (no locks). Otherwise, InnoDB sets shared next-key locks on rows from S. InnoDB has to set locks in the latter case: In roll-forward recovery from a backup, every SQL statement must be executed in exactly the same way it was done originally.

    CREATE TABLE ... SELECT ... performs the SELECT with shared next-key locks or as a consistent read, as for INSERT ... SELECT.

    When a SELECT is used in the constructs REPLACE INTO t SELECT ... FROM s WHERE ... or UPDATE t ... WHERE col IN (SELECT ... FROM s ...), InnoDB sets shared next-key locks on rows from table s.

  • While initializing a previously specified AUTO_INCREMENT column on a table, InnoDB sets an exclusive lock on the end of the index associated with the AUTO_INCREMENT column. In accessing the auto-increment counter, InnoDB uses a specific AUTO-INC table lock mode where the lock lasts only to the end of the current SQL statement, not to the end of the entire transaction. Other sessions cannot insert into the table while the AUTO-INC table lock is held; see 15.5.2 节, “InnoDB Transaction Model”.

    InnoDB fetches the value of a previously initialized AUTO_INCREMENT column without setting any locks.

  • If a FOREIGN KEY constraint is defined on a table, any insert, update, or delete that requires the constraint condition to be checked sets shared record-level locks on the records that it looks at to check the constraint. InnoDB also sets these locks in the case where the constraint fails.

  • LOCK TABLES sets table locks, but it is the higher MySQL layer above the InnoDB layer that sets these locks. InnoDB is aware of table locks if innodb_table_locks = 1 (the default) and autocommit = 0, and the MySQL layer above InnoDB knows about row-level locks.

    Otherwise, InnoDB's automatic deadlock detection cannot detect deadlocks where such table locks are involved. Also, because in this case the higher MySQL layer does not know about row-level locks, it is possible to get a table lock on a table where another session currently has row-level locks. However, this does not endanger transaction integrity, as discussed in 15.5.5.2 节, “Deadlock Detection and Rollback”. See also 15.8.1.7 节, “Limits on InnoDB Tables”.

15.5.4 幻行

当同一个查询在不同的时间产生不同的行集时,就会出现所谓的幻象问题。例如,如果一个 SELECT 执行了两次,但第二次返回了第一次没有返回的一行,这个行就是一个“幻”行。

假设child表的id列上有一个索引,然后您想读取并锁定所有值大于100的行,之后想在选取的行中更新某些列:

SELECT * FROM child WHERE id > 100 FOR UPDATE;

查询从 id 大于 100 的第一条记录开始扫描索引。假设表中包含 id 值为 90 和 102 的行。如果设置在扫描范围内索引记录上的锁没有锁住向 gap(这里是 90 和 102 之间的间隙)的插入,那么另一个会话就能向表中插入一条 id 为 101 的新行。如果您在同一个事务中再次执行相同的 SELECT,您会在查询返回的结果集中看到一个 id 为 101 的新行(一个“幻象”)。如果我们将行集视为数据项,这个新的幻象子行就违反了事务隔离原则:事务运行期间,它所读取的数据不应发生变化。

为了防止幻象,InnoDB 使用一种叫做 next-key 锁定的算法,它结合了 index-row 锁定和 gap 锁定。InnoDB 执行行级锁定的方式是:当它搜索或扫描表索引时,在遇到的索引记录上设置共享或排他锁。因此,行级锁实际上是 index-record 锁。此外,索引记录上的 next-key 锁还会影响该索引记录之前的 gap。也就是说,next-key 锁是一个 index-record 锁加上该索引记录前 gap 上的 gap 锁。如果一个会话在索引中的记录 R 上持有共享或排他锁,那么另一个会话不能按索引顺序在 R 之前的 gap 中立即插入新的索引记录。

InnoDB 扫描索引时,还会锁定索引中最后一条记录之后的 gap。上面的例子正说明了这一点:为了阻止任何 id 大于 100 的行插入表中,InnoDB 设置的锁包括 id 值 102 之后 gap 上的锁。

您可以在应用程序中使用 next-key 锁定实现唯一性检查:如果以共享模式读取数据,并且没有看到与您即将插入的行重复的记录,那么您可以安全地插入该行;因为在读取期间您的行上已设置了 next-key 锁,这会阻止其他任何人插入重复的行。因此,next-key 锁定可以让您“锁定”表中不存在的内容。
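下面的两个会话示意了这种唯一性检查(假设 child 表的 id 列上有索引):

```sql
-- 会话 1:以共享模式读取,确认不存在 id = 101 的行
START TRANSACTION;
SELECT * FROM child WHERE id = 101 FOR SHARE;  -- 空结果集,并在该 gap 上设置 next-key 锁

-- 会话 2:此时尝试插入 id = 101 会被阻塞,
-- 直到会话 1 提交或回滚
INSERT INTO child (id) VALUES (101);

-- 会话 1:可以安全地插入,不必担心出现重复行
INSERT INTO child (id) VALUES (101);
COMMIT;
-- 会话 1 提交后,会话 2 被阻塞的插入以重复键错误失败
```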

Gap 锁定可以被禁用,如 15.5.1 节, “InnoDB Locking”中描述的。这可能导致幻象问题,因为禁用 gap 锁定后,其他会话可以向 gap 中插入新行。

15.5.5 InnoDB中的死锁

死锁是这样一种场景:两个事务各自持有对方需要的资源上的锁,导致双方都无法继续进行。因为两个事务都在等待资源变为可用,所以都不会释放自己持有的锁。

当两个事务锁定多个表中的行(通过 UPDATE 或 SELECT ... FOR UPDATE 这样的语句),但顺序相反时,就会出现死锁。当这类语句对索引记录范围和 gap 进行锁定,而每个事务由于时序问题只获得了部分锁、未获得其他锁时,也会出现死锁。死锁示例参阅 15.5.5.1 节, “一个InnoDB 死锁示例”。

为了减少死锁的可能性:使用事务而不是 LOCK TABLES 语句;保持事务小巧、持续时间短;当不同的事务更新多个表或者大范围的行时,每个事务使用相同的操作顺序(如 SELECT ... FOR UPDATE);在 SELECT ... FOR UPDATE 和 UPDATE ... WHERE 语句使用的列上创建索引。死锁的可能性不受隔离级别的影响,因为隔离级别改变的是读操作的行为,而死锁是由写操作引起的。更多关于避免和处理死锁的信息,请参阅 15.5.5.3 节, “如何最小化和处理死锁”。

当死锁检测(默认开启)处于启用状态并且发生了死锁时,InnoDB 会检测到这个情况,并回滚其中一个事务(牺牲者)。如果使用 innodb_deadlock_detect 配置选项禁用了死锁检测,InnoDB 则依靠 innodb_lock_wait_timeout 设置在死锁发生时回滚事务。因此,即使您的应用程序逻辑是正确的,您仍然必须处理事务需要重试的情况。使用 SHOW ENGINE INNODB STATUS 命令可以查看 InnoDB 用户事务最近一次的死锁。如果频繁出现死锁,反映出事务结构或应用程序的错误处理有问题,可以在启用 innodb_print_all_deadlocks 设置的情况下运行服务器,将所有死锁的信息打印到 mysqld 错误日志中。更多关于死锁如何自动检测和处理的信息,请参阅 15.5.5.2 节, “死锁检测和回滚”。
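下面是查看和记录死锁信息的示意命令(需要相应的权限):

```sql
-- 查看最近一次检测到的死锁(输出中的 LATEST DETECTED DEADLOCK 部分)
SHOW ENGINE INNODB STATUS\G

-- 将之后发生的所有死锁信息记录到 mysqld 错误日志
SET GLOBAL innodb_print_all_deadlocks = ON;

-- 调试完成后关闭
SET GLOBAL innodb_print_all_deadlocks = OFF;
```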

15.5.5.1 一个InnoDB死锁案例

下面举例说明当锁请求导致死锁时,错误是如何发生的。示例涉及两个客户端 A 和 B。

首先,客户端 A 创建一个表并插入一行记录,然后开启一个事务。在这个事务中,A 通过以共享模式读取该行,获得了它上面的一个 S 锁:

mysql> CREATE TABLE t (i INT) ENGINE = InnoDB;
Query OK, 0 rows affected (1.07 sec)

mysql> INSERT INTO t (i) VALUES(1);
Query OK, 1 row affected (0.09 sec)

mysql> START TRANSACTION;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT * FROM t WHERE i = 1 FOR SHARE;
+------+
| i    |
+------+
|    1 |
+------+

接着,客户端B,开启一个事务,尝试将表中的行删除:

mysql> START TRANSACTION;
Query OK, 0 rows affected (0.00 sec)

mysql> DELETE FROM t WHERE i = 1;

这个删除操作需要一个 X 锁。这个锁不能被授予,因为它与客户端 A 持有的 S 锁不兼容,所以该请求进入针对这一行的锁请求队列,客户端 B 被阻塞。

最后,客户端A同样尝试对表的行进行删除:

mysql> DELETE FROM t WHERE i = 1;
ERROR 1213 (40001): Deadlock found when trying to get lock;
try restarting transaction

现在出现了死锁:客户端 A 需要一个 X 锁来删除该行,但这个锁请求不能被授予,因为客户端 B 已经请求了 X 锁,并且正在等待客户端 A 释放 S 锁;而由于 B 先行请求了 X 锁,A 持有的 S 锁也不能升级为 X 锁。因此,InnoDB 会为其中一个客户端产生一个错误并释放其锁。该客户端返回这样的错误:

ERROR 1213 (40001): Deadlock found when trying to get lock;
try restarting transaction

那么此时,另一个客户端请求的锁就可以被授予,然后它将表中的行删除掉。

15.5.5.2 死锁检测和回滚

当死锁检测(默认启用)处于开启状态时,InnoDB 自动检测事务死锁,并回滚一个或多个事务以打破死锁。InnoDB 会尝试选择小的事务进行回滚,事务的大小由其插入、更新和删除的行数决定。

如果 innodb_table_locks = 1(默认值)且 autocommit = 0,InnoDB 能感知表锁,并且 InnoDB 之上的 MySQL 层也知道行级锁。否则,InnoDB 无法检测涉及由 MySQL 的 LOCK TABLES 语句设置的表锁,或由 InnoDB 之外的存储引擎设置的锁的死锁。这类情况可以通过设置系统变量 innodb_lock_wait_timeout 的值来解决。

InnoDB 对一个事务执行完全回滚时,该事务所持有的所有锁都会被释放。如果仅仅是一个错误而回滚一个SQL语句,那么这个语句设置的某些锁会被保留,这是因为 InnoDB用一种格式储存行锁,以至于它不知道哪个锁是哪个语句设置的。

如果事务中的一个 SELECT 调用了一个存储函数,而函数中的一个语句失败了,那么该语句会被回滚。除此之外,如果 ROLLBACK是在这之后发生的,那么整个事务都会被回滚。

如果 InnoDB 监控器输出的 LATEST DETECTED DEADLOCK 部分包含这样一条消息:“TOO DEEP OR LONG SEARCH IN THE LOCK TABLE WAITS-FOR GRAPH, WE WILL ROLL BACK FOLLOWING TRANSACTION”,这表明等待列表(wait-for list)中的事务数量达到了 200 的上限。超过 200 个事务的等待列表被视为死锁,尝试检查该等待列表的事务会被回滚。如果锁定线程必须查看的等待列表中,各事务拥有的锁总数超过 1,000,000 个,也可能出现同样的错误。

对于组织数据库操作以避免死锁的技术,参阅15.5.5 节, “InnoDB中的死锁”

禁用死锁检测

在高并发系统中,当很多线程等待同一个锁时,死锁检测可能导致系统变慢。有时,禁用死锁检测,并在发生死锁时依靠 innodb_lock_wait_timeout 超时设置进行事务回滚,可能会更有效。可以使用 innodb_deadlock_detect 配置选项禁用死锁检测。
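例如(示意片段;这两个变量都是动态的全局变量):

```sql
-- 关闭自动死锁检测
SET GLOBAL innodb_deadlock_detect = OFF;

-- 依靠锁等待超时(单位为秒,默认 50)回滚陷入等待的事务
SET GLOBAL innodb_lock_wait_timeout = 10;
```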

15.5.5.3 如何最小化和处理死锁

本节建立在前面关于死锁的概念性介绍之上,说明如何组织数据库操作以最小化死锁,以及应用程序中所需的后续错误处理。

死锁是事务型数据库中的一个经典问题,但它们并不可怕,除非频繁到让您根本无法运行某些事务。通常,您必须编写应用程序,使其在事务因死锁被回滚时,总是准备好重新发出该事务。

InnoDB 使用自动行级锁定。即使是只插入或删除一行的事务,也可能遇到死锁。这是因为这些操作并不是真正“原子”的:它们会自动在所插入或删除行的(可能是多个)索引记录上设置锁。

您可以使用下面的技术来处理死锁,和减少死锁发生的可能性:

  • 任何时候都可以执行 SHOW ENGINE INNODB STATUS 命令来查看最近一次死锁的原因,这可以帮助您调整应用程序以避免死锁。

  • 如果频繁发生死锁,可以启用 innodb_print_all_deadlocks 配置选项来收集更多的调试信息。该选项会将所有死锁(而不仅仅是最新的死锁)的信息记录在 MySQL 错误日志中。调试完成后请禁用此选项。

  • 如果事务因为死锁而失败,随时准备重试,死锁并不危险,只需重试即可。

  • 保持事务的周期短小以减少冲突的发生。

  • 在进行了一系列相关更改后应立即提交事务,使其不易发生冲突。特别要注意的是,不要让一个交互的mysql会话在一个未提交的事务中打开很长时间。

  • 如果您使用了 锁定读 (SELECT ... FOR UPDATESELECT ... FOR SHARE),试着使用较低的隔离级别,如: 读已提交

  • 当在一个事务中修改多个表,或同一个表中的不同行集时,每次都以一致的顺序执行这些操作。这样事务会形成良好定义的队列,而不会死锁。例如,将数据库操作组织到应用程序的函数中,或调用存储例程,而不是在不同地方编写多个相似的 INSERT、UPDATE 和 DELETE 语句序列。

  • 给您的表添加恰当的索引,这样您的查询只需扫描更少的索引记录,从而会设置更少的锁。使用 EXPLAIN SELECT 来确定MySQL认为的哪些索引最适合您的查询。

  • 减少使用锁。如果允许查询从旧快照返回数据,就不要给它加 FOR UPDATE 或 FOR SHARE 子句。在这里使用 READ COMMITTED 隔离级别是不错的选择,因为同一事务中的每次一致性读都从它自己的新快照读取。

  • 如果其他方法都没有帮助,那么使用表级锁来串化您的事务。在事务性表(如 InnoDB 表)上使用 LOCK TABLES 的正确方式是:先用 SET autocommit = 0(而非 START TRANSACTION)开始事务,随后执行 LOCK TABLES;并且在显式提交事务之前,不要调用 UNLOCK TABLES。举个例子:如果您需要写表 t1 并读表 t2,可以这样做:

    SET autocommit=0;
    LOCK TABLES t1 WRITE, t2 READ, ...;
    ... do something with tables t1 and t2 here ...
    COMMIT;
    UNLOCK TABLES;
    

    表级锁防止对表的并发更新,从而避免死锁,但代价是降低了繁忙系统的响应能力。

  • 另一种串化事务的方法是创建一个只包含一行的辅助“信号量”表,让每个事务在访问其他表之前先更新该行。这样,所有事务都以串行方式执行。注意,InnoDB 的即时死锁检测算法在这种情况下同样有效,因为序列化锁是一个行级锁。使用 MySQL 表级锁时,则必须使用超时方法来解决死锁。
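上面描述的“信号量”表方法可以这样示意(表名和结构为假设):

```sql
CREATE TABLE semaphore (id INT PRIMARY KEY, dummy INT) ENGINE = InnoDB;
INSERT INTO semaphore VALUES (1, 0);

-- 每个事务在访问其他表之前先更新这一行,
-- 使所有这样的事务在这个行级锁上串行执行
START TRANSACTION;
UPDATE semaphore SET dummy = dummy + 1 WHERE id = 1;
-- ……在这里访问其他表……
COMMIT;
```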

15.6 InnoDB 配置

本节提供InnoDB初始化、启动以及InnoDB存储引擎的各种组件和特性的配置信息和过程。

15.6.1 InnoDB 启动设置

关于 InnoDB 的配置,首先要考虑的是数据文件、日志文件、页大小和内存缓冲区。建议您在创建 InnoDB 实例之前定义数据文件、日志文件和页大小的配置。在创建 InnoDB 实例之后修改数据文件或日志文件的配置,可能需要一个复杂的过程;而页大小只能在第一次初始化 InnoDB 实例时定义。

除了这些主题外,本节还提供了关于在配置文件中指定InnoDB 选项,观察InnoDB初始化信息,以及重要的存储注意事项的信息。

指定MySQL配置文件中选项

因为 MySQL 使用数据文件、日志文件和页大小的配置来初始化 InnoDB 实例,建议您在首次初始化 InnoDB 之前,将这些设置定义在 MySQL 启动时读取的选项文件中。InnoDB 在 MySQL 服务器启动时被初始化,所以 InnoDB 的首次初始化通常发生在您第一次启动 MySQL 服务器时。

您可以将InnoDB 的选项配置放在服务器启动时读取的任何选项文件中的[mysqld]组中。MySQL的选项文件位置在 4.2.6 节, “使用选项文件”中有描述。

要确保 mysqld 仅从一个指定的文件(以及 mysqld-auto.cnf)读取选项,请在启动服务器时将 --defaults-file 选项用作命令行的第一个选项:

mysqld --defaults-file=path_to_configuration_file

观察InnoDB初始化信息

从命令行启动 mysqld时,可以在 服务器启动期间观察InnoDB 初始化信息。 当mysqld从命令提示符启动时,将把初始化信息打印到控制台。

例如,在Windows上,如果mysqld位于 C:\Program Files\MySQL\MySQL Server 8.0\bin,可以这样启动MySQL服务器:

C:\> "C:\Program Files\MySQL\MySQL Server 8.0\bin\mysqld" --console

在类Unix系统中, mysqld 位于MySQL安装目录下的 bin 目录中:

shell> bin/mysqld --user=mysql &

如果您没有把服务器输出发送到控制台,可以在启动服务器后检查错误日志,观察启动过程中 InnoDB 的初始化信息。

关于使用其他方式启动 MySQL 的信息,请参阅 2.9.5 节, “自动启动和关闭MySQL”。

储存的重要注意事项

Review the following storage-related considerations before proceeding with your startup configuration.

  • In some cases, database performance improves if the data is not all placed on the same physical disk. Putting log files on a different disk from data is very often beneficial for performance. For example, you can place system tablespace data files and log files on different disks. You can also use raw disk partitions (raw devices) for InnoDB data files, which may speed up I/O. See 15.7.3 节, “Using Raw Disk Partitions for the System Tablespace”.

  • InnoDB is a transaction-safe (ACID compliant) storage engine for MySQL that has commit, rollback, and crash-recovery capabilities to protect user data. However, it cannot do so if the underlying operating system or hardware does not work as advertised. Many operating systems or disk subsystems may delay or reorder write operations to improve performance. On some operating systems, the very fsync() system call that should wait until all unwritten data for a file has been flushed might actually return before the data has been flushed to stable storage. Because of this, an operating system crash or a power outage may destroy recently committed data, or in the worst case, even corrupt the database because of write operations having been reordered. If data integrity is important to you, perform some pull-the-plug tests before using anything in production. On OS X 10.3 and higher, InnoDB uses a special fcntl() file flush method. Under Linux, it is advisable to disable the write-back cache.

    On ATA/SATA disk drives, a command such hdparm -W0 /dev/hda may work to disable the write-back cache. Beware that some drives or disk controllers may be unable to disable the write-back cache.

  • With regard to InnoDB recovery capabilities that protect user data, InnoDB uses a file flush technique involving a structure called the doublewrite buffer, which is enabled by default (innodb_doublewrite=ON). The doublewrite buffer adds safety to recovery following a crash or power outage, and improves performance on most varieties of Unix by reducing the need for fsync() operations. It is recommended that the innodb_doublewrite option remains enabled if you are concerned with data integrity or possible failures. For additional information about the doublewrite buffer, see 15.11.1, “InnoDB Disk I/O”.

  • Before using NFS with InnoDB, review potential issues outlined in Using NFS with MySQL.

系统表空间数据文件配置

系统表空间数据文件通过 innodb_data_file_path 和 innodb_data_home_dir 配置选项进行配置。

innodb_data_file_path 配置选项用于配置 InnoDB 系统表空间数据文件。innodb_data_file_path 的值应该是一个或多个数据文件规格的列表。如果指定两个或更多数据文件,用分号 (;) 字符将其分开:

innodb_data_file_path=datafile_spec1[;datafile_spec2]...

例如,下面的设置明确的创建一个最小的系统表空间:

[mysqld]
innodb_data_file_path=ibdata1:12M:autoextend

This setting configures a single 12MB data file named ibdata1 that is auto-extending. No location for the file is given, so by default, InnoDB creates it in the MySQL data directory.

文件大小使用后缀 K、M 或 G 指定,分别表示 KB、MB 或 GB。

A tablespace containing a fixed-size 50MB data file named ibdata1 and a 50MB auto-extending file named ibdata2 in the data directory can be configured like this:

[mysqld]
innodb_data_file_path=ibdata1:50M;ibdata2:50M:autoextend

The full syntax for a data file specification includes the file name, its size, and several optional attributes:

file_name:file_size[:autoextend[:max:max_file_size]]

The autoextend and max attributes can be used only for the last data file in the innodb_data_file_path line.
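To illustrate how such a specification decomposes, here is a minimal, hypothetical parser for the file_name:file_size[:autoextend[:max:max_file_size]] syntax (a sketch for explanation only, not part of MySQL; it assumes well-formed input):

```python
# Hypothetical parser for an innodb_data_file_path value (illustrative only).
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

def parse_size(spec):
    """Convert a size such as '12M' or '50M' to bytes."""
    if spec[-1].upper() in UNITS:
        return int(spec[:-1]) * UNITS[spec[-1].upper()]
    return int(spec)

def parse_data_file_path(value):
    """Parse 'file:size[:autoextend[:max:max_size]]' specs separated by ';'."""
    files = []
    for spec in value.split(";"):
        parts = spec.split(":")
        entry = {"name": parts[0], "size": parse_size(parts[1]),
                 "autoextend": False, "max": None}
        if "autoextend" in parts:
            entry["autoextend"] = True
        if "max" in parts:
            entry["max"] = parse_size(parts[parts.index("max") + 1])
        files.append(entry)
    return files

print(parse_data_file_path("ibdata1:12M:autoextend:max:500M"))
```

Running it on "ibdata1:12M:autoextend:max:500M" yields one entry with a 12MB initial size, autoextend enabled, and a 500MB cap, mirroring the examples in this section.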

If you specify the autoextend option for the last data file, InnoDB extends the data file if it runs out of free space in the tablespace. The increment is 64MB at a time by default. To modify the increment, change the innodb_autoextend_increment system variable.

If the disk becomes full, you might want to add another data file on another disk. For tablespace reconfiguration instructions, see 15.7.1, “Resizing the InnoDB System Tablespace”.

InnoDB is not aware of the file system maximum file size, so be cautious on file systems where the maximum file size is a small value such as 2GB. To specify a maximum size for an auto-extending data file, use the max attribute following the autoextend attribute. Use the max attribute only in cases where constraining disk usage is of critical importance, because exceeding the maximum size causes a fatal error, possibly including a crash. The following configuration permits ibdata1 to grow up to a limit of 500MB:

[mysqld]
innodb_data_file_path=ibdata1:12M:autoextend:max:500M

InnoDB creates tablespace files in the MySQL data directory by default (datadir). To specify a location explicitly, use the innodb_data_home_dir option. For example, to create two files named ibdata1 and ibdata2 in a directory named myibdata, configure InnoDB like this:

[mysqld]
innodb_data_home_dir = /path/to/myibdata/
innodb_data_file_path=ibdata1:50M;ibdata2:50M:autoextend
Note

A trailing slash is required when specifying a value for innodb_data_home_dir.

InnoDB does not create directories, so make sure that the myibdata directory exists before you start the server. Use the Unix or DOS mkdir command to create any necessary directories.

Make sure that the MySQL server has the proper access rights to create files in the data directory. More generally, the server must have access rights in any directory where it needs to create data files.

InnoDB forms the directory path for each data file by textually concatenating the value of innodb_data_home_dir to the data file name. If the innodb_data_home_dir option is not specified in my.cnf at all, the default value is the dot directory ./, which means the MySQL data directory. (The MySQL server changes its current working directory to its data directory when it begins executing.)

If you specify innodb_data_home_dir as an empty string, you can specify absolute paths for the data files listed in the innodb_data_file_path value. The following example is equivalent to the preceding one:

[mysqld]
innodb_data_home_dir =
innodb_data_file_path=/path/to/myibdata/ibdata1:50M;/path/to/myibdata/ibdata2:50M:autoextend

InnoDB Log File Configuration

By default, InnoDB creates two log files named ib_logfile0 and ib_logfile1, each 48MB in size, in the MySQL data directory (datadir).

The following options can be used to modify the default configuration:

  • innodb_log_group_home_dir defines directory path to the InnoDB log files (the redo logs). If this option is not configured, InnoDB log files are created in the MySQL data directory (datadir).

    You might use this option to place InnoDB log files in a different physical storage location than InnoDB data files to avoid potential I/O resource conflicts. For example:

    [mysqld]
    innodb_log_group_home_dir = /dr3/iblogs
    
    Note

    InnoDB does not create directories, so make sure that the log directory exists before you start the server. Use the Unix or DOS mkdir command to create any necessary directories.

    Make sure that the MySQL server has the proper access rights to create files in the log directory. More generally, the server must have access rights in any directory where it needs to create log files.

  • innodb_log_files_in_group defines the number of log files in the log group. The default and recommended value is 2.

  • innodb_log_file_size defines the size of each log file in the log group. The combined size of the log files (innodb_log_file_size * innodb_log_files_in_group) cannot exceed a maximum value that is slightly less than 512GB. A pair of 255 GB log files, for example, approaches the limit but does not exceed it. The default log file size is 48MB. Generally, the combined size of the log files should be large enough that the server can smooth out peaks and troughs in workload activity, which often means that there is enough redo log space to handle more than an hour of write activity. The larger the value, the less checkpoint flush activity is required in the buffer pool, saving disk I/O. See Section 8.5.4, “Optimizing InnoDB Redo Logging”.

InnoDB Undo Tablespace Configuration

By default, InnoDB undo logs are part of the system tablespace. However, you can choose to store InnoDB undo logs in one or more separate undo tablespaces, typically on a different storage device.

The innodb_undo_directory configuration option defines the path where InnoDB creates separate tablespaces for the undo logs. This option is typically used in conjunction with the innodb_rollback_segments and innodb_undo_tablespaces options, which determine the disk layout of the undo logs outside the system tablespace.

For more information, see Section 15.7.7, “Configuring Undo Tablespaces”.

InnoDB Temporary Tablespace Configuration

By default, InnoDB creates a single auto-extending temporary tablespace data file named ibtmp1 that is slightly larger than 12MB in the innodb_data_home_dir directory. The default temporary tablespace data file configuration can be modified at startup using the innodb_temp_data_file_path configuration option.

The innodb_temp_data_file_path option specifies the path, file name, and file size for the InnoDB temporary tablespace data file. The full directory path for a file is formed by concatenating innodb_data_home_dir to the path specified by innodb_temp_data_file_path. File size is specified in KB, MB, or GB (1024MB) by appending K, M, or G to the size value. The file size, in total, must be larger than 12MB.

The default value of innodb_data_home_dir is the MySQL data directory (datadir).

InnoDB Page Size Configuration

The innodb_page_size option specifies the page size for all InnoDB tablespaces in a MySQL instance. This value is set when the instance is created and remains constant afterward. Valid values are 64k, 32k, 16k (the default), 8k, and 4k. Alternatively, you can specify page size in bytes (65536, 32768, 16384, 8192, 4096).

The default 16k page size is appropriate for a wide range of workloads, particularly for queries involving table scans and DML operations involving bulk updates. Smaller page sizes might be more efficient for OLTP workloads involving many small writes, where contention can be an issue when a single page contains many rows. Smaller pages can also be efficient for SSD storage devices, which typically use small block sizes. Keeping the InnoDB page size close to the storage device block size minimizes the amount of unchanged data that is rewritten to disk.

InnoDB Memory Configuration

MySQL allocates memory to various caches and buffers to improve performance of database operations. When allocating memory for InnoDB, always consider memory required by the operating system, memory allocated to other applications, and memory allocated for other MySQL buffers and caches. For example, if you use MyISAM tables, consider the amount of memory allocated for the key buffer (key_buffer_size). For an overview of MySQL buffers and caches, see Section 8.12.3.1, “How MySQL Uses Memory”.

InnoDB-specific memory areas, such as the buffer pool, are configured using the following parameters:

Warning

On 32-bit GNU/Linux x86 platforms, be careful not to set memory usage too high. glibc may permit the process heap to grow over thread stacks, which can crash your server. It is a risk if the amount of memory allocated to the mysqld process for global and per-thread buffers and caches is close to or exceeds 2GB.

A formula similar to the following can be used to calculate MySQL memory allocation and estimate MySQL memory usage. You may need to modify the formula to account for buffers and caches in your MySQL version and configuration. For an overview of MySQL buffers and caches, see Section 8.12.3.1, “How MySQL Uses Memory”.

innodb_buffer_pool_size
+ key_buffer_size
+ max_connections*(sort_buffer_size+read_buffer_size+binlog_cache_size)
+ max_connections*2MB

Each thread uses a stack (often 2MB, but only 256KB in MySQL binary distributions provided by Oracle Corporation) and in the worst case also uses sort_buffer_size + read_buffer_size additional memory.
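To see how the formula adds up, the following sketch evaluates it for a hypothetical configuration (all values below are illustrative assumptions, not recommendations):

```python
# Hypothetical configuration values (bytes), for illustration only.
MB = 1024 ** 2
config = {
    "innodb_buffer_pool_size": 2048 * MB,
    "key_buffer_size": 8 * MB,
    "max_connections": 100,
    "sort_buffer_size": 256 * 1024,
    "read_buffer_size": 128 * 1024,
    "binlog_cache_size": 32 * 1024,
}

def estimated_memory(cfg, per_thread=2 * MB):
    """Worst-case estimate following the formula above."""
    per_connection = (cfg["sort_buffer_size"]
                      + cfg["read_buffer_size"]
                      + cfg["binlog_cache_size"])
    return (cfg["innodb_buffer_pool_size"]
            + cfg["key_buffer_size"]
            + cfg["max_connections"] * per_connection
            + cfg["max_connections"] * per_thread)

print(estimated_memory(config) // MB, "MB")
```

Because the per-connection buffers are multiplied by max_connections, raising the connection limit inflates the worst-case estimate even if most connections never allocate those buffers.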

On Linux, if the kernel is enabled for large page support, InnoDB can use large pages to allocate memory for its buffer pool. See Section 8.12.3.2, “Enabling Large Page Support”.

15.6.2 Configuring InnoDB for Read-Only Operation

You can query InnoDB tables where the MySQL data directory is on read-only media by enabling the --innodb-read-only configuration option at server startup.

How to Enable

To prepare an instance for read-only operation, make sure all the necessary information is flushed to the data files before storing it on the read-only medium. Run the server with change buffering disabled (innodb_change_buffering=0) and do a slow shutdown.

To enable read-only mode for an entire MySQL instance, specify the following configuration options at server startup:

  • --innodb-read-only=1

  • If the instance is on read-only media such as a DVD or CD, or the /var directory is not writeable by all: --pid-file=path_on_writeable_media and --event-scheduler=disabled

As of MySQL 8.0, enabling innodb_read_only prevents table creation and drop operations for all storage engines. These operations modify data dictionary tables in the mysql system database, but those tables use the InnoDB storage engine and cannot be modified when innodb_read_only is enabled. The same restriction applies to any operation that modifies data dictionary tables, such as ANALYZE TABLE and ALTER TABLE tbl_name ENGINE=engine_name.

In addition, other tables in the mysql system database use the InnoDB storage engine in MySQL 8.0. Making those tables read only results in restrictions on operations that modify them. For example, CREATE USER, GRANT, REVOKE, and INSTALL PLUGIN operations are not permitted in read-only mode.

Usage Scenarios

This mode of operation is appropriate in situations such as:

  • Distributing a MySQL application, or a set of MySQL data, on a read-only storage medium such as a DVD or CD.

  • Multiple MySQL instances querying the same data directory simultaneously, typically in a data warehousing configuration. You might use this technique to avoid bottlenecks that can occur with a heavily loaded MySQL instance, or you might use different configuration options for the various instances to tune each one for particular kinds of queries.

  • Querying data that has been put into a read-only state for security or data integrity reasons, such as archived backup data.

Note

This feature is mainly intended for flexibility in distribution and deployment, rather than raw performance based on the read-only aspect. See 8.5.3 节, “Optimizing InnoDB Read-Only Transactions” for ways to tune the performance of read-only queries, which do not require making the entire server read-only.

How It Works

When the server is run in read-only mode with the --innodb-read-only option, certain InnoDB features and components are reduced or switched off entirely:

  • No change buffering is done, in particular no merges from the change buffer. To make sure the change buffer is empty when you prepare the instance for read-only operation, disable change buffering (innodb_change_buffering=0) and do a slow shutdown first.

  • There is no crash recovery phase at server startup. The instance must have performed a slow shutdown before being put into the read-only state.

  • Because the redo log is not used in read-only operation, you can set innodb_log_file_size to the smallest size possible (1 MB) before making the instance read-only.

  • All background threads other than I/O read threads are turned off. As a consequence, a read-only instance cannot encounter any deadlocks.

  • Deadlock information, monitor output, and so on is not written to temporary files. As a consequence, SHOW ENGINE INNODB STATUS does not produce any output.

  • Changes to configuration option settings that would normally change the behavior of write operations have no effect when the server is in read-only mode.

  • The MVCC processing to enforce isolation levels is turned off. All queries read the latest version of a record, because update and delete operations are not possible.

  • The undo log is not used. Disable any settings for the innodb_undo_tablespaces and innodb_undo_directory configuration options.

15.6.3 InnoDB Buffer Pool Configuration

This section provides configuration and tuning information for the InnoDB buffer pool.

15.6.3.1 The InnoDB Buffer Pool

InnoDB maintains a storage area called the buffer pool for caching data and indexes in memory. Knowing how the InnoDB buffer pool works, and taking advantage of it to keep frequently accessed data in memory, is an important aspect of MySQL tuning. For information about how the InnoDB buffer pool works, see InnoDB Buffer Pool LRU Algorithm.

You can configure the various aspects of the InnoDB buffer pool to improve performance.

InnoDB Buffer Pool LRU Algorithm

InnoDB manages the buffer pool as a list, using a variation of the least recently used (LRU) algorithm. When room is needed to add a new page to the pool, InnoDB evicts the least recently used page and adds the new page to the middle of the list. This midpoint insertion strategy treats the list as two sublists:

  • At the head, a sublist of new (or young) pages that were accessed recently.

  • At the tail, a sublist of old pages that were accessed less recently.

This algorithm keeps pages that are heavily used by queries in the new sublist. The old sublist contains less-used pages; these pages are candidates for eviction.

The LRU algorithm operates as follows by default:

  • 3/8 of the buffer pool is devoted to the old sublist.

  • The midpoint of the list is the boundary where the tail of the new sublist meets the head of the old sublist.

  • When InnoDB reads a page into the buffer pool, it initially inserts it at the midpoint (the head of the old sublist). A page can be read in because it is required for a user-specified operation such as an SQL query, or as part of a read-ahead operation performed automatically by InnoDB.

  • Accessing a page in the old sublist makes it young, moving it to the head of the buffer pool (the head of the new sublist). If the page was read in because it was required, the first access occurs immediately and the page is made young. If the page was read in due to read-ahead, the first access does not occur immediately (and might not occur at all before the page is evicted).

  • As the database operates, pages in the buffer pool that are not accessed age by moving toward the tail of the list. Pages in both the new and old sublists age as other pages are made new. Pages in the old sublist also age as pages are inserted at the midpoint. Eventually, a page that remains unused for long enough reaches the tail of the old sublist and is evicted.

By default, pages read by queries are immediately moved into the new sublist, meaning they stay in the buffer pool longer. A table scan (such as performed for a mysqldump operation, or a SELECT statement with no WHERE clause) can bring a large amount of data into the buffer pool and evict an equivalent amount of older data, even if the new data is never used again. Similarly, pages that are loaded by the read-ahead background thread and then accessed only once are moved to the head of the new list. These situations can push frequently used pages to the old sublist, where they become subject to eviction. For information about optimizing this behavior, see Section 15.6.3.4, “Making the Buffer Pool Scan Resistant”, and Section 15.6.3.5, “Configuring InnoDB Buffer Pool Prefetching (Read-Ahead)”.
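The midpoint-insertion behavior described above can be modeled with a toy sketch (this is an illustration of the algorithm as described, not InnoDB's actual implementation; the 3/8 old-sublist share and promotion-on-access rule follow the text):

```python
from collections import deque

class MidpointLRU:
    """Toy model of the midpoint-insertion LRU described above."""

    def __init__(self, capacity, old_ratio=3 / 8):
        self.capacity = capacity
        self.old_target = max(1, int(capacity * old_ratio))
        self.new = deque()  # "young" sublist; head = most recently used
        self.old = deque()  # "old" sublist; tail = next eviction victim

    def read(self, page):
        if page in self.new:
            self.new.remove(page)
            self.new.appendleft(page)   # already young: move to the front
        elif page in self.old:
            self.old.remove(page)
            self.new.appendleft(page)   # first access makes the page young
        else:
            if len(self.new) + len(self.old) >= self.capacity:
                self.old.pop()          # evict the least recently used page
            self.old.appendleft(page)   # insert at the midpoint
        # Age pages from the tail of the new sublist into the old sublist
        # so the old sublist stays near its target share of the pool.
        while len(self.old) < self.old_target and self.new:
            self.old.appendleft(self.new.pop())

pool = MidpointLRU(capacity=8)
for p in range(8):
    pool.read(p)      # a "table scan": every page lands in the old sublist
pool.read(0)          # page 0 is accessed again and is made young
print(list(pool.new), list(pool.old))
```

Note how the scanned pages never reach the new sublist unless they are touched a second time, which is exactly what makes one-pass scans cheap to evict.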

InnoDB Standard Monitor output contains several fields in the BUFFER POOL AND MEMORY section that pertain to operation of the buffer pool LRU algorithm. For details, see Section 15.6.3.9, “Monitoring the Buffer Pool Using the InnoDB Standard Monitor”.

InnoDB Buffer Pool Configuration Options

Several configuration options affect different aspects of the InnoDB buffer pool.

15.6.3.2 Configuring InnoDB Buffer Pool Size

You can configure InnoDB buffer pool size offline (at startup) or online, while the server is running. This section describes both methods. For additional information about configuring buffer pool size online, see Configuring InnoDB Buffer Pool Size Online.

When you increase or decrease innodb_buffer_pool_size, the operation is performed in chunks. Chunk size is defined by the innodb_buffer_pool_chunk_size configuration option, which has a default of 128M. For more information, see Configuring InnoDB Buffer Pool Chunk Size.

Buffer pool size must always be equal to or a multiple of innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances. If you configure innodb_buffer_pool_size to a value that is not equal to or a multiple of innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances, buffer pool size is automatically adjusted to a value that is equal to or a multiple of innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances and is not less than the specified buffer pool size.

In the following example, innodb_buffer_pool_size is set to 8G, and innodb_buffer_pool_instances is set to 16. innodb_buffer_pool_chunk_size is 128M, which is the default value.

8G is a valid innodb_buffer_pool_size value because it is a multiple of innodb_buffer_pool_instances=16 * innodb_buffer_pool_chunk_size=128M, which is 2G.

shell> mysqld --innodb_buffer_pool_size=8G --innodb_buffer_pool_instances=16

mysql> SELECT @@innodb_buffer_pool_size/1024/1024/1024;
+------------------------------------------+
| @@innodb_buffer_pool_size/1024/1024/1024 |
+------------------------------------------+
|                           8.000000000000 |
+------------------------------------------+

In this example, innodb_buffer_pool_size is set to 9G, and innodb_buffer_pool_instances is set to 16. innodb_buffer_pool_chunk_size is 128M, which is the default value. In this case, 9G is not a multiple of innodb_buffer_pool_instances=16 * innodb_buffer_pool_chunk_size=128M, so innodb_buffer_pool_size is adjusted to 10G, the next multiple of innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances that is not less than the specified buffer pool size.

shell> mysqld --innodb_buffer_pool_size=9G --innodb_buffer_pool_instances=16

mysql> SELECT @@innodb_buffer_pool_size/1024/1024/1024;
+------------------------------------------+
| @@innodb_buffer_pool_size/1024/1024/1024 |
+------------------------------------------+
|                          10.000000000000 |
+------------------------------------------+
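The adjustment rule illustrated by these two examples amounts to rounding the requested size up to the nearest multiple of innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances, which can be sketched as (an explanatory model, not MySQL source code):

```python
def adjusted_buffer_pool_size(requested, chunk_size, instances):
    """Round the requested size up to the nearest multiple of
    chunk_size * instances, mirroring the adjustment described above."""
    unit = chunk_size * instances
    return ((requested + unit - 1) // unit) * unit

GB = 1024 ** 3
CHUNK_128M = 128 * 1024 ** 2

# Reproduces the examples in this section (16 instances, 128M chunks):
print(adjusted_buffer_pool_size(8 * GB, CHUNK_128M, 16) // GB)   # 8
print(adjusted_buffer_pool_size(9 * GB, CHUNK_128M, 16) // GB)   # 10
```

An 8G request is already a multiple of the 2G unit and is kept as-is, while a 9G request is rounded up to 10G.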
Configuring InnoDB Buffer Pool Chunk Size

innodb_buffer_pool_chunk_size can be increased or decreased in units of 1MB (1048576 bytes), but can only be modified at startup, on the command line or in a MySQL configuration file.

Command line:

shell> mysqld --innodb_buffer_pool_chunk_size=134217728

Configuration file:

[mysqld]
innodb_buffer_pool_chunk_size=134217728

The following conditions apply when altering innodb_buffer_pool_chunk_size:

  • If the new innodb_buffer_pool_chunk_size value * innodb_buffer_pool_instances is larger than the current buffer pool size when the buffer pool is initialized, innodb_buffer_pool_chunk_size is truncated to innodb_buffer_pool_size / innodb_buffer_pool_instances.

    For example, if the buffer pool is initialized with a size of 2GB (2147483648 bytes), 4 buffer pool instances, and a chunk size of 1GB (1073741824 bytes), chunk size is truncated to a value equal to innodb_buffer_pool_size / innodb_buffer_pool_instances, as shown below:

    shell> mysqld --innodb_buffer_pool_size=2147483648 --innodb_buffer_pool_instances=4
    --innodb_buffer_pool_chunk_size=1073741824;
    
    mysql> SELECT @@innodb_buffer_pool_size;
    +---------------------------+
    | @@innodb_buffer_pool_size |
    +---------------------------+
    |                2147483648 |
    +---------------------------+
    
    mysql> SELECT @@innodb_buffer_pool_instances;
    +--------------------------------+
    | @@innodb_buffer_pool_instances |
    +--------------------------------+
    |                              4 |
    +--------------------------------+
    
    # The chunk size was set to 1GB (1073741824 bytes) at startup but was
    # truncated to innodb_buffer_pool_size / innodb_buffer_pool_instances
    
    mysql> SELECT @@innodb_buffer_pool_chunk_size;
    +---------------------------------+
    | @@innodb_buffer_pool_chunk_size |
    +---------------------------------+
    |                       536870912 |
    +---------------------------------+
  • Buffer pool size must always be equal to or a multiple of innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances. If you alter innodb_buffer_pool_chunk_size, innodb_buffer_pool_size is automatically adjusted to a value that is equal to or a multiple of innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances and is not less than the current buffer pool size. The adjustment occurs when the buffer pool is initialized. The following examples demonstrate this behavior:

    # The buffer pool has a default size of 128MB (134217728 bytes)
    
    mysql> SELECT @@innodb_buffer_pool_size;
    +---------------------------+
    | @@innodb_buffer_pool_size |
    +---------------------------+
    |                 134217728 |
    +---------------------------+
    
    # The chunk size is also 128MB (134217728 bytes)
    
    mysql> SELECT @@innodb_buffer_pool_chunk_size;
    +---------------------------------+
    | @@innodb_buffer_pool_chunk_size |
    +---------------------------------+
    |                       134217728 |
    +---------------------------------+
    
    # There is a single buffer pool instance
    
    mysql> SELECT @@innodb_buffer_pool_instances;
    +--------------------------------+
    | @@innodb_buffer_pool_instances |
    +--------------------------------+
    |                              1 |
    +--------------------------------+
    
    # Chunk size is decreased by 1MB (1048576 bytes) at startup
    # (134217728 - 1048576 = 133169152):
    
    shell> mysqld --innodb_buffer_pool_chunk_size=133169152
    
    mysql> SELECT @@innodb_buffer_pool_chunk_size;
    +---------------------------------+
    | @@innodb_buffer_pool_chunk_size |
    +---------------------------------+
    |                       133169152 |
    +---------------------------------+
    
    # Buffer pool size increases from 134217728 to 266338304.
    # Buffer pool size is automatically adjusted to a value that is equal to
    # or a multiple of innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances
    # and is not less than the current buffer pool size.
    
    mysql> SELECT @@innodb_buffer_pool_size;
    +---------------------------+
    | @@innodb_buffer_pool_size |
    +---------------------------+
    |                 266338304 |
    +---------------------------+

    This example demonstrates the same behavior with multiple buffer pool instances:

    # The buffer pool has a default size of 2GB (2147483648 bytes)
    
    mysql> SELECT @@innodb_buffer_pool_size;
    +---------------------------+
    | @@innodb_buffer_pool_size |
    +---------------------------+
    |                2147483648 |
    +---------------------------+
    
    # The chunk size is .5 GB (536870912 bytes)
    
    mysql> SELECT @@innodb_buffer_pool_chunk_size;
    +---------------------------------+
    | @@innodb_buffer_pool_chunk_size |
    +---------------------------------+
    |                       536870912 |
    +---------------------------------+
    
    # There are 4 buffer pool instances
    
    mysql> SELECT @@innodb_buffer_pool_instances;
    +--------------------------------+
    | @@innodb_buffer_pool_instances |
    +--------------------------------+
    |                              4 |
    +--------------------------------+
    
    # Chunk size is decreased by 1MB (1048576 bytes) at startup
    # (536870912 - 1048576 = 535822336):
    
    shell> mysqld --innodb_buffer_pool_chunk_size=535822336
    
    mysql> SELECT @@innodb_buffer_pool_chunk_size;
    +---------------------------------+
    | @@innodb_buffer_pool_chunk_size |
    +---------------------------------+
    |                       535822336 |
    +---------------------------------+
    
    # Buffer pool size increases from 2147483648 to 4286578688.
    # Buffer pool size is automatically adjusted to a value that is equal to
    # or a multiple of innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances
    # and is not less than the current buffer pool size of 2147483648.
    
    mysql> SELECT @@innodb_buffer_pool_size;
    +---------------------------+
    | @@innodb_buffer_pool_size |
    +---------------------------+
    |                4286578688 |
    +---------------------------+

    As the examples above show, altering innodb_buffer_pool_chunk_size can increase buffer pool size. Before altering innodb_buffer_pool_chunk_size, calculate its effect on innodb_buffer_pool_size to ensure that the resulting buffer pool size is acceptable.

Note

To avoid potential performance issues, the number of chunks (innodb_buffer_pool_size / innodb_buffer_pool_chunk_size) should not exceed 1000.
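The two adjustments described in this section, chunk-size truncation followed by buffer pool rounding, can be summarized in a small sketch that reproduces the numbers from the examples above (an explanatory model, not MySQL source code):

```python
def effective_chunk_and_pool(pool_size, chunk_size, instances):
    """Sketch of the two adjustments described above:
    1. chunk_size is truncated to pool_size / instances if it is too large;
    2. pool_size is rounded up to a multiple of chunk_size * instances."""
    if chunk_size * instances > pool_size:
        chunk_size = pool_size // instances
    unit = chunk_size * instances
    pool_size = ((pool_size + unit - 1) // unit) * unit
    return chunk_size, pool_size

# 2GB pool, 4 instances, 1GB chunk: the chunk is truncated to 512MB.
print(effective_chunk_and_pool(2147483648, 1073741824, 4))
# 128MB pool, 1 instance, chunk lowered to 133169152: the pool grows to 266338304.
print(effective_chunk_and_pool(134217728, 133169152, 1))
```

Both calls reproduce the values shown in the SELECT output of the preceding examples, which can be a quick sanity check before restarting a server with new settings.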

Configuring InnoDB Buffer Pool Size Online

The innodb_buffer_pool_size configuration option can be set dynamically using a SET statement, allowing you to resize the buffer pool without restarting the server. For example:

mysql> SET GLOBAL innodb_buffer_pool_size=402653184;

Active transactions and operations performed through InnoDB APIs should be completed before resizing the buffer pool. When initiating a resizing operation, the operation does not start until all active transactions are completed. Once the resizing operation is in progress, new transactions and operations that require access to the buffer pool must wait until the resizing operation finishes. The exception to the rule is that concurrent access to the buffer pool is permitted while the buffer pool is defragmented and pages are withdrawn when buffer pool size is decreased. A drawback of allowing concurrent access is that it could result in a temporary shortage of available pages while pages are being withdrawn.

Note

Nested transactions could fail if initiated after the buffer pool resizing operation begins.

Monitoring Online Buffer Pool Resizing Progress

The Innodb_buffer_pool_resize_status status variable reports buffer pool resizing progress. For example:

mysql> SHOW STATUS WHERE Variable_name='InnoDB_buffer_pool_resize_status';
+----------------------------------+----------------------------------+
| Variable_name                    | Value                            |
+----------------------------------+----------------------------------+
| Innodb_buffer_pool_resize_status | Resizing also other hash tables. |
+----------------------------------+----------------------------------+

Buffer pool resizing progress is also logged in the server error log. The following messages are logged when increasing the size of the buffer pool:

[Note] InnoDB: Resizing buffer pool from 134217728 to 4294967296. (unit=134217728)
[Note] InnoDB: disabled adaptive hash index.
[Note] InnoDB: buffer pool 0 : 31 chunks (253952 blocks) was added.
[Note] InnoDB: buffer pool 0 : hash tables were resized.
[Note] InnoDB: Resized hash tables at lock_sys, adaptive hash index, dictionary.
[Note] InnoDB: completed to resize buffer pool from 134217728 to 4294967296.
[Note] InnoDB: re-enabled adaptive hash index.

The following messages are logged when decreasing the size of the buffer pool:

[Note] InnoDB: Resizing buffer pool from 4294967296 to 134217728. (unit=134217728)
[Note] InnoDB: disabled adaptive hash index.
[Note] InnoDB: buffer pool 0 : start to withdraw the last 253952 blocks.
[Note] InnoDB: buffer pool 0 : withdrew 253952 blocks from free list. tried to relocate 0 pages.
(253952/253952)
[Note] InnoDB: buffer pool 0 : withdrawn target 253952 blocks.
[Note] InnoDB: buffer pool 0 : 31 chunks (253952 blocks) was freed.
[Note] InnoDB: buffer pool 0 : hash tables were resized.
[Note] InnoDB: Resized hash tables at lock_sys, adaptive hash index, dictionary.
[Note] InnoDB: completed to resize buffer pool from 4294967296 to 134217728.
[Note] InnoDB: re-enabled adaptive hash index.
Internals

The resizing operation is performed by a background thread. When increasing the size of the buffer pool, the resizing operation:

  • Adds pages in chunks (chunk size is defined by innodb_buffer_pool_chunk_size)

  • Converts hash tables, lists, and pointers to use new addresses in memory

  • Adds new pages to the free list

While these operations are in progress, other threads are blocked from accessing the buffer pool.

When decreasing the size of the buffer pool, the resizing operation:

  • Defragments the buffer pool and withdraws (frees) pages

  • Removes pages in chunks (chunk size is defined by innodb_buffer_pool_chunk_size)

  • Converts hash tables, lists, and pointers to use new addresses in memory

Of these operations, only defragmenting the buffer pool and withdrawing pages allow other threads to access the buffer pool concurrently.

15.6.3.3 Configuring Multiple Buffer Pool Instances

For systems with buffer pools in the multi-gigabyte range, dividing the buffer pool into separate instances can improve concurrency, by reducing contention as different threads read and write to cached pages. This feature is typically intended for systems with a buffer pool size in the multi-gigabyte range. Multiple buffer pool instances are configured using the innodb_buffer_pool_instances configuration option, and you might also adjust the innodb_buffer_pool_size value.

When the InnoDB buffer pool is large, many data requests can be satisfied by retrieving from memory. You might encounter bottlenecks from multiple threads trying to access the buffer pool at once. You can enable multiple buffer pools to minimize this contention. Each page that is stored in or read from the buffer pool is assigned to one of the buffer pools randomly, using a hashing function. Each buffer pool manages its own free lists, flush lists, LRUs, and all other data structures connected to a buffer pool. Prior to MySQL 8.0, each buffer pool was protected by its own buffer pool mutex. In MySQL 8.0 and later, the buffer pool mutex was replaced by several list and hash protecting mutexes, to reduce contention.

To enable multiple buffer pool instances, set the innodb_buffer_pool_instances configuration option to a value greater than 1 (the default) up to 64 (the maximum). This option takes effect only when you set innodb_buffer_pool_size to a size of 1GB or more. For best efficiency, specify a combination of innodb_buffer_pool_instances and innodb_buffer_pool_size so that each buffer pool instance is at least 1GB.

For information about modifying the InnoDB buffer pool, see Section 15.6.3.2, “Configuring InnoDB Buffer Pool Size”.

15.6.3.4 Making the Buffer Pool Scan Resistant

Rather than using a strict LRU algorithm, InnoDB uses a technique to minimize the amount of data that is brought into the buffer pool and never accessed again. The goal is to make sure that frequently accessed (hot) pages remain in the buffer pool, even as read-ahead and full table scans bring in new blocks that might or might not be accessed afterward.

Newly read blocks are inserted into the middle of the LRU list. All newly read pages are inserted at a location that by default is 3/8 from the tail of the LRU list. The pages are moved to the front of the list (the most-recently used end) when they are accessed in the buffer pool for the first time. Thus, pages that are never accessed never make it to the front portion of the LRU list, and age out sooner than with a strict LRU approach. This arrangement divides the LRU list into two segments, where the pages downstream of the insertion point are considered old and are desirable victims for LRU eviction.

For an explanation of the inner workings of the InnoDB buffer pool and details about the LRU algorithm, see Section 15.6.3.1, “The InnoDB Buffer Pool”.

You can control the insertion point in the LRU list and choose whether InnoDB applies the same optimization to blocks brought into the buffer pool by table or index scans. The configuration parameter innodb_old_blocks_pct controls the percentage of old blocks in the LRU list. The default value of innodb_old_blocks_pct is 37, corresponding to the original fixed ratio of 3/8. The value range is 5 (new pages in the buffer pool age out very quickly) to 95 (only 5% of the buffer pool is reserved for hot pages, making the algorithm close to the familiar LRU strategy).

The optimization that keeps the buffer pool from being churned by read-ahead can avoid similar problems due to table or index scans. In these scans, a data page is typically accessed a few times in quick succession and is never touched again. The configuration parameter innodb_old_blocks_time specifies the time window (in milliseconds) after the first access to a page during which it can be accessed without being moved to the front (most-recently used end) of the LRU list. The default value of innodb_old_blocks_time is 1000. Increasing this value makes more and more blocks likely to age out faster from the buffer pool.

Both the innodb_old_blocks_pct and innodb_old_blocks_time parameters can be changed dynamically. They are global parameters and can be specified in the MySQL option file (my.cnf or my.ini) or changed at runtime with the SET GLOBAL statement. Changing the setting requires the SYSTEM_VARIABLES_ADMIN or SUPER privilege.

To help you gauge the effectiveness of setting these parameters, the SHOW ENGINE INNODB STATUS command reports buffer pool statistics. For details, see Section 15.6.3.9, “Monitoring the Buffer Pool Using the InnoDB Standard Monitor”.

Because the effects of these parameters can vary widely based on your hardware configuration, your data, and the details of your workload, always benchmark to verify the effectiveness before changing these settings in any performance-critical or production environment.

In mixed workloads where most of the activity is OLTP type with periodic batch reporting queries that result in large scans, setting the value of innodb_old_blocks_time during the batch runs can help keep the working set of the normal workload in the buffer pool.

When scanning large tables that cannot fit entirely in the buffer pool, setting innodb_old_blocks_pct to a small value keeps the data that is only read once from consuming a significant portion of the buffer pool. For example, setting innodb_old_blocks_pct=5 restricts this data that is only read once to 5% of the buffer pool.

When scanning small tables that do fit into memory, there is less overhead for moving pages around within the buffer pool, so you can leave innodb_old_blocks_pct at its default value, or even higher, such as innodb_old_blocks_pct=50.

The effect of the innodb_old_blocks_time parameter is harder to predict than the innodb_old_blocks_pct parameter, is relatively small, and varies more with the workload. To arrive at an optimal value, conduct your own benchmarks if the performance improvement from adjusting innodb_old_blocks_pct is not sufficient.

15.6.3.5 Configuring InnoDB Buffer Pool Prefetching (Read-Ahead)

A read-ahead request is an I/O request to prefetch multiple pages into the buffer pool asynchronously, in anticipation that these pages will be needed soon. The requests bring in all the pages in one extent. InnoDB uses two read-ahead algorithms to improve I/O performance:

Linear read-ahead predicts which pages might be needed soon based on pages in the buffer pool being accessed sequentially. You control when InnoDB performs a read-ahead operation by adjusting the number of sequential page accesses required to trigger an asynchronous read request, using the configuration parameter innodb_read_ahead_threshold. Before this parameter was added, InnoDB would only calculate whether to issue an asynchronous prefetch request for the entire next extent when it read the last page of the current extent.

The configuration parameter innodb_read_ahead_threshold controls how sensitive InnoDB is in detecting patterns of sequential page access. If the number of pages read sequentially from an extent is greater than or equal to innodb_read_ahead_threshold, InnoDB initiates an asynchronous read-ahead operation of the entire following extent. innodb_read_ahead_threshold can be set to any value from 0 to 64. The default value is 56. The higher the value, the more strict the access pattern check. For example, if you set the value to 48, InnoDB triggers a linear read-ahead request only when 48 pages in the current extent have been accessed sequentially. If the value is 8, InnoDB triggers asynchronous read-ahead even if as few as 8 pages in the extent are accessed sequentially. You can set the value of this parameter in the MySQL configuration file, or change it dynamically with the SET GLOBAL statement, which requires the SYSTEM_VARIABLES_ADMIN or SUPER privilege.
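The threshold check described above reduces to a simple comparison; a hypothetical sketch of the decision (not InnoDB source code):

```python
def linear_read_ahead_triggered(sequential_pages, threshold=56):
    """Sketch of the decision described above: a linear read-ahead of the
    next extent is issued once the number of pages accessed sequentially
    in the current extent reaches innodb_read_ahead_threshold (0-64).
    The default of 56 matches the documented default."""
    if not 0 <= threshold <= 64:
        raise ValueError("innodb_read_ahead_threshold must be 0-64")
    return sequential_pages >= threshold

print(linear_read_ahead_triggered(48, threshold=56))  # False
print(linear_read_ahead_triggered(48, threshold=48))  # True
print(linear_read_ahead_triggered(8, threshold=8))    # True
```

This makes the trade-off visible: a lower threshold fires read-ahead on shorter sequential runs, trading extra I/O for a chance that the prefetched pages are needed.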

Random read-ahead predicts when pages might be needed soon based on pages already in the buffer pool, regardless of the order in which those pages were read. If 13 consecutive pages from the same extent are found in the buffer pool, InnoDB asynchronously issues a request to prefetch the remaining pages of the extent. To enable this feature, set the configuration variable innodb_random_read_ahead to ON.

The SHOW ENGINE INNODB STATUS command displays statistics to help you evaluate the effectiveness of the read-ahead algorithm. Statistics include counter information for the following global status variables:

This information can be useful when fine-tuning the innodb_random_read_ahead setting.

For more information about I/O performance, see Section 8.5.8, “Optimizing InnoDB Disk I/O” and Section 8.12.1, “Optimizing Disk I/O”.

15.6.3.6 Configuring InnoDB Buffer Pool Flushing

InnoDB performs certain tasks in the background, including flushing of dirty pages (those pages that have been changed but are not yet written to the database files) from the buffer pool.

InnoDB starts flushing buffer pool pages when the percentage of dirty pages in the buffer pool reaches the low water mark defined by innodb_max_dirty_pages_pct_lwm. This option is intended to control the ratio of dirty pages in the buffer pool, and ideally to prevent the percentage of dirty pages from reaching innodb_max_dirty_pages_pct. If the percentage of dirty pages in the buffer pool exceeds innodb_max_dirty_pages_pct, InnoDB begins to aggressively flush buffer pool pages.

InnoDB uses an algorithm to estimate the required rate of flushing, based on the speed of redo log generation and the current rate of flushing. The intent is to smooth overall performance by ensuring that buffer flush activity keeps up with the need to keep the buffer pool clean. Automatically adjusting the rate of flushing helps avoid sudden dips in throughput that occur when excessive buffer pool flushing limits the I/O capacity available for ordinary read and write activity.

InnoDB uses its log files in a circular fashion. Before reusing a portion of a log file, InnoDB flushes to disk all dirty buffer pool pages whose redo entries are contained in that portion of the log file, a process known as a sharp checkpoint. If a workload is write-intensive, it generates a lot of redo information, all written to the log file. If all available space in the log files is used up, a sharp checkpoint occurs, causing a temporary reduction in throughput. This situation can happen even if innodb_max_dirty_pages_pct is not reached.

InnoDB uses a heuristic-based algorithm to avoid such a scenario, by measuring the number of dirty pages in the buffer pool and the rate at which redo is being generated. Based on these numbers, InnoDB decides how many dirty pages to flush from the buffer pool each second. This self-adapting algorithm is able to deal with sudden changes in workload.

Internal benchmarking has shown that this algorithm not only maintains throughput over time, but can also improve overall throughput significantly.

Because adaptive flushing can significantly affect the I/O pattern of a workload, the innodb_adaptive_flushing configuration parameter lets you turn off this feature. The default value for innodb_adaptive_flushing is ON, enabling the adaptive flushing algorithm. You can set the value of this parameter in the option file or change it dynamically at the command line.

15.6.3.7 Fine-tuning InnoDB Buffer Pool Flushing

The configuration options innodb_flush_neighbors and innodb_lru_scan_depth let you fine-tune certain aspects of the flushing process for the InnoDB buffer pool. These options primarily help write-intensive workloads. With heavy DML activity, flushing can fall behind if it is not aggressive enough, resulting in excessive memory use in the buffer pool; or, disk writes due to flushing can saturate your I/O capacity if that mechanism is too aggressive. The ideal settings depend on your workload, data access patterns, and storage configuration (for example, whether data is stored on HDD or SSD devices).

For systems with constant heavy workloads, or workloads that fluctuate widely, several configuration options let you fine-tune the flushing behavior for InnoDB tables:

These options feed into the formula used by the innodb_adaptive_flushing option.

The innodb_adaptive_flushing, innodb_io_capacity and innodb_max_dirty_pages_pct options are limited or extended by the following options:

The InnoDB adaptive flushing mechanism is not appropriate in all cases. It gives the most benefit when the redo log is in danger of filling up. The innodb_adaptive_flushing_lwm option specifies a low water mark percentage of redo log capacity; when that threshold is crossed, InnoDB turns on adaptive flushing even if not specified by the innodb_adaptive_flushing option.

If flushing activity falls far behind, InnoDB can flush more aggressively than specified by innodb_io_capacity. innodb_io_capacity_max represents an upper limit on the I/O capacity used in such emergency situations, so that the spike in I/O does not consume all the capacity of the server.

InnoDB tries to flush data from the buffer pool so that the percentage of dirty pages does not exceed the value of innodb_max_dirty_pages_pct. The default value for innodb_max_dirty_pages_pct is 75.

Note

The innodb_max_dirty_pages_pct setting establishes a target for flushing activity. It does not affect the rate of flushing. For information about managing the rate of flushing, see Section 15.6.3.6, “Configuring InnoDB Buffer Pool Flushing”.

The innodb_max_dirty_pages_pct_lwm option specifies a low water mark value that represents the percentage of dirty pages where pre-flushing is enabled to control the dirty page ratio and ideally prevent the percentage of dirty pages from reaching innodb_max_dirty_pages_pct. A value of innodb_max_dirty_pages_pct_lwm=0 disables the pre-flushing behavior.
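A hypothetical sketch of the three regimes implied by these two thresholds (the lwm default used here is illustrative; 75 is the documented innodb_max_dirty_pages_pct default, and lwm=0 disables pre-flushing as described above):

```python
def flushing_state(dirty_pct, lwm=10.0, max_pct=75.0):
    """Classify the dirty-page ratio against the two thresholds above.
    The lwm default of 10.0 is an illustrative assumption."""
    if dirty_pct > max_pct:
        return "aggressive"    # over innodb_max_dirty_pages_pct
    if lwm and dirty_pct >= lwm:
        return "pre-flushing"  # low water mark crossed
    return "idle"

print(flushing_state(5))          # idle
print(flushing_state(20))         # pre-flushing
print(flushing_state(80))         # aggressive
print(flushing_state(20, lwm=0))  # idle: pre-flushing disabled
```

The point of the low water mark is to enter the "pre-flushing" regime early enough that the dirty-page ratio rarely reaches the aggressive-flushing threshold at all.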

Most of the options referenced above are most applicable to servers that run write-heavy workloads for long periods of time and rarely have enough periods of reduced load to catch up with changes waiting to be written to disk.

innodb_flushing_avg_loops defines the number of iterations for which InnoDB keeps the previously calculated snapshot of the flushing state, which controls how quickly adaptive flushing responds to foreground load changes. Setting a high value for innodb_flushing_avg_loops means that InnoDB keeps the previously calculated snapshot longer, so adaptive flushing responds more slowly. A high value also reduces positive feedback between foreground and background work, but when setting a high value it is important to ensure that InnoDB redo log utilization does not reach 75% (the hardcoded limit at which async flushing starts) and that the innodb_max_dirty_pages_pct setting keeps the number of dirty pages to a level that is appropriate for the workload.

Systems with consistent workloads, a large innodb_log_file_size, and small spikes that do not reach 75% redo log space utilization should use a high innodb_flushing_avg_loops value to keep flushing as smooth as possible. For systems with extreme load spikes or log files that do not provide a lot of space, consider a smaller innodb_flushing_avg_loops value. A smaller value allows flushing to closely track the load and helps avoid reaching 75% redo log space utilization.
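As a concrete starting point, the flushing-related options discussed above can be set together in the MySQL option file. The values below are illustrative only and should be tuned against a representative workload, not treated as recommendations:

[mysqld]
innodb_adaptive_flushing=ON
innodb_adaptive_flushing_lwm=10
innodb_io_capacity=200
innodb_io_capacity_max=2000
innodb_max_dirty_pages_pct=75
innodb_max_dirty_pages_pct_lwm=10
innodb_flushing_avg_loops=30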

15.6.3.8 Saving and Restoring the Buffer Pool State

To reduce the warmup period after restarting the server, InnoDB saves a percentage of the most recently used pages for each buffer pool at server shutdown and restores these pages at server startup. The percentage of recently used pages that is stored is defined by the innodb_buffer_pool_dump_at_shutdown configuration option.

After restarting a busy server, there is typically a warmup period with steadily increasing throughput, as disk pages that were in the buffer pool are brought back into memory (as the same data is queried, updated, and so on). The ability to restore the buffer pool at startup shortens the warmup period by reloading disk pages that were in the buffer pool before the restart rather than waiting for DML operations to access corresponding rows. Also, I/O requests can be performed in large batches, making the overall I/O faster. Page loading happens in the background, and does not delay database startup.

In addition to saving the buffer pool state at shutdown and restoring it at startup, you can save and restore the buffer pool state at any time, while the server is running. For example, you can save the state of the buffer pool after reaching a stable throughput under a steady workload. You could also restore the previous buffer pool state after running reports or maintenance jobs that bring data pages into the buffer pool that are only required for those operations, or after running some other non-typical workload.

Even though a buffer pool can be many gigabytes in size, the buffer pool data that InnoDB saves to disk is tiny by comparison. Only tablespace IDs and page IDs necessary to locate the appropriate pages are saved to disk. This information is derived from the INNODB_BUFFER_PAGE_LRU INFORMATION_SCHEMA table. By default, tablespace ID and page ID data is saved in a file named ib_buffer_pool, which is saved to the InnoDB data directory. The file name and location can be modified using the innodb_buffer_pool_filename configuration parameter.

Because data is cached in and aged out of the buffer pool as it is with regular database operations, there is no problem if the disk pages are recently updated, or if a DML operation involves data that has not yet been loaded. The loading mechanism skips requested pages that no longer exist.

The underlying mechanism involves a background thread that is dispatched to perform the dump and load operations.

Disk pages from compressed tables are loaded into the buffer pool in their compressed form. Pages are uncompressed as usual when page contents are accessed during DML operations. Because uncompressing pages is a CPU-intensive process, it is more efficient for concurrency to perform the operation in a connection thread rather than in the single thread that performs the buffer pool restore operation.

Operations related to saving and restoring the buffer pool state are described in the following topics:

Configuring the Dump Percentage for Buffer Pool Pages

Before dumping pages from the buffer pool, you can configure the percentage of most-recently-used buffer pool pages that you want to dump by setting the innodb_buffer_pool_dump_pct option. If you plan to dump buffer pool pages while the server is running, you can configure the option dynamically:

SET GLOBAL innodb_buffer_pool_dump_pct=40;

If you plan to dump buffer pool pages at server shutdown, set innodb_buffer_pool_dump_pct in your configuration file.

[mysqld]
      innodb_buffer_pool_dump_pct=40

The innodb_buffer_pool_dump_pct default value is 25 (dump 25% of most-recently-used pages).

Saving the Buffer Pool State at Shutdown and Restoring it at Startup

To save the state of the buffer pool at server shutdown, issue the following statement prior to shutting down the server:

SET GLOBAL innodb_buffer_pool_dump_at_shutdown=ON;

innodb_buffer_pool_dump_at_shutdown is enabled by default.

To restore the buffer pool state at server startup, specify the --innodb_buffer_pool_load_at_startup option when starting the server:

mysqld --innodb_buffer_pool_load_at_startup=ON

innodb_buffer_pool_load_at_startup is enabled by default.

Saving and Restoring the Buffer Pool State Online

To save the state of the buffer pool while MySQL server is running, issue the following statement:

SET GLOBAL innodb_buffer_pool_dump_now=ON;

To restore the buffer pool state while MySQL is running, issue the following statement:

SET GLOBAL innodb_buffer_pool_load_now=ON;

Displaying Buffer Pool Dump Progress

To display progress when saving the buffer pool state to disk, issue the following statement:

SHOW STATUS LIKE 'Innodb_buffer_pool_dump_status';

If the operation has not yet started, not started is returned. If the operation is complete, the completion time is printed (e.g. Finished at 110505 12:18:02). If the operation is in progress, status information is provided (e.g. Dumping buffer pool 5/7, page 237/2873).

Displaying Buffer Pool Load Progress

To display progress when loading the buffer pool, issue the following statement:

SHOW STATUS LIKE 'Innodb_buffer_pool_load_status';

If the operation has not yet started, not started is returned. If the operation is complete, the completion time is printed (e.g. Finished at 110505 12:23:24). If the operation is in progress, status information is provided (e.g. Loaded 123/22301 pages).

Aborting a Buffer Pool Load Operation

To abort a buffer pool load operation, issue the following statement:

SET GLOBAL innodb_buffer_pool_load_abort=ON;

Monitoring Buffer Pool Load Progress Using Performance Schema

You can monitor buffer pool load progress using Performance Schema.

The following example demonstrates how to enable the stage/innodb/buffer pool load stage event instrument and related consumer tables to monitor buffer pool load progress.

For information about buffer pool dump and load procedures used in this example, see Section 15.6.3.8, “Saving and Restoring the Buffer Pool State”. For information about Performance Schema stage event instruments and related consumers, see Section 25.11.5, “Performance Schema Stage Event Tables”.

  1. Enable the stage/innodb/buffer pool load instrument:

    mysql> UPDATE performance_schema.setup_instruments SET ENABLED = 'YES' 
           WHERE NAME LIKE 'stage/innodb/buffer%';
    
  2. Enable the stage event consumer tables, which include events_stages_current, events_stages_history, and events_stages_history_long.

    mysql> UPDATE performance_schema.setup_consumers SET ENABLED = 'YES' 
           WHERE NAME LIKE '%stages%';
    
  3. Dump the current buffer pool state by enabling innodb_buffer_pool_dump_now.

    mysql> SET GLOBAL innodb_buffer_pool_dump_now=ON;
    
  4. Check the buffer pool dump status to ensure that the operation has completed.

    mysql> SHOW STATUS LIKE 'Innodb_buffer_pool_dump_status'\G
    *************************** 1. row ***************************
    Variable_name: Innodb_buffer_pool_dump_status
            Value: Buffer pool(s) dump completed at 150202 16:38:58
    
  5. Load the buffer pool by enabling innodb_buffer_pool_load_now:

    mysql> SET GLOBAL innodb_buffer_pool_load_now=ON;
    
  6. Check the current status of the buffer pool load operation by querying the Performance Schema events_stages_current table. The WORK_COMPLETED column shows the number of buffer pool pages loaded. The WORK_ESTIMATED column provides an estimate of the remaining work, in pages.

    mysql> SELECT EVENT_NAME, WORK_COMPLETED, WORK_ESTIMATED
           FROM performance_schema.events_stages_current;
    +-------------------------------+----------------+----------------+
    | EVENT_NAME                    | WORK_COMPLETED | WORK_ESTIMATED |
    +-------------------------------+----------------+----------------+
    | stage/innodb/buffer pool load |           5353 |           7167 |
    +-------------------------------+----------------+----------------+
    

    The events_stages_current table returns an empty set if the buffer pool load operation has completed. In this case, you can check the events_stages_history table to view data for the completed event. For example:

    mysql> SELECT EVENT_NAME, WORK_COMPLETED, WORK_ESTIMATED 
           FROM performance_schema.events_stages_history;
    +-------------------------------+----------------+----------------+
    | EVENT_NAME                    | WORK_COMPLETED | WORK_ESTIMATED |
    +-------------------------------+----------------+----------------+
    | stage/innodb/buffer pool load |           7167 |           7167 |
    +-------------------------------+----------------+----------------+
    
Note

You can also monitor buffer pool load progress using Performance Schema when loading the buffer pool at startup using innodb_buffer_pool_load_at_startup. In this case, the stage/innodb/buffer pool load instrument and related consumers must be enabled at startup. For more information, see Section 25.3, “Performance Schema Startup Configuration”.

15.6.3.9 Monitoring the Buffer Pool Using the InnoDB Standard Monitor

InnoDB Standard Monitor output, which can be accessed using SHOW ENGINE INNODB STATUS, provides metrics that pertain to operation of the InnoDB buffer pool. Buffer pool metrics are located in the BUFFER POOL AND MEMORY section of InnoDB Standard Monitor output and appear similar to the following:

----------------------
BUFFER POOL AND MEMORY
----------------------
Total large memory allocated 2198863872
Dictionary memory allocated 776332
Buffer pool size   131072
Free buffers       124908
Database pages     5720
Old database pages 2071
Modified db pages  910
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 4, not young 0
0.10 youngs/s, 0.00 non-youngs/s
Pages read 197, created 5523, written 5060
0.00 reads/s, 190.89 creates/s, 244.94 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not
0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read
ahead 0.00/s
LRU len: 5720, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]

The following table describes InnoDB buffer pool metrics reported by the InnoDB Standard Monitor.

Note

Per second averages provided in InnoDB Standard Monitor output are based on the elapsed time since InnoDB Standard Monitor output was last printed.

Table 15.2 InnoDB Buffer Pool Metrics

Total memory allocated: The total memory allocated for the buffer pool, in bytes.
Dictionary memory allocated: The total memory allocated for the InnoDB data dictionary, in bytes.
Buffer pool size: The total size, in pages, allocated to the buffer pool.
Free buffers: The total size, in pages, of the buffer pool free list.
Database pages: The total size, in pages, of the buffer pool LRU list.
Old database pages: The total size, in pages, of the buffer pool old LRU sublist.
Modified db pages: The current number of pages modified in the buffer pool.
Pending reads: The number of buffer pool pages waiting to be read into the buffer pool.
Pending writes LRU: The number of old dirty pages within the buffer pool to be written from the bottom of the LRU list.
Pending writes flush list: The number of buffer pool pages to be flushed during checkpointing.
Pending writes single page: The number of pending independent page writes within the buffer pool.
Pages made young: The total number of pages made young in the buffer pool LRU list (moved to the head of the sublist of new pages).
Pages made not young: The total number of pages not made young in the buffer pool LRU list (pages that have remained in the old sublist without being made young).
youngs/s: The per-second average of accesses to old pages in the buffer pool LRU list that have resulted in making pages young. See the notes that follow this table for more information.
non-youngs/s: The per-second average of accesses to old pages in the buffer pool LRU list that have resulted in not making pages young. See the notes that follow this table for more information.
Pages read: The total number of pages read from the buffer pool.
Pages created: The total number of pages created within the buffer pool.
Pages written: The total number of pages written from the buffer pool.
reads/s: The per-second average number of buffer pool page reads.
creates/s: The per-second average number of buffer pool pages created.
writes/s: The per-second average number of buffer pool page writes.
Buffer pool hit rate: The buffer pool page hit rate for pages read from buffer pool memory vs. from disk storage.
young-making rate: The average hit rate at which page accesses have resulted in making pages young. See the notes that follow this table for more information.
not (young-making rate): The average hit rate at which page accesses have not resulted in making pages young. See the notes that follow this table for more information.
Pages read ahead: The per-second average of read-ahead operations.
Pages evicted without access: The per-second average of pages evicted from the buffer pool without being accessed.
Random read ahead: The per-second average of random read-ahead operations.
LRU len: The total size, in pages, of the buffer pool LRU list.
unzip_LRU len: The total size, in pages, of the buffer pool unzip_LRU list.
I/O sum: The total number of buffer pool LRU list pages accessed, for the last 50 seconds.
I/O cur: The total number of buffer pool LRU list pages accessed.
I/O unzip sum: The total number of buffer pool unzip_LRU list pages accessed.
I/O unzip cur: The total number of buffer pool unzip_LRU list pages accessed.

Notes:

  • The youngs/s metric only relates to old pages. It is based on the number of accesses to pages and not the number of pages. There can be multiple accesses to a given page, all of which are counted. If you see very low youngs/s values when there are no large scans occurring, you might need to reduce the delay time or increase the percentage of the buffer pool used for the old sublist. Increasing the percentage makes the old sublist larger, so pages in that sublist take longer to move to the tail and to be evicted. This increases the likelihood that those pages will be accessed again and be made young.

  • The non-youngs/s metric only relates to old pages. It is based on the number of accesses to pages and not the number of pages. There can be multiple accesses to a given page, all of which are counted. If you do not see a lot of non-youngs/s when you are doing large table scans (and lots of youngs/s), increase the delay value.

  • The young-making rate accounts for accesses to all buffer pool pages, not just accesses to pages in the old sublist. The young-making rate and not rate do not normally add up to the overall buffer pool hit rate. Page hits in the old sublist cause pages to move to the new sublist, but page hits in the new sublist cause pages to move to the head of the list only if they are a certain distance from the head.

  • not (young-making rate) is the average hit rate at which page accesses have not resulted in making pages young due to the delay defined by innodb_old_blocks_time not being met, or due to page hits in the new sublist that did not result in pages being moved to the head. This rate accounts for accesses to all buffer pool pages, not just accesses to pages in the old sublist.

The InnoDB buffer pool server status variables and the INNODB_BUFFER_POOL_STATS table provide many of the same buffer pool metrics found in InnoDB Standard Monitor output. For more information about the INNODB_BUFFER_POOL_STATS table, see Example 15.10, “Querying the INNODB_BUFFER_POOL_STATS Table”.
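For example, a few of these metrics can be queried per buffer pool instance directly from INFORMATION_SCHEMA (a minimal sketch):

SELECT POOL_ID, POOL_SIZE, FREE_BUFFERS, DATABASE_PAGES, MODIFIED_DATABASE_PAGES
FROM INFORMATION_SCHEMA.INNODB_BUFFER_POOL_STATS;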

15.6.4 Configuring InnoDB Change Buffering

When INSERT, UPDATE, and DELETE operations are performed on a table, the values of indexed columns (particularly the values of secondary index keys) are often in an unsorted order, requiring substantial I/O to bring secondary indexes up to date. InnoDB has a change buffer that caches changes to secondary index entries when the relevant page is not in the buffer pool, thus avoiding expensive I/O operations that would result from reading in the page from disk immediately. The buffered changes are merged when the page is loaded into the buffer pool, and the updated page is later flushed to disk. The InnoDB main thread merges buffered changes when the server is nearly idle, and during a slow shutdown.

Because it can result in fewer disk reads and writes, the change buffer feature is most valuable for workloads that are I/O-bound, for example applications with a high volume of DML operations such as bulk inserts.

However, the change buffer occupies a part of the buffer pool, reducing the memory available to cache data pages. If the working set almost fits in the buffer pool, or if your tables have relatively few secondary indexes, it may be useful to disable change buffering. If the working set fits entirely within the buffer pool, change buffering does not impose extra overhead, because it only applies to pages that are not in the buffer pool.

You can control the extent to which InnoDB performs change buffering using the innodb_change_buffering configuration parameter. You can enable or disable buffering for insert operations, delete operations (when index records are initially marked for deletion), and purge operations (when index records are physically deleted). An update operation is a combination of an insert and a delete. The default innodb_change_buffering value is all.

Permitted innodb_change_buffering values include:

  • all

    The default value: buffer inserts, delete-marking operations, and purges.

  • none

    Do not buffer any operations.

  • inserts

    Buffer insert operations.

  • deletes

    Buffer delete-marking operations.

  • changes

    Buffer both insert and delete-marking operations.

  • purges

    Buffer the physical deletion operations that happen in the background.

You can set the innodb_change_buffering value in the MySQL option file (my.cnf or my.ini) or change it dynamically with the SET GLOBAL statement, which requires the SYSTEM_VARIABLES_ADMIN or SUPER privilege. Changing the setting affects the buffering of new operations; the merging of existing buffered entries is not affected.
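For example, to buffer only insert operations, set the option dynamically at runtime, or in the option file so that the setting survives a restart (the choice of value is illustrative):

SET GLOBAL innodb_change_buffering='inserts';

[mysqld]
innodb_change_buffering=inserts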

Change buffering is not supported for a secondary index if the index contains a descending index column or if the primary key includes a descending index column.

15.6.4.1 Configuring the Maximum Size of the Change Buffer

The innodb_change_buffer_max_size configuration option permits configuring the maximum size of the change buffer as a percentage of the total size of the buffer pool. By default, innodb_change_buffer_max_size is set to 25. The maximum setting is 50.

Consider increasing innodb_change_buffer_max_size on a MySQL server with heavy insert, update, and delete activity, where change buffer merging does not keep pace with new change buffer entries, causing the change buffer to reach its maximum size limit.

Consider decreasing innodb_change_buffer_max_size on a MySQL server with static data used for reporting, or if the change buffer consumes too much of the memory space shared with the buffer pool, causing pages to age out of the buffer pool sooner than desired.

Test different settings with a representative workload to determine an optimal configuration. The innodb_change_buffer_max_size setting is dynamic, which permits modifying the setting without restarting the server.
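For example, to raise the limit to 40% of the buffer pool while the server is running (an illustrative value, not a recommendation):

SET GLOBAL innodb_change_buffer_max_size=40;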

15.6.5 Configuring Thread Concurrency for InnoDB

InnoDB uses operating system threads to process requests from user transactions. (Transactions may issue many requests to InnoDB before they commit or roll back.) On modern operating systems and servers with multi-core processors, where context switching is efficient, most workloads run well without any limit on the number of concurrent threads.

In situations where it is helpful to minimize context switching between threads, InnoDB uses a technique to limit the number of concurrently executing operating system threads (and thus the number of requests that are processed at any one time). When InnoDB receives a new request from a user session, if the number of threads concurrently executing is at a pre-defined limit, the new request sleeps for a short time before it tries again. A request that cannot be rescheduled after the sleep is put in a first-in/first-out queue and eventually is processed. Threads waiting for locks are not counted in the number of concurrently executing threads.

You can limit the number of concurrent threads by setting the configuration parameter innodb_thread_concurrency. Once the number of executing threads reaches this limit, additional threads sleep for a number of microseconds, set by the configuration parameter innodb_thread_sleep_delay, before being placed into the queue.

You can set the configuration option innodb_adaptive_max_sleep_delay to the highest value you would allow for innodb_thread_sleep_delay, and InnoDB automatically adjusts innodb_thread_sleep_delay up or down depending on current thread-scheduling activity. This dynamic adjustment helps the thread scheduling mechanism to work smoothly during times when the system is lightly loaded and when it is operating near full capacity.

The default value of innodb_thread_concurrency and the implied default limit on the number of concurrent threads have varied across versions of MySQL and InnoDB. The default value of innodb_thread_concurrency is 0, so by default there is no limit on the number of concurrently executing threads.

InnoDB causes threads to sleep only when the number of concurrent threads is limited. When there is no limit on the number of threads, all threads contend equally to be scheduled; that is, if innodb_thread_concurrency is 0, the value of innodb_thread_sleep_delay is ignored.

When there is a limit on the number of threads (when innodb_thread_concurrency is > 0), InnoDB reduces context-switching overhead by permitting multiple requests made during the execution of a single SQL statement to enter InnoDB without observing the innodb_thread_concurrency limit. Since an SQL statement (such as a join) may comprise multiple row operations within InnoDB, InnoDB assigns a specified number of “tickets” that allow a thread to be scheduled repeatedly with minimal overhead.

When a new SQL statement starts, a thread has no tickets, and it must observe innodb_thread_concurrency. Once the thread is entitled to enter InnoDB, it is assigned a number of tickets that it can use for subsequently entering InnoDB to perform row operations. If the tickets run out, the thread is evicted, and innodb_thread_concurrency is observed again, which may place the thread back into the first-in/first-out queue of waiting threads. When the thread is once again entitled to enter InnoDB, tickets are assigned again. The number of tickets is specified by the global option innodb_concurrency_tickets, which is 5000 by default. A thread that is waiting for a lock is given one ticket once the lock becomes available.

The correct values of these variables depend on your environment and workload. Try a range of different values to determine what works for your applications. Before limiting the number of concurrently executing threads, review configuration options that may improve the performance of InnoDB on multi-core and multi-processor computers, such as innodb_adaptive_hash_index.
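A sketch of the concurrency-related settings in the option file; the values shown are illustrative starting points, not recommendations:

[mysqld]
innodb_thread_concurrency=16
innodb_thread_sleep_delay=10000
innodb_adaptive_max_sleep_delay=150000
innodb_concurrency_tickets=5000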

15.6.6 Configuring the Number of Background InnoDB I/O Threads

InnoDB uses background threads to service various types of I/O requests. You can configure the number of background threads that service read and write I/O on data pages using the innodb_read_io_threads and innodb_write_io_threads configuration parameters. These parameters signify the number of background threads used for read and write requests, respectively. They are effective on all supported platforms. You can set values for these parameters in the MySQL option file (my.cnf or my.ini); you cannot change the values dynamically. The default value for both parameters is 4, and permitted values range from 1-64.
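For example, in the option file (a server restart is required because these options are not dynamic; the values are illustrative):

[mysqld]
innodb_read_io_threads=8
innodb_write_io_threads=8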

The purpose of these configuration options is to make InnoDB more scalable on high-end systems. Each background thread can handle up to 256 pending I/O requests. A major source of background I/O is read-ahead requests. InnoDB tries to balance the load of incoming requests in such a way that most background threads share work equally. InnoDB also attempts to allocate read requests from the same extent to the same thread, to increase the chances of coalescing the requests. If you have a high-end I/O subsystem and you see more than 64 × innodb_read_io_threads pending read requests in SHOW ENGINE INNODB STATUS output, you might improve performance by increasing the value of innodb_read_io_threads.

On Linux systems, InnoDB uses the asynchronous I/O subsystem by default to perform read-ahead and write requests for data file pages, which changes the way that InnoDB background threads service these types of I/O requests. For more information, see Section 15.6.7, “Using Asynchronous I/O on Linux”.

15.6.7 Using Asynchronous I/O on Linux

InnoDB uses the asynchronous I/O subsystem (native AIO) on Linux to perform read-ahead and write requests for data file pages. This behavior is controlled by the innodb_use_native_aio configuration option, which applies to Linux systems only and is enabled by default. On other Unix-like systems, InnoDB uses synchronous I/O only. Historically, InnoDB used asynchronous I/O on Windows systems only. Using the asynchronous I/O subsystem on Linux requires the libaio library.

With synchronous I/O, query threads queue I/O requests, and InnoDB background threads retrieve the queued requests one at a time, issuing a synchronous I/O call for each. When an I/O request is completed and the I/O call returns, the InnoDB background thread that is handling the request calls an I/O completion routine and returns to process the next request. The number of requests that can be processed in parallel is n, where n is the number of InnoDB background threads. The number of InnoDB background threads is controlled by innodb_read_io_threads and innodb_write_io_threads. See Section 15.6.6, “Configuring the Number of Background InnoDB I/O Threads”.

With native AIO, query threads dispatch I/O requests directly to the operating system, thereby removing the limit imposed by the int of background threads. InnoDB background threads wait for I/O events to signal completed requests. When a request is completed, a background thread calls an I/O completion routine and resumes waiting for I/O events.

The advantage of native AIO is scalability for heavily I/O-bound systems that typically show many pending reads/writes in SHOW ENGINE INNODB STATUS\G output. The increase in parallel processing when using native AIO means that the type of I/O scheduler or properties of the disk array controller have a greater influence on I/O performance.

A potential disadvantage of native AIO for heavily I/O-bound systems is lack of control over the number of I/O write requests dispatched to the operating system at once. Too many I/O write requests dispatched to the operating system for parallel processing could, in some cases, result in I/O read starvation, depending on the amount of I/O activity and system capabilities.

If a problem with the asynchronous I/O subsystem of the OS prevents InnoDB from starting, you can start the server with innodb_use_native_aio=0. This option may also be disabled automatically during server startup if InnoDB detects a potential problem, such as a combination of a tmpdir location and a tmpfs file system on a Linux kernel that does not support asynchronous I/O on tmpfs.
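For example, to disable native AIO at the next server startup:

[mysqld]
innodb_use_native_aio=0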

15.6.8 Configuring the InnoDB Master Thread I/O Rate

The master thread in InnoDB is a background thread that performs various tasks. Most of these tasks are I/O related, such as flushing dirty pages from the buffer pool or writing changes from the insert buffer to the appropriate secondary indexes. The master thread attempts to perform these tasks in a way that does not adversely affect the normal working of the server. It tries to estimate the free I/O bandwidth available and tune its activities to take advantage of this free capacity. Historically, InnoDB has used a hard-coded value of 100 IOPs (input/output operations per second) as the total I/O capacity of the server.

The parameter innodb_io_capacity indicates the overall I/O capacity available to InnoDB. This parameter should be set to approximately the number of I/O operations that the system can perform per second, so the appropriate value depends on your system configuration. When innodb_io_capacity is set, the master thread estimates the I/O bandwidth available for background tasks based on the set value. Setting the value to 100 reverts to the old behavior.

You can set the value of innodb_io_capacity to 100 or greater. The default value is 200, reflecting that the I/O performance of typical modern storage devices is higher than in the early days of MySQL. Typically, the old default of 100 was appropriate for the consumer-level storage of that era, such as hard drives of up to 7200 RPM; faster hard drives, RAID configurations, and SSDs benefit from higher values.

The innodb_io_capacity setting is a total limit for all buffer pool instances. When dirty pages are flushed, the innodb_io_capacity limit is divided equally among buffer pool instances. For more information, see the innodb_io_capacity system variable description.

You can set the value of this parameter in the MySQL option file (my.cnf or my.ini) or change it dynamically with the SET GLOBAL statement, which requires the SYSTEM_VARIABLES_ADMIN or SUPER privilege.
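For example, on SSD-backed or RAID storage you might raise the capacity at runtime (the value is illustrative and should match what your storage can actually sustain):

SET GLOBAL innodb_io_capacity=1000;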

The innodb_flush_sync configuration option causes the innodb_io_capacity setting to be ignored during bursts of I/O activity that occur at checkpoints. innodb_flush_sync is enabled by default.

Formerly, the InnoDB master thread also performed any needed purge operations. Those I/O operations are now performed by other background threads, whose number is controlled by the innodb_purge_threads configuration option.

For more information about InnoDB I/O performance, see Section 8.5.8, “Optimizing InnoDB Disk I/O”.

15.6.9 Configuring Spin Lock Polling

Many InnoDB mutexes and rw-locks are reserved for a short time. On a multi-core system, it can be more efficient for a thread to continuously check whether it can acquire a mutex or rw-lock for a short time before it sleeps. If the mutex or rw-lock becomes available during this polling period, the thread can continue immediately, in the same time slice. However, too-frequent polling by multiple threads can cause “cache ping pong”, where different processors invalidate portions of each other's cache. InnoDB minimizes this issue by waiting a random time between subsequent polls; the delay is implemented as a busy loop.

You can control the maximum delay between polls using the parameter innodb_spin_wait_delay. The duration of the delay loop depends on the C compiler and the target processor. (In the era of 100MHz Pentium processors, the unit of delay was one microsecond.) On a system where all processor cores share a fast cache memory, you might reduce the maximum delay, or disable the busy loop altogether, by setting innodb_spin_wait_delay=0. On a system with multiple processor chips, the effect of cache invalidation can be more significant, and you might increase the maximum delay.

The default value of innodb_spin_wait_delay is 6. This is a dynamic global parameter, so you can configure it in the MySQL option file (my.cnf or my.ini) or change it while the server is running with the statement SET GLOBAL innodb_spin_wait_delay=delay, where delay is the desired maximum delay. Changing the setting requires the SYSTEM_VARIABLES_ADMIN or SUPER privilege.
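For example, on a system where all cores share a fast cache, you might disable the busy-wait delay entirely:

SET GLOBAL innodb_spin_wait_delay=0;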

For performance considerations for InnoDB locking operations, see Section 8.11, “Optimizing Locking Operations”.

15.6.10 Configuring InnoDB Purge Scheduling

The purge operations (a type of garbage collection) that InnoDB performs automatically may be performed by one or more separate threads rather than as part of the master thread. The use of separate threads improves scalability by allowing the main database operations to run independently from maintenance work happening in the background.

The innodb_purge_threads configuration option controls this feature. If DML action is concentrated on a single table or a few tables, keep the setting low so that the threads do not contend with each other for access to the busy tables. If DML operations are spread across many tables, consider increasing the setting; the maximum is 32. innodb_purge_threads is a non-dynamic configuration option, which means it cannot be configured while the server is running.
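For example, for DML activity spread across many tables (an illustrative value; a server restart is required because the option is not dynamic):

[mysqld]
innodb_purge_threads=8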

Another related configuration option, innodb_purge_batch_size, has a default value of 300 and a maximum value of 5000. This option is mainly intended for experimentation and tuning of purge operations, and should not be of interest to typical users.

For more information about InnoDB I/O performance, see Section 8.5.8, “Optimizing InnoDB Disk I/O”.

15.6.11 Configuring Optimizer Statistics for InnoDB

This section describes how to configure persistent and non-persistent optimizer statistics for InnoDB tables.

Persistent optimizer statistics are persisted across server restarts, allowing for greater plan stability and more consistent query performance. Persistent optimizer statistics also provide control and flexibility with these additional benefits:

  • You can use the innodb_stats_auto_recalc configuration option to control whether statistics are updated automatically after substantial changes to a table.

  • You can use the STATS_PERSISTENT, STATS_AUTO_RECALC, and STATS_SAMPLE_PAGES clauses with CREATE TABLE and ALTER TABLE statements to configure optimizer statistics for individual tables.

  • You can query optimizer statistics data in the mysql.innodb_table_stats and mysql.innodb_index_stats tables.

  • You can view the last_update column of the mysql.innodb_table_stats and mysql.innodb_index_stats tables to see when statistics were last updated.

  • You can manually modify the mysql.innodb_table_stats and mysql.innodb_index_stats tables to force a specific query optimization plan or to test alternative plans without modifying the database.

The persistent optimizer statistics feature is enabled by default (innodb_stats_persistent=ON).
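As a sketch of the points above (the database and table names are hypothetical), you can inspect the persisted statistics and even override them manually; manually modified values are loaded when the table is next opened, which FLUSH TABLE forces:

SELECT last_update, n_rows, clustered_index_size
FROM mysql.innodb_table_stats
WHERE database_name='mydb' AND table_name='t1';

UPDATE mysql.innodb_table_stats
SET n_rows=1000000
WHERE database_name='mydb' AND table_name='t1';

FLUSH TABLE mydb.t1;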

Non-persistent optimizer statistics are cleared on each server restart and after some other operations, and recomputed on the next table access. As a result, different estimates could be produced when recomputing statistics, leading to different choices in execution plans and variations in query performance.

This section also provides information about estimating ANALYZE TABLE complexity, which may be useful when attempting to achieve a balance between accurate statistics and ANALYZE TABLE execution time.

15.6.11.1 Configuring Persistent Optimizer Statistics Parameters

The persistent optimizer statistics feature improves plan stability by storing statistics to disk and making them persistent across server restarts so that the optimizer is more likely to make consistent choices each time for a given query.

Optimizer statistics are persisted to disk when innodb_stats_persistent=ON or when individual tables are created or altered with STATS_PERSISTENT=1. innodb_stats_persistent is enabled by default.

Formerly, optimizer statistics were cleared on each server restart and after some other operations, and recomputed on the next table access. Consequently, different estimates could be produced when recalculating statistics, leading to different choices in query execution plans and thus variations in query performance.

Persistent statistics are stored in the mysql.innodb_table_stats and mysql.innodb_index_stats tables, as described in Section 15.6.11.1.5, “InnoDB Persistent Statistics Tables”.

To revert to using non-persistent optimizer statistics, you can modify tables using an ALTER TABLE tbl_name STATS_PERSISTENT=0 statement. For related information, see Section 15.6.11.2, “Configuring Non-Persistent Optimizer Statistics Parameters”.
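For example (the table name is hypothetical):

ALTER TABLE mydb.t1 STATS_PERSISTENT=0;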

15.6.11.1.1 Configuring Automatic Statistics Calculation for Persistent Optimizer Statistics

The innodb_stats_auto_recalc configuration option, which is enabled by default, determines whether statistics are calculated automatically whenever a table undergoes substantial changes (to more than 10% of the rows). You can also configure automatic statistics recalculation for individual tables using a STATS_AUTO_RECALC clause in a CREATE TABLE or ALTER TABLE statement.

Because of the asynchronous nature of automatic statistics recalculation (which occurs in the background), statistics may not be recalculated instantly after running a DML operation that affects more than 10% of a table, even when innodb_stats_auto_recalc is enabled. In some cases, statistics recalculation may be delayed by a few seconds. If up-to-date statistics are required immediately after changing significant portions of a table, run ANALYZE TABLE to initiate a synchronous (foreground) recalculation of statistics.

If innodb_stats_auto_recalc is disabled, ensure the accuracy of optimizer statistics by issuing the ANALYZE TABLE statement for each applicable table after making substantial changes to indexed columns. You might run this statement in your setup scripts after representative data has been loaded into the table, and run it periodically after DML operations significantly change the contents of indexed columns, or on a schedule at times of low activity. When a new index is added to an existing table, index statistics are calculated and added to the innodb_index_stats table regardless of the value of innodb_stats_auto_recalc.

Caution

To ensure statistics are gathered when a new index is created, either enable the innodb_stats_auto_recalc option, or run ANALYZE TABLE after creating each new index when the persistent statistics mode is enabled.

15.6.11.1.2 Configuring Optimizer Statistics Parameters for Individual Tables

innodb_stats_persistent, innodb_stats_auto_recalc, and innodb_stats_persistent_sample_pages are global configuration options. To override these system-wide settings and configure optimizer statistics parameters for individual tables, you can define STATS_PERSISTENT, STATS_AUTO_RECALC, and STATS_SAMPLE_PAGES clauses in CREATE TABLE or ALTER TABLE statements.

  • STATS_PERSISTENT specifies whether to enable persistent statistics for an InnoDB table. The value DEFAULT causes the persistent statistics setting for the table to be determined by the innodb_stats_persistent configuration option. The value 1 enables persistent statistics for the table, while the value 0 turns off this feature. After enabling persistent statistics through a CREATE TABLE or ALTER TABLE statement, issue an ANALYZE TABLE statement to calculate the statistics, after loading representative data into the table.

  • STATS_AUTO_RECALC specifies whether to automatically recalculate persistent statistics for an InnoDB table. The value DEFAULT causes the persistent statistics setting for the table to be determined by the innodb_stats_auto_recalc configuration option. The value 1 causes statistics to be recalculated when 10% of the data in the table has changed. The value 0 prevents automatic recalculation for this table; with this setting, issue an ANALYZE TABLE statement to recalculate the statistics after making substantial changes to the table.

  • STATS_SAMPLE_PAGES specifies the number of index pages to sample when estimating cardinality and other statistics for an indexed column, such as those calculated by ANALYZE TABLE.

All three clauses are specified in the following CREATE TABLE example:

CREATE TABLE `t1` (
`id` int(8) NOT NULL auto_increment,
`data` varchar(255),
`date` datetime,
PRIMARY KEY  (`id`),
INDEX `DATE_IX` (`date`)
) ENGINE=InnoDB,
  STATS_PERSISTENT=1,
  STATS_AUTO_RECALC=1,
  STATS_SAMPLE_PAGES=25;

15.6.11.1.3 Configuring the Number of Sampled Pages for InnoDB Optimizer Statistics

The MySQL query optimizer uses estimated statistics about key distributions to choose the indexes for an execution plan, based on the relative selectivity of the index. Operations such as ANALYZE TABLE cause InnoDB to sample random pages from each index on a table to estimate the cardinality of the index. (This technique is known as random dives.)

To give you control over the quality of the statistics estimate (and thus better information for the query optimizer), you can change the number of sampled pages using the parameter innodb_stats_persistent_sample_pages, which can be set at runtime.

innodb_stats_persistent_sample_pages has a default value of 20. As a general guideline, consider modifying this parameter when encountering the following issues:

  1. Statistics are not accurate enough and the optimizer chooses suboptimal plans, as shown by EXPLAIN output. The accuracy of statistics can be checked by comparing the actual cardinality of an index (as returned by running SELECT DISTINCT on the index columns) with the estimates provided in the mysql.innodb_index_stats persistent statistics table.

    If it is determined that statistics are not accurate enough, the value of innodb_stats_persistent_sample_pages should be increased until the statistics estimates are sufficiently accurate. Increasing innodb_stats_persistent_sample_pages too much, however, could cause ANALYZE TABLE to run slowly.

  2. ANALYZE TABLE is too slow. In this case innodb_stats_persistent_sample_pages should be decreased until ANALYZE TABLE execution time is acceptable. Decreasing the value too much, however, could lead to the first problem of inaccurate statistics and suboptimal query execution plans.

    If a balance cannot be achieved between accurate statistics and ANALYZE TABLE execution time, consider decreasing the number of indexed columns in the table or limiting the number of partitions to reduce ANALYZE TABLE complexity. The number of columns in the table's primary key is also important to consider, as primary key columns are appended to each non-unique index.

    For related information, see Section 15.6.11.3, “Estimating ANALYZE TABLE Complexity for InnoDB Tables”.

15.6.11.1.4 Including Delete-marked Records in Persistent Statistics Calculations

By default, InnoDB reads uncommitted data when calculating statistics. In the case of an uncommitted transaction that deletes rows from a table, InnoDB excludes records that are delete-marked when calculating row estimates and index statistics, which can lead to non-optimal execution plans for other transactions that are operating on the table concurrently using a transaction isolation level other than READ UNCOMMITTED. To avoid this scenario, innodb_stats_include_delete_marked can be enabled to ensure that InnoDB includes delete-marked records when calculating persistent optimizer statistics.

When innodb_stats_include_delete_marked is enabled, ANALYZE TABLE considers delete-marked records when recalculating statistics.

innodb_stats_include_delete_marked is a global setting that affects all InnoDB tables, and it is only applicable to persistent optimizer statistics.
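
Because the setting is global and dynamic, it can be enabled at runtime and followed by a recalculation, as in this sketch (t1 is a placeholder table name):

mysql> SET GLOBAL innodb_stats_include_delete_marked=ON;
mysql> ANALYZE TABLE t1;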

15.6.11.1.5 InnoDB Persistent Statistics Tables

The persistent statistics feature relies on the internally managed tables in the mysql database, named innodb_table_stats and innodb_index_stats. These tables are set up automatically in all install, upgrade, and build-from-source procedures.

Table 15.3 Columns of innodb_table_stats

Column name                Description
database_name              Database name
table_name                 Table name, partition name, or subpartition name
last_update                A timestamp indicating the last time that InnoDB updated this row
n_rows                     The number of rows in the table
clustered_index_size       The size of the primary index, in pages
sum_of_other_index_sizes   The total size of other (non-primary) indexes, in pages

Table 15.4 Columns of innodb_index_stats

Column name        Description
database_name      Database name
table_name         Table name, partition name, or subpartition name
index_name         Index name
last_update        A timestamp indicating the last time that InnoDB updated this row
stat_name          The name of the statistic, whose value is reported in the stat_value column
stat_value         The value of the statistic that is named in stat_name column
sample_size        The number of pages sampled for the estimate provided in the stat_value column
stat_description   Description of the statistic that is named in the stat_name column

Both the innodb_table_stats and innodb_index_stats tables include a last_update column showing when InnoDB last updated index statistics, as shown in the following example:

mysql> select * from innodb_table_stats \G
*************************** 1. row ***************************
           database_name: sakila
              table_name: actor
             last_update: 2014-05-28 16:16:44
                  n_rows: 200
    clustered_index_size: 1
sum_of_other_index_sizes: 1
...
mysql> select * from innodb_index_stats \G
*************************** 1. row ***************************
   database_name: sakila
      table_name: actor
      index_name: PRIMARY
     last_update: 2014-05-28 16:16:44
       stat_name: n_diff_pfx01
      stat_value: 200
     sample_size: 1
     ...

The innodb_table_stats and innodb_index_stats tables are ordinary tables and can be updated manually. The ability to update statistics manually makes it possible to force a specific query optimization plan or test alternative plans without modifying the database. If you manually update statistics, issue the FLUSH TABLE tbl_name command to make MySQL reload the updated statistics.
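
As an illustration of a manual update, the following hypothetical statements force a much larger row estimate for a table and then reload the statistics. The table uses only the documented columns of mysql.innodb_table_stats, but the database name, table name, and the value 1000000 are placeholders:

mysql> UPDATE mysql.innodb_table_stats SET n_rows=1000000
    ->   WHERE database_name='test' AND table_name='t1';
mysql> FLUSH TABLE test.t1;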

15.6.11.1.6 InnoDB Persistent Statistics Tables Example

The innodb_table_stats table contains one row per table. The data collected is demonstrated in the following example.

Table t1 contains a primary index (columns a, b), a secondary index (columns c, d), and a unique index (columns e, f):

CREATE TABLE t1 (
a INT, b INT, c INT, d INT, e INT, f INT,
PRIMARY KEY (a, b), KEY i1 (c, d), UNIQUE KEY i2uniq (e, f)
) ENGINE=INNODB;

After inserting five rows of sample data, the table appears as follows:

mysql> SELECT * FROM t1;
+---+---+------+------+------+------+
| a | b | c    | d    | e    | f    |
+---+---+------+------+------+------+
| 1 | 1 |   10 |   11 |  100 |  101 |
| 1 | 2 |   10 |   11 |  200 |  102 |
| 1 | 3 |   10 |   11 |  100 |  103 |
| 1 | 4 |   10 |   12 |  200 |  104 |
| 1 | 5 |   10 |   12 |  100 |  105 |
+---+---+------+------+------+------+

To immediately update statistics, run ANALYZE TABLE (if innodb_stats_auto_recalc is enabled, statistics are updated automatically within a few seconds assuming that the 10% threshold for changed table rows is reached):

mysql> ANALYZE TABLE t1;
+---------+---------+----------+----------+
| Table   | Op      | Msg_type | Msg_text |
+---------+---------+----------+----------+
| test.t1 | analyze | status   | OK       |
+---------+---------+----------+----------+

Table statistics for table t1 show the last time InnoDB updated the table statistics (2014-03-14 14:36:34), the number of rows in the table (5), the clustered index size (1 page), and the combined size of the other indexes (2 pages).

mysql> SELECT * FROM mysql.innodb_table_stats WHERE table_name like 't1'\G
*************************** 1. row ***************************
           database_name: test
              table_name: t1
             last_update: 2014-03-14 14:36:34
                  n_rows: 5
    clustered_index_size: 1
sum_of_other_index_sizes: 2

The innodb_index_stats table contains multiple rows for each index. Each row in the innodb_index_stats table provides data related to a particular index statistic which is named in the stat_name column and described in the stat_description column. For example:

mysql> SELECT index_name, stat_name, stat_value, stat_description
    -> FROM mysql.innodb_index_stats WHERE table_name like 't1';
+------------+--------------+------------+-----------------------------------+
| index_name | stat_name    | stat_value | stat_description                  |
+------------+--------------+------------+-----------------------------------+
| PRIMARY    | n_diff_pfx01 |          1 | a                                 |
| PRIMARY    | n_diff_pfx02 |          5 | a,b                               |
| PRIMARY    | n_leaf_pages |          1 | Number of leaf pages in the index |
| PRIMARY    | size         |          1 | Number of pages in the index      |
| i1         | n_diff_pfx01 |          1 | c                                 |
| i1         | n_diff_pfx02 |          2 | c,d                               |
| i1         | n_diff_pfx03 |          2 | c,d,a                             |
| i1         | n_diff_pfx04 |          5 | c,d,a,b                           |
| i1         | n_leaf_pages |          1 | Number of leaf pages in the index |
| i1         | size         |          1 | Number of pages in the index      |
| i2uniq     | n_diff_pfx01 |          2 | e                                 |
| i2uniq     | n_diff_pfx02 |          5 | e,f                               |
| i2uniq     | n_leaf_pages |          1 | Number of leaf pages in the index |
| i2uniq     | size         |          1 | Number of pages in the index      |
+------------+--------------+------------+-----------------------------------+

The stat_name column shows the following types of statistics:

  • size: Where stat_name=size, the stat_value column displays the total number of pages in the index.

  • n_leaf_pages: Where stat_name=n_leaf_pages, the stat_value column displays the number of leaf pages in the index.

  • n_diff_pfxNN: Where stat_name=n_diff_pfx01, the stat_value column displays the number of distinct values in the first column of the index. Where stat_name=n_diff_pfx02, the stat_value column displays the number of distinct values in the first two columns of the index, and so on. Additionally, where stat_name=n_diff_pfxNN, the stat_description column shows a comma separated list of the index columns that are counted.

To further illustrate the n_diff_pfxNN statistic, which provides cardinality data, consider the t1 table example. As shown below, the t1 table is created with a primary index (columns a, b), a secondary index (columns c, d), and a unique index (columns e, f):

CREATE TABLE t1 (
  a INT, b INT, c INT, d INT, e INT, f INT,
  PRIMARY KEY (a, b), KEY i1 (c, d), UNIQUE KEY i2uniq (e, f)
) ENGINE=INNODB;

After inserting five rows of sample data, the table appears as follows:

mysql> SELECT * FROM t1;
+---+---+------+------+------+------+
| a | b | c    | d    | e    | f    |
+---+---+------+------+------+------+
| 1 | 1 |   10 |   11 |  100 |  101 |
| 1 | 2 |   10 |   11 |  200 |  102 |
| 1 | 3 |   10 |   11 |  100 |  103 |
| 1 | 4 |   10 |   12 |  200 |  104 |
| 1 | 5 |   10 |   12 |  100 |  105 |
+---+---+------+------+------+------+

When you query the index_name, stat_name, stat_value, and stat_description where stat_name LIKE 'n_diff%', the following result set is returned:

mysql> SELECT index_name, stat_name, stat_value, stat_description
    -> FROM mysql.innodb_index_stats
    -> WHERE table_name like 't1' AND stat_name LIKE 'n_diff%';
+------------+--------------+------------+------------------+
| index_name | stat_name    | stat_value | stat_description |
+------------+--------------+------------+------------------+
| PRIMARY    | n_diff_pfx01 |          1 | a                |
| PRIMARY    | n_diff_pfx02 |          5 | a,b              |
| i1         | n_diff_pfx01 |          1 | c                |
| i1         | n_diff_pfx02 |          2 | c,d              |
| i1         | n_diff_pfx03 |          2 | c,d,a            |
| i1         | n_diff_pfx04 |          5 | c,d,a,b          |
| i2uniq     | n_diff_pfx01 |          2 | e                |
| i2uniq     | n_diff_pfx02 |          5 | e,f              |
+------------+--------------+------------+------------------+

For the PRIMARY index, there are two n_diff% rows. The number of rows is equal to the number of columns in the index.

Note

For non-unique indexes, InnoDB appends the columns of the primary key.

  • Where index_name=PRIMARY and stat_name=n_diff_pfx01, the stat_value is 1, which indicates that there is a single distinct value in the first column of the index (column a). The number of distinct values in column a is confirmed by viewing the data in column a in table t1, in which there is a single distinct value (1). The counted column (a) is shown in the stat_description column of the result set.

  • Where index_name=PRIMARY and stat_name=n_diff_pfx02, the stat_value is 5, which indicates that there are five distinct values in the two columns of the index (a,b). The number of distinct values in columns a and b is confirmed by viewing the data in columns a and b in table t1, in which there are five distinct values: (1,1), (1,2), (1,3), (1,4) and (1,5). The counted columns (a,b) are shown in the stat_description column of the result set.

For the secondary index (i1), there are four n_diff% rows. Only two columns are defined for the secondary index (c,d) but there are four n_diff% rows for the secondary index because InnoDB suffixes all non-unique indexes with the primary key. As a result, there are four n_diff% rows instead of two to account for both the secondary index columns (c,d) and the primary key columns (a,b).

  • Where index_name=i1 and stat_name=n_diff_pfx01, the stat_value is 1, which indicates that there is a single distinct value in the first column of the index (column c). The number of distinct values in column c is confirmed by viewing the data in column c in table t1, in which there is a single distinct value: (10). The counted column (c) is shown in the stat_description column of the result set.

  • Where index_name=i1 and stat_name=n_diff_pfx02, the stat_value is 2, which indicates that there are two distinct values in the first two columns of the index (c,d). The number of distinct values in columns c and d is confirmed by viewing the data in columns c and d in table t1, in which there are two distinct values: (10,11) and (10,12). The counted columns (c,d) are shown in the stat_description column of the result set.

  • Where index_name=i1 and stat_name=n_diff_pfx03, the stat_value is 2, which indicates that there are two distinct values in the first three columns of the index (c,d,a). The number of distinct values in columns c, d, and a is confirmed by viewing the data in columns c, d, and a in table t1, in which there are two distinct values: (10,11,1) and (10,12,1). The counted columns (c,d,a) are shown in the stat_description column of the result set.

  • Where index_name=i1 and stat_name=n_diff_pfx04, the stat_value is 5, which indicates that there are five distinct values in the four columns of the index (c,d,a,b). The number of distinct values in columns c, d, a and b is confirmed by viewing the data in columns c, d, a, and b in table t1, in which there are five distinct values: (10,11,1,1), (10,11,1,2), (10,11,1,3), (10,12,1,4) and (10,12,1,5). The counted columns (c,d,a,b) are shown in the stat_description column of the result set.

For the unique index (i2uniq), there are two n_diff% rows.

  • Where index_name=i2uniq and stat_name=n_diff_pfx01, the stat_value is 2, which indicates that there are two distinct values in the first column of the index (column e). The number of distinct values in column e is confirmed by viewing the data in column e in table t1, in which there are two distinct values: (100) and (200). The counted column (e) is shown in the stat_description column of the result set.

  • Where index_name=i2uniq and stat_name=n_diff_pfx02, the stat_value is 5, which indicates that there are five distinct values in the two columns of the index (e,f). The number of distinct values in columns e and f is confirmed by viewing the data in columns e and f in table t1, in which there are five distinct values: (100,101), (200,102), (100,103), (200,104) and (100,105). The counted columns (e,f) are shown in the stat_description column of the result set.

15.6.11.1.7 Retrieving Index Size Using the innodb_index_stats Table

The size of indexes for tables, partitions, or subpartitions can be retrieved using the innodb_index_stats table. In the following example, index sizes are retrieved for table t1. For a definition of table t1 and corresponding index statistics, see Section 15.6.11.1.6, “InnoDB Persistent Statistics Tables Example”.

mysql> SELECT SUM(stat_value) pages, index_name,
    -> SUM(stat_value)*@@innodb_page_size size
    -> FROM mysql.innodb_index_stats WHERE table_name='t1'
    -> AND stat_name = 'size' GROUP BY index_name;
+-------+------------+-------+
| pages | index_name | size  |
+-------+------------+-------+
|     1 | PRIMARY    | 16384 |
|     1 | i1         | 16384 |
|     1 | i2uniq     | 16384 |
+-------+------------+-------+

For partitions or subpartitions, the same query with a modified WHERE clause can be used to retrieve index sizes. For example, the following query retrieves index sizes for partitions of table t1:

mysql> SELECT SUM(stat_value) pages, index_name,
    -> SUM(stat_value)*@@innodb_page_size size
    -> FROM mysql.innodb_index_stats WHERE table_name like 't1#P%'
    -> AND stat_name = 'size' GROUP BY index_name;

15.6.11.2 Configuring Non-Persistent Optimizer Statistics Parameters

This section describes how to configure non-persistent optimizer statistics. Optimizer statistics are not persisted to disk when innodb_stats_persistent=OFF or when individual tables are created or altered with STATS_PERSISTENT=0. Instead, statistics are stored in memory, and are lost when the server is shut down. Statistics are also updated periodically by certain operations and under certain conditions.
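
For example, either setting disables persistence; both statements below use standard syntax, with t1 as a placeholder table name:

mysql> SET GLOBAL innodb_stats_persistent=OFF;
mysql> CREATE TABLE t1 (a INT, KEY (a)) ENGINE=InnoDB STATS_PERSISTENT=0;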

Optimizer statistics are persisted to disk by default, as enabled by the innodb_stats_persistent configuration option. For information about persistent optimizer statistics, see Section 15.6.11.1, “Configuring Persistent Optimizer Statistics Parameters”.

Optimizer Statistics Updates

Non-persistent optimizer statistics are updated when:

  • Running ANALYZE TABLE.

  • Running SHOW TABLE STATUS, SHOW INDEX, or querying the INFORMATION_SCHEMA.TABLES or INFORMATION_SCHEMA.STATISTICS tables with the innodb_stats_on_metadata option enabled.

  • Starting a mysql client session with the --auto-rehash option enabled, which is the default. The auto-rehash option causes all InnoDB tables to be opened, and the open table operations cause statistics to be recalculated.

  • A table is first opened.

  • InnoDB detects that 1 / 16 of the table has been modified since the last time statistics were updated.

Configuring the Number of Sampled Pages

The MySQL query optimizer uses estimated statistics about key distributions to choose the indexes for an execution plan, based on the relative selectivity of the index. When InnoDB updates optimizer statistics, it samples random pages from each index on a table to estimate the cardinality of the index. (This technique is known as random dives.)

To give you control over the quality of the statistics estimate (and thus better information for the query optimizer), you can change the number of sampled pages using the parameter innodb_stats_transient_sample_pages. The default number of sampled pages is 8, which could be insufficient to produce an accurate estimate, leading to poor index choices by the query optimizer. This technique is especially important for large tables and tables used in joins. Unnecessary full table scans for such tables can be a substantial performance issue. See Section 8.2.1.20, “Avoiding Full Table Scans” for tips on tuning such queries. innodb_stats_transient_sample_pages is a global parameter that can be set at runtime.

The value of innodb_stats_transient_sample_pages affects the index sampling for all InnoDB tables and indexes when innodb_stats_persistent=0. Be aware of the following potentially significant impacts when you change the index sample size:

  • Small values like 1 or 2 can result in inaccurate estimates of cardinality.

  • Increasing the innodb_stats_transient_sample_pages value might require more disk reads. Values much larger than 8 (say, 100) can cause a significant slowdown in the time it takes to open a table or execute SHOW TABLE STATUS.

  • The optimizer might choose very different query plans based on different estimates of index selectivity.

Whatever value of innodb_stats_transient_sample_pages works best for a system, set the option and leave it at that value. Choose a value that results in reasonably accurate estimates for all tables in your database without requiring excessive I/O. Because the statistics are automatically recalculated at various times other than on execution of ANALYZE TABLE, it does not make sense to increase the index sample size, run ANALYZE TABLE, then decrease sample size again.

Smaller tables generally require fewer index samples than larger tables. If your database has many large tables, consider using a higher value for innodb_stats_transient_sample_pages than if you have mostly smaller tables.
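
For example, on a system dominated by large tables, you might raise the transient sample size at runtime. This is only a sketch; the value 48 is an illustrative choice, to be tuned against your own table-open and statistics timings:

mysql> SET GLOBAL innodb_stats_transient_sample_pages=48;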

15.6.11.3 Estimating ANALYZE TABLE Complexity for InnoDB Tables

ANALYZE TABLE complexity for InnoDB tables is dependent on:

  • The number of pages sampled, as defined by innodb_stats_persistent_sample_pages.

  • The number of indexed columns in a table.

  • The number of partitions. If a table has no partitions, the number of partitions is considered to be 1.

Using these parameters, an approximate formula for estimating ANALYZE TABLE complexity would be:

The value of innodb_stats_persistent_sample_pages * number of indexed columns in a table * the number of partitions

Typically, the greater the resulting value, the greater the execution time for ANALYZE TABLE.

Note

innodb_stats_persistent_sample_pages defines the number of pages sampled at a global level. To set the number of pages sampled for an individual table, use the STATS_SAMPLE_PAGES option with CREATE TABLE or ALTER TABLE. For more information, see Section 15.6.11.1, “Configuring Persistent Optimizer Statistics Parameters”.

If innodb_stats_persistent=OFF, the number of pages sampled is defined by innodb_stats_transient_sample_pages. See Section 15.6.11.2, “Configuring Non-Persistent Optimizer Statistics Parameters” for additional information.

For a more in-depth approach to estimating ANALYZE TABLE complexity, consider the following example.

In Big O notation, ANALYZE TABLE complexity is described as:

 O(n_sample
  * (n_cols_in_uniq_i
     + n_cols_in_non_uniq_i
     + n_cols_in_pk * (1 + n_non_uniq_i))
  * n_part)          

where:

  • n_sample is the number of pages sampled (defined by innodb_stats_persistent_sample_pages)

  • n_cols_in_uniq_i is the total number of all columns in all unique indexes (not counting the primary key columns)

  • n_cols_in_non_uniq_i is the total number of all columns in all non-unique indexes

  • n_cols_in_pk is the number of columns in the primary key (if a primary key is not defined, InnoDB creates a single column primary key internally)

  • n_non_uniq_i is the number of non-unique indexes in the table

  • n_part is the number of partitions. If no partitions are defined, the table is considered to be a single partition.

Now, consider the following table (table t), which has a primary key (2 columns), a unique index (2 columns), and two non-unique indexes (two columns each):

 CREATE TABLE t (
  a INT,
  b INT,
  c INT,
  d INT,
  e INT,
  f INT,
  g INT,
  h INT,
  PRIMARY KEY (a, b),
  UNIQUE KEY i1uniq (c, d),
  KEY i2nonuniq (e, f),
  KEY i3nonuniq (g, h)
);    

For the column and index data required by the algorithm described above, query the mysql.innodb_index_stats persistent index statistics table for table t. The n_diff_pfx% statistics show the columns that are counted for each index. For example, columns a and b are counted for the primary key index. For the non-unique indexes, the primary key columns (a,b) are counted in addition to the user defined columns.

Note

For additional information about the InnoDB persistent statistics tables, see 15.6.11.1, “Configuring Persistent Optimizer Statistics Parameters”

  SELECT index_name, stat_name, stat_description
  FROM mysql.innodb_index_stats
  WHERE
  database_name='test' AND
  table_name='t' AND
  stat_name like 'n_diff_pfx%';

  +------------+--------------+------------------+
  | index_name | stat_name    | stat_description |
  +------------+--------------+------------------+
  | PRIMARY    | n_diff_pfx01 | a                |
  | PRIMARY    | n_diff_pfx02 | a,b              |
  | i1uniq     | n_diff_pfx01 | c                |
  | i1uniq     | n_diff_pfx02 | c,d              |
  | i2nonuniq  | n_diff_pfx01 | e                |
  | i2nonuniq  | n_diff_pfx02 | e,f              |
  | i2nonuniq  | n_diff_pfx03 | e,f,a            |
  | i2nonuniq  | n_diff_pfx04 | e,f,a,b          |
  | i3nonuniq  | n_diff_pfx01 | g                |
  | i3nonuniq  | n_diff_pfx02 | g,h              |
  | i3nonuniq  | n_diff_pfx03 | g,h,a            |
  | i3nonuniq  | n_diff_pfx04 | g,h,a,b          |
  +------------+--------------+------------------+   

Based on the index statistics data shown above and the table definition, the following values can be determined:

  • n_cols_in_uniq_i, the total number of all columns in all unique indexes not counting the primary key columns, is 2 (c and d)

  • n_cols_in_non_uniq_i, the total number of all columns in all non-unique indexes, is 4 (e, f, g and h)

  • n_cols_in_pk, the number of columns in the primary key, is 2 (a and b)

  • n_non_uniq_i, the number of non-unique indexes in the table, is 2 (i2nonuniq and i3nonuniq)

  • n_part, the number of partitions, is 1.

You can now calculate innodb_stats_persistent_sample_pages * (2 + 4 + 2 * (1 + 2)) * 1 to determine the number of leaf pages that are scanned. With innodb_stats_persistent_sample_pages set to the default value of 20, and with a default page size of 16 KiB (innodb_page_size=16384), you can then estimate that 20 * 12 * 16384 bytes are read for table t, or about 4 MiB.
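
The arithmetic can be checked directly in the client; this query simply evaluates the formula with the default settings (20 sampled pages, 16 KiB pages), giving roughly 3.75 MiB:

mysql> SELECT 20 * (2 + 4 + 2 * (1 + 2)) * 1 * 16384 / 1048576 AS approx_mib;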

Note

All 4 MiB may not be read from disk, as some leaf pages may already be cached in the buffer pool.

15.6.12 Configuring the Merge Threshold for Index Pages

You can configure the MERGE_THRESHOLD value for index pages. If the page-full percentage for an index page falls below the MERGE_THRESHOLD value when a row is deleted or when a row is shortened by an UPDATE operation, InnoDB attempts to merge the index page with a neighboring index page. The default MERGE_THRESHOLD value is 50, which is the previously hardcoded value. The minimum MERGE_THRESHOLD value is 1 and the maximum value is 50.

When the page-full percentage for an index page falls below 50%, which is the default MERGE_THRESHOLD setting, InnoDB attempts to merge the index page with a neighboring page. If both pages are close to 50% full, a page split can occur soon after the pages are merged. If this merge-split behavior occurs frequently, it can have an adverse effect on performance. To avoid frequent merge-splits, you can lower the MERGE_THRESHOLD value so that InnoDB attempts page merges at a lower page-full percentage. Merging pages at a lower page-full percentage leaves more room in index pages and helps reduce merge-split behavior.

The MERGE_THRESHOLD for index pages can be defined for a table or for individual indexes. A MERGE_THRESHOLD value defined for an individual index takes priority over a MERGE_THRESHOLD value defined for the table. If undefined, the MERGE_THRESHOLD value defaults to 50.

Setting MERGE_THRESHOLD for a Table

You can set the MERGE_THRESHOLD value for a table using the table_option COMMENT clause of the CREATE TABLE statement. For example:

CREATE TABLE t1 (
  id INT,
  KEY id_index (id)
) COMMENT='MERGE_THRESHOLD=45';

You can also set the MERGE_THRESHOLD value for an existing table using the table_option COMMENT clause with ALTER TABLE:

CREATE TABLE t1 (
  id INT,
  KEY id_index (id)
);

ALTER TABLE t1 COMMENT='MERGE_THRESHOLD=40';    

Setting MERGE_THRESHOLD for Individual Indexes

To set the MERGE_THRESHOLD value for an individual index, you can use the index_option COMMENT clause with CREATE TABLE, ALTER TABLE, or CREATE INDEX, as shown in the following examples:

  • Setting MERGE_THRESHOLD for an individual index using CREATE TABLE:

    CREATE TABLE t1 (
       id INT,
      KEY id_index (id) COMMENT 'MERGE_THRESHOLD=40'
    );
  • Setting MERGE_THRESHOLD for an individual index using ALTER TABLE:

    CREATE TABLE t1 (
       id INT,
      KEY id_index (id)
    );
    
    ALTER TABLE t1 DROP KEY id_index;
    ALTER TABLE t1 ADD KEY id_index (id) COMMENT 'MERGE_THRESHOLD=40';
  • Setting MERGE_THRESHOLD for an individual index using CREATE INDEX:

    CREATE TABLE t1 (id INT);
    CREATE INDEX id_index ON t1 (id) COMMENT 'MERGE_THRESHOLD=40';
Note

You cannot modify the MERGE_THRESHOLD value at the index level for GEN_CLUST_INDEX, which is the clustered index created by InnoDB when an InnoDB table is created without a primary key or unique key index. You can only modify the MERGE_THRESHOLD value for GEN_CLUST_INDEX by setting MERGE_THRESHOLD for the table.

Querying the MERGE_THRESHOLD Value for an Index

The current MERGE_THRESHOLD value for an index can be obtained by querying the INNODB_SYS_INDEXES table. For example:

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_SYS_INDEXES WHERE NAME='id_index' \G
*************************** 1. row ***************************
       INDEX_ID: 91
           NAME: id_index
       TABLE_ID: 68
           TYPE: 0
       N_FIELDS: 1
        PAGE_NO: 4
          SPACE: 57
MERGE_THRESHOLD: 40

You can use SHOW CREATE TABLE to view the MERGE_THRESHOLD value for a table, if explicitly defined using the table_option COMMENT clause:

mysql> SHOW CREATE TABLE t2 \G
*************************** 1. row ***************************
       Table: t2
Create Table: CREATE TABLE `t2` (
  `id` int(11) DEFAULT NULL,
  KEY `id_index` (`id`) COMMENT 'MERGE_THRESHOLD=40'
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
Note

A MERGE_THRESHOLD value defined at the index level takes priority over a MERGE_THRESHOLD value defined for the table. If undefined, MERGE_THRESHOLD defaults to 50 (MERGE_THRESHOLD=50), which is the previously hardcoded value.

Likewise, you can use SHOW INDEX to view the MERGE_THRESHOLD value for an index, if explicitly defined using the index_option COMMENT clause:

mysql> SHOW INDEX FROM t2 \G
*************************** 1. row ***************************
        Table: t2
   Non_unique: 1
     Key_name: id_index
 Seq_in_index: 1
  Column_name: id
    Collation: A
  Cardinality: 0
     Sub_part: NULL
       Packed: NULL
         Null: YES
   Index_type: BTREE
      Comment:
Index_comment: MERGE_THRESHOLD=40

Measuring the Effect of MERGE_THRESHOLD Settings

The INNODB_METRICS table provides two counters that can be used to measure the effect of a MERGE_THRESHOLD setting on index page merges.

mysql> SELECT NAME, COMMENT FROM INFORMATION_SCHEMA.INNODB_METRICS
WHERE NAME like '%index_page_merge%';
+-----------------------------+----------------------------------------+
| NAME                        | COMMENT                                |
+-----------------------------+----------------------------------------+
| index_page_merge_attempts   | Number of index page merge attempts    |
| index_page_merge_successful | Number of successful index page merges |
+-----------------------------+----------------------------------------+
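
Most INNODB_METRICS counters, including these two, must be enabled before they accumulate values. A sketch of enabling and reading them (the innodb_monitor_enable variable accepts the % wildcard):

mysql> SET GLOBAL innodb_monitor_enable='index_page_merge%';
mysql> SELECT NAME, COUNT FROM INFORMATION_SCHEMA.INNODB_METRICS
    -> WHERE NAME LIKE 'index_page_merge%';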

When lowering the MERGE_THRESHOLD value, the objectives are:

  • A smaller number of page merge attempts and successful page merges

  • A similar number of page merge attempts and successful page merges

A MERGE_THRESHOLD setting that is too small could result in large data files due to an excessive amount of empty page space.

For information about the INNODB_METRICS counters, see Section 15.14.6, “InnoDB INFORMATION_SCHEMA Metrics Table”.

15.7 InnoDB Tablespaces

This section covers topics related to InnoDB tablespaces.

15.7.1 Resizing the InnoDB System Tablespace

This section describes how to increase or decrease the size of the InnoDB system tablespace.

Increasing the Size of the InnoDB System Tablespace

The easiest way to increase the size of the InnoDB system tablespace is to configure it to be auto-extending. Specify the autoextend attribute for the last data file in the tablespace definition. InnoDB then increases the size of that file automatically in 64MB increments when it runs out of space. The increment size can be changed by setting the value of the innodb_autoextend_increment system variable, which is measured in megabytes.

You can expand the system tablespace by adding another data file:

  1. Shut down the MySQL server.

  2. If the previous last data file is defined with the keyword autoextend, change its definition to use a fixed size, based on how large it has actually grown. Check the size of the data file, round it down to the closest multiple of 1MB, and specify this size explicitly in innodb_data_file_path.

  3. Add a new data file to the end of innodb_data_file_path, optionally making that file auto-extending. Only the last data file in innodb_data_file_path can be specified as auto-extending.

  4. Start the MySQL server again.

For example, this tablespace has just one auto-extending data file ibdata1:

innodb_data_home_dir =
innodb_data_file_path = /ibdata/ibdata1:10M:autoextend

Suppose that this data file, over time, has grown to 988MB. The following configuration lines change the original data file to a fixed size and add a new auto-extending data file:

innodb_data_home_dir =
innodb_data_file_path = /ibdata/ibdata1:988M;/disk2/ibdata2:50M:autoextend

When you add a new data file to the system tablespace configuration, make sure that the file does not already exist in the corresponding directory. InnoDB creates and initializes the file when you restart the server.

Decreasing the Size of the InnoDB System Tablespace

You cannot remove a data file from the system tablespace directly. To decrease the system tablespace size, use this procedure:

  1. Use mysqldump to dump all of your InnoDB tables, including InnoDB tables located in the mysql database.

    mysql> select table_name from information_schema.tables where table_schema='mysql' and engine='InnoDB';
    +---------------------------+
    | table_name                |
    +---------------------------+
    | columns_priv              |
    | db                        |
    | engine_cost               |
    | gtid_executed             |
    | help_category             |
    | help_keyword              |
    | help_relation             |
    | help_topic                |
    | innodb_index_stats        |
    | innodb_table_stats        |
    | plugin                    |
    | procs_priv                |
    | proxies_priv              |
    | server_cost               |
    | servers                   |
    | slave_master_info         |
    | slave_relay_log_info      |
    | slave_worker_info         |
    | tables_priv               |
    | time_zone                 |
    | time_zone_leap_second     |
    | time_zone_name            |
    | time_zone_transition      |
    | time_zone_transition_type |
    | user                      |
    +---------------------------+
  2. 关闭服务器。

  3. 移除所有已存在的表空间文件(*.ibd),包括 ibdata 和 ib_log 文件。不要忘记移除 mysql 库中各表的 *.ibd 文件。

  4. 配置一个新的表空间。

  5. 重启服务器。

  6. 导入dump文件

注意

如果您的数据库中仅使用了 InnoDB 引擎,过程会更简单:dump 全部数据库,关闭服务器,移除所有表空间文件和 InnoDB 日志文件,重启服务器,然后导入 dump 文件。

15.7.2 修改InnoDB重做日志文件个数和大小

执行下面步骤来修改InnoDB重做日志的大小和个数:

  1. 停止MySQL服务器,并确保其关闭过程没有出现错误。

  2. 编辑 my.cnf,修改日志文件的配置。要改变日志文件的大小,配置 innodb_log_file_size;要增加日志文件的个数,配置 innodb_log_files_in_group

  3. 再次启动MySQL服务器。

如果 InnoDB 发现 innodb_log_file_size与重做日志的大小不同,它会写一个日志检查点,然后,关闭并移除老的日志文件,创建新的符合要求大小的日志文件,再打开这个新的日志文件。
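下面是步骤2中提到的两个选项在 my.cnf 中的一个配置片段示意(文件大小和个数均为假设的示例值):

```ini
[mysqld]
# 每个重做日志文件的大小(示例值)
innodb_log_file_size=512M
# 日志组中重做日志文件的个数(示例值)
innodb_log_files_in_group=4
```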

15.7.3 系统表空间使用裸分区

您可以将 InnoDB 系统表空间的数据文件放在裸磁盘分区上。这种技术可以在Windows和一些Linux、Unix系统上实现无缓冲的I/O,没有文件系统开销。请在使用裸分区前后进行测试,以验证这种改变是否确实提升了您系统的性能。

当您使用裸分区时,要确保运行MySQL服务器的用户ID对该磁盘分区有读和写的权限。例如,如果您以 mysql 用户运行服务器,那么分区必须对 mysql 用户可读和可写。如果您使用 --memlock 选项运行服务器,则必须以 root 用户运行服务器,因此分区必须对 root 用户可读和可写。

下面的描述中,涉及到修改选项文件,对于选项文件的信息,请参阅 4.2.6 节, “使用选项文件”

在Linux和Unix系统上配置一个裸磁盘分区

  1. 当您创建新数据文件时,在 innodb_data_file_path 选项中,紧跟在数据文件大小之后指定关键字 newraw。注意,在InnoDB中1MB是1024 × 1024 字节,而磁盘规格中的1MB通常指 1,000,000 字节。

    [mysqld]
    innodb_data_home_dir=
    innodb_data_file_path=/dev/hdd1:3Gnewraw;/dev/hdd2:2Gnewraw
    
  2. 重启服务器。InnoDB 会注意到 newraw 关键字,然后初始化这个新分区。但是,此时不要创建或修改任何 InnoDB 表;否则,当您下次重启服务器时,InnoDB 会重新初始化该分区,您所做的修改将会丢失。(作为一种安全措施,当任何分区指定了 newraw 时,InnoDB 会阻止用户修改数据。)

  3. 在 InnoDB 初始化完新分区后,关闭服务器,将数据文件定义中的 newraw 改为 raw:

    [mysqld]
    innodb_data_home_dir=
    innodb_data_file_path=/dev/hdd1:3Graw;/dev/hdd2:2Graw
    
  4. 重启服务器。现在 InnoDB 允许进行修改了。

在Windows上配置裸磁盘分区

在 Windows 系统上,除了 innodb_data_file_path 的设置,其他的配置步骤与在Linux和Unix上的一样。

  1. 当您创建新数据文件时,在 innodb_data_file_path 选项中,紧跟在数据文件大小之后指定关键字 newraw

    [mysqld]
    innodb_data_home_dir=
    innodb_data_file_path=//./D::10Gnewraw
    

    上面的 //./ 相当于Windows中访问物理驱动器的语法 \\.\。示例中的 D: 是分区的盘符。

  2. 重启服务器。 InnoDB 会注意到 newraw 关键字,然后初始化一个新的磁盘分区。

  3. 在 InnoDB 初始化完新分区后,关闭服务器,将数据文件定义中的 newraw 改为 raw:

    [mysqld]
    innodb_data_home_dir=
    innodb_data_file_path=//./D::10Graw
    
  4. 重启服务器。现在 InnoDB 允许进行修改了。

15.7.4 InnoDB 每个表独立表空间

历史版本中,所有的InnoDB表和索引都存储在 系统表空间中。这种单一的方式面向专用于数据库处理的服务器:数据增长经过精心规划,而且分配给MySQL的磁盘不会挪作他用。InnoDB的 表独立表空间 特性提供了更灵活的选择:每个InnoDB表及其索引存储在单独的 .ibd 数据文件中,每个 .ibd 数据文件代表一个独立的 表空间。这个特性由配置选项 innodb_file_per_table 控制,默认是启用的。

独立表空间的优点

  • 您可以截断或删除使用独立表空间存储的表,来把磁盘空间归还给操作系统。而对存储在 系统表空间 中的表进行截断或删除,只会在系统表空间数据文件内部产生空闲空间,这些空间只能被新的InnoDB数据使用。

    同样的,对驻留在共享表空间中的表执行表复制方式的ALTER TABLE操作,会增加表空间占用的空间。这样的操作可能需要与表中数据加索引同样多的额外空间,而这些额外空间不会像独立表空间那样被释放归还给操作系统。

  • 当表是使用独立表空间存储时,TRUNCATE TABLE操作更快。

  • 为了优化I/O、空间管理或备份,您可以在创建表时使用语法 CREATE TABLE ... DATA DIRECTORY = absolute_path_to_directory 指定每个表的位置,从而将特定的表存放在不同的存储设备上。正如 15.7.5 节, “数据目录外创建表独立表空间”中解释的。

  • 您可以运行 OPTIMIZE TABLE 来压缩或重建一个独立表空间。当您运行 OPTIMIZE TABLE 时,InnoDB会用一个临时名字创建一个新的 .ibd 文件,只使用实际数据所需的空间。优化完成后,InnoDB 移除旧的 .ibd 文件,并用新文件替换它。如果之前的 .ibd 文件曾显著增长,而实际数据只占用其中一部分,那么运行 OPTIMIZE TABLE 可以回收那些未使用的空间。

  • 您可以移动单个InnoDB表,而不是整个库。

  • 您可以将独立的InnoDB表,从一个实例复制到另一个实例中(大家所熟知的 传输表空间 特性)。

  • 使用独立表空间的表,还支持与 DYNAMICCOMPRESSED 行格式相关联的功能特性。

  • 您可以使用动态行格式,对大的 BLOBTEXT 字段进行更有效的存储。

  • 当出现数据损坏、服务器不能成功重启,或者备份和二进制日志不可用时,独立表空间可以提高成功恢复的机会,并节省恢复时间。

  • 在复制或备份表时,独立表空间便于按表进行状态报告。

  • 您可以在文件系统级别监视表的大小,而不需要访问MySQL。

  • innodb_flush_method被设置为 O_DIRECT时,一般Linux文件系统就不允许并发写入到单个文件。因此,当与 innodb_flush_method结合,使用独立表空间时会有性能的提升。

  • 存储数据字典和undo日志的系统表空间,其大小受 InnoDB 表空间大小上限的约束。参阅 15.8.1.7 节, “InnoDB表的限制”。在使用独立表空间时,每个表都有自己的表空间,这就为增长提供了空间。

表文件独立表空间潜在的缺点

  • 因为每个表都保存在独立的表空间中,所以每个表空间中都可能有未使用的空间,如果管理不当,就会导致空间浪费。

  • fsync 操作必须在每个表各自的数据文件上执行,而不是在单个共享文件上。由于每个文件都有独立的 fsync 操作,多个表上的写入不能合并为单次I/O操作,这可能需要 InnoDB 执行更多的 fsync 操作。

  • mysqld必须为每个表保持一个打开的文件句柄,如果独立表空间中的表数量很多,可能会影响性能。

  • 会使用更多的文件描述符。

  • innodb_file_per_table 默认是启用的。如果您需要考虑与MySQL 5.5或更早版本的向后兼容问题,可以将其禁用。禁用 innodb_file_per_table 可以阻止那些重建表(ALGORITHM=COPY)的 ALTER TABLE 操作,将InnoDB表从系统表空间隐式地移动到独立的 .ibd 文件中。

    举个例子。当调整一个InnoDB表上的聚集索引时,会用当前的设置 innodb_file_per_table重建这个表。这种行为不适用于添加或删除 InnoDB 辅助索引。当一个辅助索引被创建时,不会重建这个表,索引储存在和表数据相同的文件中。这个行为也不适用于使用 CREATE TABLE ... TABLESPACEALTER TABLE ... TABLESPACE 语法向系统表空间添加表,这些表不会受到innodb_file_per_table设置的影响。

  • 如果许多表都在增长,就有可能产生更多的碎片,从而影响 DROP TABLE 和表扫描的性能。但是,如果对碎片加以管理,将文件放在各自的表空间中反而可以提升性能。

  • 当drop一个使用独立表空间的表时,会扫描缓冲池;对于几十GB的缓冲池,这可能需要几秒钟。这个扫描持有一个较大范围的内部锁,可能会延迟其他操作。系统表空间中的表不受影响。

  • 变量 innodb_autoextend_increment 控制共享表空间文件写满时自动扩展的增量大小(以MB为单位),但它不适用于独立表空间文件。独立表空间文件初始扩展量很小,之后都以4MB为增量扩展。

15.7.4.1 启用和禁用表文件独立表空间

innodb_file_per_table选项是默认启用的。

在启动时设置 innodb_file_per_table选项。 命令行加上 --innodb_file_per_table 选项来启动服务器, 或者将这行添加到 my.cnf文件的 [mysqld] 选项下:

[mysqld]
innodb_file_per_table=1

你也可以在服务器运行时,动态的设置 innodb_file_per_table

SET GLOBAL innodb_file_per_table=1;

当启用了 innodb_file_per_table 时,您可以将 InnoDB 表存储在各自的 tbl_name.ibd 文件中。不像 MyISAM 存储引擎使用单独的 tbl_name.MYDtbl_name.MYI 文件存储数据和索引,InnoDB 将数据和索引一起存储在单个 .ibd 文件中。

如果您在启动选项中禁用了 innodb_file_per_table 并重启服务器,或者使用 SET GLOBAL 命令禁用了它,那么InnoDB新建的表都会放在系统表空间中,除非您在 CREATE TABLE ... TABLESPACE 中明确指定了使用独立表空间。

不管 innodb_file_per_table 设置如何,您都可以对任何 InnoDB 表进行读写操作。
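要确认某个表当前驻留在哪个表空间,可以查询数据字典视图。下面是一个示意(这里假设使用 INFORMATION_SCHEMA.INNODB_TABLES 视图,表名 test/t1 为假设值):

```sql
-- SPACE 为表空间ID;使用独立表空间的表各自拥有不同的 SPACE 值,
-- SPACE_TYPE 指示表位于 Single(独立)、General(普通)还是 System(系统)表空间
SELECT NAME, SPACE, SPACE_TYPE
FROM INFORMATION_SCHEMA.INNODB_TABLES
WHERE NAME = 'test/t1';
```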

要将表从系统表空间移动到它自己的表空间,先修改 innodb_file_per_table 设置,然后重建这个表:

SET GLOBAL innodb_file_per_table=1;
ALTER TABLE table_name ENGINE=InnoDB;

使用CREATE TABLE ... TABLESPACEALTER TABLE ... TABLESPACE 语法添加到系统表空间的表,不受 innodb_file_per_table 设置的影响。要将这些表从系统表空间移动到独立表空间,必须使用 ALTER TABLE ... TABLESPACE 语法明确地移动。

注意

InnoDB 始终需要系统表空间,因为它把内部的 数据字典 和 undo日志 放在系统表空间中。因此,仅有 .ibd 文件对于 InnoDB 的运作来说还不够。

当一个表从系统表空间被移动到自己的 .ibd 文件后,系统表空间的大小还是和原来一样大。该表以前占用的空间可以被其他InnoDB数据重用,但不能归还给操作系统。当从系统表空间中移出一个大表而磁盘空间又紧张时,您可能更倾向于启用 innodb_file_per_table,然后使用 mysqldump 重建整个实例。正如上面提到的,使用 CREATE TABLE ... TABLESPACEALTER TABLE ... TABLESPACE 语法添加到系统表空间的表,不受 innodb_file_per_table 设置的影响;这样的表必须单独移出。

15.7.5 数据目录外创建表独立表空间

CREATE TABLE 语句中使用 DATA DIRECTORY = absolute_path_to_directory 子句,可以将一个新的使用独立表空间的InnoDB表创建在MySQL数据目录之外的指定位置。

请提前规划好位置,因为 ALTER TABLE 语句不能使用 DATA DIRECTORY 子句。您可以将目录指定到具有特定性能或容量特性的存储设备上,如 SSDHDD

在指定的目录下,MySQL会创建一个对应于数据库名的子目录,并在其中为新表创建 .ibd 文件。

下面展示在MySQL数据目录以外的指定目录下创建InnoDB表:

mysql> USE test;
Database changed

mysql> SHOW VARIABLES LIKE 'innodb_file_per_table';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| innodb_file_per_table | ON    |
+-----------------------+-------+
1 row in set (0.00 sec)

mysql> CREATE TABLE t1 (c1 INT PRIMARY KEY) DATA DIRECTORY = '/alternative/directory';
Query OK, 0 rows affected (0.03 sec)

# MySQL creates a .ibd file for the new table in a subdirectory that corresponds
# to the database name

db_user@ubuntu:~/alternative/directory/test$ ls
t1.ibd

您还可以将 CREATE TABLE ... TABLESPACEDATA DIRECTORY 子句结合使用。如果这样做,您必须指定 innodb_file_per_table 作为表空间名。

CREATE TABLE t2 (c1 INT PRIMARY KEY) TABLESPACE = innodb_file_per_table
  DATA DIRECTORY = '/alternative/directory';

当使用这种方式时,您可以不用启用innodb_file_per_table

使用说明:

  • MySQL initially holds the .ibd file open, preventing you from dismounting the device, but might eventually close the table if the server is busy. Be careful not to accidentally dismount an external device while MySQL is running, or to start MySQL while the device is disconnected. Attempting to access a table when the associated .ibd file is missing causes a serious error that requires a server restart.

    A server restart issues errors and warnings if the .ibd file is not at the expected path. In this case, you can restore the tablespace .ibd file from a backup or drop the table to remove the information about it from the data dictionary.

  • Before placing tables on an NFS-mounted volume, review potential issues outlined in Using NFS with MySQL.

  • If you use an LVM snapshot, file copy, or other file-based mechanism to back up the .ibd file, always use the FLUSH TABLES ... FOR EXPORT statement first to make sure all changes that were buffered in memory are flushed to disk before the backup occurs.

  • The DATA DIRECTORY clause is a supported alternative to using symbolic links, which has always been problematic and was never supported for individual InnoDB tables.

15.7.6 复制独立表空间到其他实例

本节描述如何将一个使用独立表空间的表从一个实例复制到另一个实例,即通常所说的 传输表空间 特性。该特性还支持分区的InnoDB表,以及单独的分区和子分区。

您可能有很多理由将独立的 InnoDB表空间传输到不同的实例中:

  • To run reports without putting extra load on a production server.

  • To set up identical data for a table on a new slave server.

  • To restore a backed-up version of a table or partition after a problem or mistake.

  • As a faster way of moving data around than importing the results of a mysqldump command. The data is available immediately, rather than having to be re-inserted and the indexes rebuilt.

  • To move a file-per-table tablespace to a server with storage medium that better suits system requirements. For example, you may want to have busy tables on an SSD device, or large tables on a high-capacity HDD device.

局限性和使用说明

  • 只有当 innodb_file_per_table 启用时,才能进行表空间复制流程。驻留在共享系统表空间中的表无法被静默(quiesce)。

  • 当一个表被静默时,仅允许对受影响的表执行只读事务。

  • 当导入一个表空间时,其页大小必须与执行导入的实例的页大小相匹配。

  • ALTER TABLE ... DISCARD TABLESPACE 支持分区的InnoDB表,ALTER TABLE ... DISCARD PARTITION ... TABLESPACE 支持InnoDB表的单个分区。

  • 当 foreign_key_checks 设置为1时,DISCARD TABLESPACE 不支持具有父子(主外键)关系的表空间。因此,在对有父子关系的表进行操作之前,请设置 foreign_key_checks=0。另外,分区的InnoDB表不支持外键。

  • ALTER TABLE ... IMPORT TABLESPACE 不会强制导入有外键约束的数据,所以,如果表之间有外键约束,应该在同样的(逻辑)时间点将所有的表都导出。另外,分区的InnoDB表不支持外键。

  • ALTER TABLE ... IMPORT TABLESPACEALTER TABLE ... IMPORT PARTITION ... TABLESPACE 导入表空间时并不要求 .cfg 元数据文件。但是,没有 .cfg 文件时,导入不会执行元数据检查,并发出与下面类似的警告:

    Message: InnoDB: IO Read error: (2, No such file or directory) Error opening '.\
    test\t.cfg', will attempt to import without schema verification
    1 row in set (0.00 sec)
    

    The ability to import without a .cfg file may be more convenient when no schema mismatches are expected. Additionally, the ability to import without a .cfg file could be useful in crash recovery scenarios in which metadata cannot be collected from an .ibd file.

    If no .cfg file is used, InnoDB uses the equivalent of a SELECT MAX(ai_col) FROM table_name FOR UPDATE statement to initialize the in-memory auto-increment counter that is used in assigning values to an AUTO_INCREMENT column. Otherwise, the current maximum auto-increment counter value is read from the .cfg metadata file. For related information, see InnoDB AUTO_INCREMENT Counter Initialization.

  • 由于 .cfg 元数据文件的限制,在为分区表导入表空间文件时,不会报告分区类型或分区定义差异导致的schema不匹配,但会报告列上的差异。

  • 当在有子分区的表上运行 ALTER TABLE ... DISCARD PARTITION ... TABLESPACEALTER TABLE ... IMPORT PARTITION ... TABLESPACE ,需要同时指定分区和子分区名,如果指定了一个分区的名,那么就包含它下面的所有子分区。

  • 如果两个MySQL服务器实例都是GA(正式发布)版本,并且版本处于相同的系列,那么可以将表空间文件从一个实例导入另一个实例;否则,文件必须导入到创建它的同一个服务器实例中。

  • 在复制场景,主从都必须将innodb_file_per_table设置为ON

  • 在Windows系统上,InnoDB内部以小写形式存储库名、表空间名和表名。为了避免在Linux、Unix等大小写敏感的操作系统上出现导入问题,请在创建所有的库、表空间和表时使用小写名字。一个便捷的方法是,在创建库、表空间和表之前,在选项文件 my.cnfmy.ini[mysqld] 下添加下面内容:

    [mysqld]
    lower_case_table_names=1
    
  • 属于普通InnoDB表空间的表,不支持 ALTER TABLE ... DISCARD TABLESPACEALTER TABLE ... IMPORT TABLESPACE。参阅 CREATE TABLESPACE.

  • 配置选项 innodb_default_row_format 定义InnoDB的默认行格式。如果在建表时没有明确定义行格式(ROW_FORMAT),而目标实例的 innodb_default_row_format 设置与源实例不同,导入时就会产生schema不匹配错误。相关信息,请参阅 15.10.2 节, “指定表的行格式”

  • When exporting a tablespace that is encrypted using the InnoDB tablespace encryption feature, InnoDB generates a .cfp file in addition to a .cfg metadata file. The .cfp file must be copied to the destination instance together with the .cfg file and tablespace file before performing the ALTER TABLE ... IMPORT TABLESPACE operation on the destination instance. The .cfp file contains a transfer key and an encrypted tablespace key. On import, InnoDB uses the transfer key to decrypt the tablespace key. For related information, see 15.7.10, “InnoDB Tablespace Encryption”.
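在执行导出/导入之前,可以先在源实例和目标实例上比较默认行格式,以提前发现潜在的schema不匹配问题。下面是一个示意(表名 t1 为假设值):

```sql
-- 分别在源实例和目标实例上执行,比较两者的默认行格式
SELECT @@innodb_default_row_format;

-- 查看表定义中是否明确指定了 ROW_FORMAT
SHOW CREATE TABLE t1;
```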

15.7.6.1 传输表空间示例

注意

如果您传输的表的表空间是加密的,操作之前请参阅 局限性和使用说明

示例 1: 复制一个InnoDB表到另一个实例

下面介绍如何将一个使用独立表空间的普通 InnoDB 表,从一个正在运行的MySQL服务器实例复制到另一个正在运行的实例。经过少量调整,相同的流程也可用于在同一实例上执行全表恢复。

  1. 在源实例上,创建一个表(如果不存在的话):

    mysql> use test;
    mysql> CREATE TABLE t(c1 INT) engine=InnoDB;
    
  2. 在目标实例上,创建同样的表(如果不存在的话):

    mysql> use test;
    mysql> CREATE TABLE t(c1 INT) engine=InnoDB;
    
  3. 在目标实例上,丢弃已存在的表空间。(在导入表空间前,InnoDB必须丢弃与接收表相关联的表空间。)

    mysql> ALTER TABLE t DISCARD TABLESPACE;
    
  4. 在源实例上,运行 FLUSH TABLES ... FOR EXPORT 来静默这个表,并创建 .cfg 元数据文件:

    mysql> use test;
    mysql> FLUSH TABLES t FOR EXPORT;
    

    元数据文件(.cfg)被创建在InnoDB数据目录下。

    注意

    The FLUSH TABLES ... FOR EXPORT statement ensures that changes to the named table have been flushed to disk so that a binary table copy can be made while the instance is running. When FLUSH TABLES ... FOR EXPORT is run, InnoDB produces a .cfg file in the same database directory as the table. The .cfg file contains metadata used for schema verification when importing the tablespace file.

  5. 从源实例复制.ibd.cfg文件到目标实例上 。如:

    shell> scp /path/to/datadir/test/t.{ibd,cfg} destination-server:/path/to/datadir/test
    
    注意

    必须在释放共享锁之前完成 .ibd.cfg 文件的复制,正如下一步骤所描述的。

  6. 在源实例上,使用 UNLOCK TABLES 来释放被 FLUSH TABLES ... FOR EXPORT获得的锁:

    mysql> use test;
    mysql> UNLOCK TABLES;
    
  7. 在目标实例上,导入表空间:

    mysql> use test;
    mysql> ALTER TABLE t IMPORT TABLESPACE;
    
    注意

    The ALTER TABLE ... IMPORT TABLESPACE feature does not enforce foreign key constraints on imported data. If there are foreign key constraints between tables, all tables should be exported at the same (logical) point in time. In this case you would stop updating the tables, commit all transactions, acquire shared locks on the tables, and then perform the export operation.
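导入完成后,数据立即可用。可以在目标实例上做一个简单的校验,例如(示意):

```sql
-- 在目标实例上检查导入的表
USE test;
CHECK TABLE t;
SELECT COUNT(*) FROM t;
```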

Example 2: Copying an InnoDB Partitioned Table to Another Instance

This procedure demonstrates how to copy a partitioned InnoDB table from a running MySQL server instance to another running instance. The same procedure with minor adjustments can be used to perform a full restore of a partitioned InnoDB table on the same instance.

  1. On the source instance, create a partitioned table if one does not exist. In the following example, a table with three partitions (p0, p1, p2) is created:

    mysql> use test;
    mysql> CREATE TABLE t1 (i int) ENGINE = InnoDB PARTITION BY KEY (i) PARTITIONS 3;
    

    In the /datadir/test directory, there is a separate tablespace (.ibd) file for each of the three partitions.

    mysql> \! ls /path/to/datadir/test/
    t1#P#p0.ibd  t1#P#p1.ibd  t1#P#p2.ibd
    
  2. On the destination instance, create the same partitioned table:

    mysql> use test;
    mysql> CREATE TABLE t1 (i int) ENGINE = InnoDB PARTITION BY KEY (i) PARTITIONS 3;
    

    In the /datadir/test directory, there is a separate tablespace (.ibd) file for each of the three partitions.

    mysql> \! ls /path/to/datadir/test/
    t1#P#p0.ibd  t1#P#p1.ibd  t1#P#p2.ibd
    
  3. On the destination instance, discard the tablespace for the partitioned table. (Before the tablespace can be imported on the destination instance, the tablespace that is attached to the receiving table must be discarded.)

    mysql> ALTER TABLE t1 DISCARD TABLESPACE;
    

    The three .ibd files that make up the tablespace for the partitioned table are discarded from the /datadir/test directory.

  4. On the source instance, run FLUSH TABLES ... FOR EXPORT to quiesce the partitioned table and create the .cfg metadata files:

    mysql> use test;
    mysql> FLUSH TABLES t1 FOR EXPORT;
    

    Metadata (.cfg) files, one for each tablespace (.ibd) file, are created in the /datadir/test directory on the source instance:

    mysql> \! ls /path/to/datadir/test/
    t1#P#p0.ibd  t1#P#p1.ibd  t1#P#p2.ibd
    t1#P#p0.cfg  t1#P#p1.cfg  t1#P#p2.cfg
    
    Note

    FLUSH TABLES ... FOR EXPORT statement ensures that changes to the named table have been flushed to disk so that binary table copy can be made while the instance is running. When FLUSH TABLES ... FOR EXPORT is run, InnoDB produces a .cfg metadata file for the table's tablespace files in the same database directory as the table. The .cfg files contain metadata used for schema verification when importing tablespace files. FLUSH TABLES ... FOR EXPORT can only be run on the table, not on individual table partitions.

  5. Copy the .ibd and .cfg files from the source instance database directory to the destination instance database directory. For example:

    shell> scp /path/to/datadir/test/t1*.{ibd,cfg} destination-server:/path/to/datadir/test
    
    Note

    The .ibd and .cfg files must be copied before releasing the shared locks, as described in the next step.

  6. On the source instance, use UNLOCK TABLES to release the locks acquired by FLUSH TABLES ... FOR EXPORT:

    mysql> use test;
    mysql> UNLOCK TABLES;
    
  7. On the destination instance, import the tablespace for the partitioned table:

    mysql> use test;
    mysql> ALTER TABLE t1 IMPORT TABLESPACE;
    
Example 3: Copying InnoDB Table Partitions to Another Instance

This procedure demonstrates how to copy InnoDB table partitions from a running MySQL server instance to another running instance. The same procedure with minor adjustments can be used to perform a restore of InnoDB table partitions on the same instance. In the following example, a partitioned table with four partitions (p0, p1, p2, p3) is created on the source instance. Two of the partitions (p2 and p3) are copied to the destination instance.

  1. On the source instance, create a partitioned table if one does not exist. In the following example, a table with four partitions (p0, p1, p2, p3) is created:

    mysql> use test;
    mysql> CREATE TABLE t1 (i int) ENGINE = InnoDB PARTITION BY KEY (i) PARTITIONS 4;
    

    In the /datadir/test directory, there is a separate tablespace (.ibd) file for each of the four partitions.

    mysql> \! ls /path/to/datadir/test/
    t1#P#p0.ibd  t1#P#p1.ibd  t1#P#p2.ibd t1#P#p3.ibd
    
  2. On the destination instance, create the same partitioned table:

    mysql> use test;
    mysql> CREATE TABLE t1 (i int) ENGINE = InnoDB PARTITION BY KEY (i) PARTITIONS 4;
    

    In the /datadir/test directory, there is a separate tablespace (.ibd) file for each of the four partitions.

    mysql> \! ls /path/to/datadir/test/
    t1#P#p0.ibd  t1#P#p1.ibd  t1#P#p2.ibd t1#P#p3.ibd
    
  3. On the destination instance, discard the tablespace partitions that you plan to import from the source instance. (Before tablespace partitions can be imported on the destination instance, the corresponding partitions that are attached to the receiving table must be discarded.)

    mysql> ALTER TABLE t1 DISCARD PARTITION p2, p3 TABLESPACE;
    

    The .ibd files for the two discarded partitions are removed from the /datadir/test directory on the destination instance, leaving the following files:

    mysql> \! ls /path/to/datadir/test/
    t1#P#p0.ibd  t1#P#p1.ibd
    
    Note

    When ALTER TABLE ... DISCARD PARTITION ... TABLESPACE is run on subpartitioned tables, both partition and subpartition table names are allowed. When a partition name is specified, subpartitions of that partition are included in the operation.

  4. On the source instance, run FLUSH TABLES ... FOR EXPORT to quiesce the partitioned table and create the .cfg metadata files.

    mysql> use test;
    mysql> FLUSH TABLES t1 FOR EXPORT;
    

    The metadata files (.cfg files) are created in the /datadir/test directory on the source instance. There is a .cfg file for each tablespace (.ibd) file.

    mysql> \! ls /path/to/datadir/test/
    t1#P#p0.ibd  t1#P#p1.ibd  t1#P#p2.ibd t1#P#p3.ibd
    t1#P#p0.cfg  t1#P#p1.cfg  t1#P#p2.cfg t1#P#p3.cfg
    
    Note

    FLUSH TABLES ... FOR EXPORT statement ensures that changes to the named table have been flushed to disk so that binary table copy can be made while the instance is running. When FLUSH TABLES ... FOR EXPORT is run, InnoDB produces a .cfg metadata file for the table's tablespace files in the same database directory as the table. The .cfg files contain metadata used for schema verification when importing tablespace files. FLUSH TABLES ... FOR EXPORT can only be run on the table, not on individual table partitions.

  5. Copy the .ibd and .cfg files from the source instance database directory to the destination instance database directory. In this example, only the .ibd and .cfg files for partition 2 (p2) and partition 3 (p3) are copied to the data directory on the destination instance. Partition 0 (p0) and partition 1 (p1) remain on the source instance.

    shell> scp t1#P#p2.ibd  t1#P#p2.cfg t1#P#p3.ibd t1#P#p3.cfg destination-server:/path/to/datadir/test
    
    Note

    The .ibd files and .cfg files must be copied before releasing the shared locks, as described in the next step.

  6. On the source instance, use UNLOCK TABLES to release the locks acquired by FLUSH TABLES ... FOR EXPORT:

    mysql> use test;
    mysql> UNLOCK TABLES;
    
  7. On the destination instance, import the tablespace partitions (p2 and p3):

    mysql> use test;
    mysql> ALTER TABLE t1 IMPORT PARTITION p2, p3 TABLESPACE;
    
    Note

    When ALTER TABLE ... IMPORT PARTITION ... TABLESPACE is run on subpartitioned tables, both partition and subpartition table names are allowed. When a partition name is specified, subpartitions of that partition are included in the operation.

15.7.6.2 Transportable Tablespace Internals

下面的信息描述了对一个普通InnoDB表执行传输表空间复制流程时的内部机制,以及相应的错误日志信息。

当在目标实例上运行 ALTER TABLE ... DISCARD TABLESPACE

  • 表以排他(X)模式被锁定。

  • 表空间与表分离。

当在源实例上运行 FLUSH TABLES ... FOR EXPORT

  • 导出的表以共享模式锁定,同时被flush。

  • purge协作线程被停止。

  • 脏页被同步到磁盘。

  • 表的元数据被写到二进制 .cfg文件。

这个操作意料之中的错误信息:

2013-09-24T13:10:19.903526Z 2 [Note] InnoDB: Sync to disk of '"test"."t"' started.
2013-09-24T13:10:19.903586Z 2 [Note] InnoDB: Stopping purge
2013-09-24T13:10:19.903725Z 2 [Note] InnoDB: Writing table metadata to './test/t.cfg'
2013-09-24T13:10:19.904014Z 2 [Note] InnoDB: Table '"test"."t"' flushed to disk
 

当在源实例上运行UNLOCK TABLES

  • 二进制 .cfg 文件被删除了。

  • 表上的共享锁被释放,并且purge协作线程被重启。

这个操作意料之中的错误信息:

2013-09-24T13:10:21.181104Z 2 [Note] InnoDB: Deleting the meta-data file './test/t.cfg'
2013-09-24T13:10:21.181180Z 2 [Note] InnoDB: Resuming purge

当在目标实例上运行 ALTER TABLE ... IMPORT TABLESPACE,导入算法对导入的每个表空间执行以下操作:

  • Each tablespace page is checked for corruption.

  • The space ID and log sequence numbers (LSNs) on each page are updated.

  • Flags are validated and LSN updated for the header page.

  • Btree pages are updated.

  • The page state is set to dirty so that it is written to disk.

这个操作意料之中的错误信息:

2013-07-18 15:15:01 34960 [Note] InnoDB: Importing tablespace for table 'test/t' that was exported from host 'ubuntu'
2013-07-18 15:15:01 34960 [Note] InnoDB: Phase I - Update all pages
2013-07-18 15:15:01 34960 [Note] InnoDB: Sync to disk
2013-07-18 15:15:01 34960 [Note] InnoDB: Sync to disk - done!
2013-07-18 15:15:01 34960 [Note] InnoDB: Phase III - Flush changes to disk
2013-07-18 15:15:01 34960 [Note] InnoDB: Phase IV - Flush complete
提示

您可能还会收到表空间已被丢弃的警告(如果您丢弃了目标表的表空间),以及由于 .ibd 文件缺失而无法计算统计信息的消息:

2013-07-18 15:14:38 34960 [Warning] InnoDB: Table "test"."t" tablespace is set as discarded.
2013-07-18 15:14:38 7f34d9a37700 InnoDB: cannot calculate statistics for table "test"."t" because the .ibd file is missing. For help, please refer to
http://dev.mysql.com/doc/refman/5.8/en/innodb-troubleshooting.html

15.7.7 配置Undo表空间

默认情况下,undo日志驻留在两个undo表空间中。在之前的MySQL版本中,undo日志默认保存在系统表空间中。undo日志的I/O模式使undo表空间很适合放在 SSD 存储上。由于长时间运行的事务会使undo日志变得很大,将undo日志分布在多个undo表空间中,可以避免任何单个undo表空间达到大小上限。

配置undo表空间的数量

innodb_undo_tablespaces 配置选项定义被InnoDB使用的undo表空间个数,默认值是2,您可以在服务器启动或运行时配置 innodb_undo_tablespaces。

增加 innodb_undo_tablespaces设置,将会创建指定数量的undo表空间,并且将其添加到活动的undo表空间列表中。减少 innodb_undo_tablespaces 设置,会从活动的undo表空间列表中移除undo表空间。但是,只有当该undo表空间中已存在的事务不再使用表空间时,这个undo表空间才能被移除。此时undo表空间只是处于非活动状态,而并非被删除了,因此可以很容易再次增加活动的undo表空间。
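例如,将活动undo表空间的数量从默认的2增加到4,可以在运行时这样设置(示意;本节描述的版本允许在运行时修改 innodb_undo_tablespaces):

```sql
-- 查看当前的undo表空间数量
SELECT @@innodb_undo_tablespaces;

-- 创建并激活两个新的undo表空间
SET GLOBAL innodb_undo_tablespaces=4;
```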

如果将innodb_undo_tablespaces设置为 0 ,系统表空间会被用作回滚段。如果这个值大于0,就表示系统表空间中的回滚段处于非活动状态,从而不再为其分配事务。

注意

自MySQL 8.0.3起,不再允许将 innodb_undo_tablespaces 设置为 0,这就意味着回滚段不再在系统表空间中创建。innodb_undo_tablespaces 的最小值是2。

不能删除undo表空间或其中的单个段,但是可以截断存储在undo表空间中的undo日志。启用undo日志截断至少需要两个undo表空间。更多信息,请参阅 15.7.8 节, “截断 Undo 表空间”

Undo表空间的位置配置

Undo表空间文件的创建位置是由配置选项 innodb_undo_directory 定义的。这个选项通常用于将undo日志放在不同的存储设备上。如果没有指定路径,undo表空间就创建在MySQL数据目录下。innodb_undo_directory 选项是非动态的,需要重启服务器才能生效。

Undo表空间文件的命名格式是 undo_NNN,其中 NNN 是介于1和127之间的undo空间编号。undo空间编号与undo空间ID的对应关系如下:

  • undo空间数 = 0xFFFFFFF0 - undo空间ID

  • undo空间ID = 0xFFFFFFF0 - undo空间数

默认一个undo表空间的大小是10MiB。

配置回滚段的数量

配置选项innodb_rollback_segments 定义分配给每个undo表空间的 回滚段数量。这个选项可以在服务器启动或运行时修改。

提示

配置选项innodb_rollback_segments还定义了分配给 系统表空间临时表空间的回滚段数量。如果 innodb_undo_tablespaces 被设置为大于0,那么分配给系统表空间的回滚段是非活动的。

innodb_rollback_segments的默认设置是128,同样也是最大值,每个回滚段能支持最多1023个数据修改事务。
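例如,查看并调整分配给每个undo表空间的回滚段数量(示意;64为假设的示例值):

```sql
-- 查看当前的回滚段数量(默认和最大值都是128)
SELECT @@innodb_rollback_segments;

-- innodb_rollback_segments 是动态变量,可以在运行时修改
SET GLOBAL innodb_rollback_segments=64;
```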

15.7.8 截断 Undo 表空间

要截断undo表空间,必须至少存在2个undo表空间:这样一个undo表空间可以保持活动状态,而另一个被脱机截断。undo表空间的个数由 innodb_undo_tablespaces 选项定义。查看当前的个数:

mysql> SELECT @@innodb_undo_tablespaces;
+---------------------------+
| @@innodb_undo_tablespaces |
+---------------------------+
|                         2 |
+---------------------------+

Enabling Truncation of Undo Tablespaces

要截断undo表空间,需要启用 innodb_undo_log_truncate

mysql> SET GLOBAL innodb_undo_log_truncate=ON;

当启用 innodb_undo_log_truncate 时,超过 innodb_max_undo_log_size 所定义大小上限的undo表空间文件会被标记为待截断。innodb_max_undo_log_size 是一个动态全局变量,默认值为 1024 MiB(1073741824 字节)。

mysql> SELECT @@innodb_max_undo_log_size;
+----------------------------+
| @@innodb_max_undo_log_size |
+----------------------------+
|                 1073741824 |
+----------------------------+

You can configure innodb_max_undo_log_size using a SET GLOBAL statement:

mysql> SET GLOBAL innodb_max_undo_log_size=2147483648;

When innodb_undo_log_truncate is enabled:

  1. Undo tablespaces that exceed the innodb_max_undo_log_size setting are marked for truncation. Selection of an undo tablespace for truncation is performed in a circular fashion to avoid truncating the same undo tablespace each time.

  2. Rollback segments residing in the selected undo tablespace are made inactive so that they are not assigned to new transactions. Existing transactions that are currently using rollback segments are allowed to complete.

  3. The purge system frees rollback segments that are no longer needed.

  4. After all rollback segments in the undo tablespace are freed, the truncate operation runs and the undo tablespace is truncated to its initial size. The initial size of an undo tablespace file is 10MiB.

    Note

    The size of an undo tablespace after a truncate operation may be larger than 10MiB due to immediate use following the completion of the operation. The innodb_undo_directory option defines the location of undo tablespace files. The default value of . represents the directory where InnoDB creates other log files by default.

    mysql> SELECT @@innodb_undo_directory;
    +-------------------------+
    | @@innodb_undo_directory |
    +-------------------------+
    | .                       |
    +-------------------------+
    
  5. The rollback segments are reactivated so that they can be assigned to new transactions.

Expediting Truncation of Undo Tablespace Files

An undo tablespace cannot be truncated until its rollback segments are freed. Normally, the purge system frees rollback segments once every 128 times that purge is invoked. To expedite the truncation of undo tablespaces, use the innodb_purge_rseg_truncate_frequency option to temporarily increase the frequency with which the purge system frees rollback segments. The default innodb_purge_rseg_truncate_frequency setting is 128, which is also the maximum value.

mysql> SELECT @@innodb_purge_rseg_truncate_frequency;
+----------------------------------------+
| @@innodb_purge_rseg_truncate_frequency |
+----------------------------------------+
|                                    128 |
+----------------------------------------+

To increase the frequency with which the purge thread frees rollback segments, decrease the value of innodb_purge_rseg_truncate_frequency. For example:

mysql> SET GLOBAL innodb_purge_rseg_truncate_frequency=32;

Performance Impact of Truncating Undo Tablespace Files Online

While an undo tablespace is truncated, rollback segments in that tablespace are temporarily deactivated. The remaining active rollback segments in the other undo tablespaces assume responsibility for the entire system load, which may result in a slight performance degradation. The degree of performance degradation depends on a number of factors, including:

  • undo表空间的个数

  • undo日志的个数

  • Undo表空间的大小

  • I/O 子系统的速度

  • 存在的长时间运行的事务

  • 系统负载

15.7.9 InnoDB普通表空间

普通表空间是使用 CREATE TABLESPACE 语法创建的共享InnoDB表空间。本节将介绍普通表空间的能力和特性:

普通表空间的能力

普通表空间的特性提供以下能力:

  • 和系统表空间类似,普通表空间是共享的表空间,它可以存储多个表的数据。

  • General tablespaces have a potential memory advantage over file-per-table tablespaces. The server keeps tablespace metadata in memory for the lifetime of a tablespace. Multiple tables in fewer general tablespaces consume less memory for tablespace metadata than the same number of tables in separate file-per-table tablespaces.

  • General tablespace data files may be placed in a directory relative to or independent of the MySQL data directory, which provides you with many of the data file and storage management capabilities of file-per-table tablespaces. As with file-per-table tablespaces, the ability to place data files outside of the MySQL data directory allows you to manage performance of critical tables separately, setup RAID or DRBD for specific tables, or bind tables to particular disks, for example.

  • General tablespaces support both Antelope and Barracuda file formats, and therefore support all table row formats and associated features. With support for both file formats, general tablespaces have no dependence on innodb_file_format or innodb_file_per_table settings, nor do these variables have any effect on general tablespaces.

  • The TABLESPACE option can be used with CREATE TABLE to create tables in a general tablespace, file-per-table tablespace, or in the system tablespace.

  • The TABLESPACE option can be used with ALTER TABLE to move tables between general tablespaces, file-per-table tablespaces, and the system tablespace. Previously, it was not possible to move a table from a file-per-table tablespace to the system tablespace. With the general tablespace feature, you can now do so.

创建一个普通表空间

General tablespaces are created using CREATE TABLESPACE syntax.

CREATE TABLESPACE tablespace_name
    ADD DATAFILE 'file_name'
    [FILE_BLOCK_SIZE = value]
        [ENGINE [=] engine_name]

A general tablespace may be created in the MySQL data directory or in a directory outside of the MySQL data directory. To avoid conflicts with implicitly created file-per-table tablespaces, creating a general tablespace in a subdirectory under the MySQL data directory is not supported. Also, when creating a general tablespace outside of the MySQL data directory, the directory must exist prior to creating the tablespace.

Examples:

Creating a general tablespace in the MySQL data directory:

mysql> CREATE TABLESPACE `ts1` ADD DATAFILE 'ts1.ibd' Engine=InnoDB;

Creating a general tablespace in a directory outside of the MySQL data directory:

mysql> CREATE TABLESPACE `ts1` ADD DATAFILE '/my/tablespace/directory/ts1.ibd' Engine=InnoDB;

You can specify a path that is relative to the MySQL data directory as long as the tablespace directory is not under the MySQL data directory. In this example, the my_tablespace directory is at the same level as the MySQL data directory:

mysql> CREATE TABLESPACE `ts1` ADD DATAFILE '../my_tablespace/ts1.ibd' Engine=InnoDB;
Note

The ENGINE = InnoDB clause must be defined as part of the CREATE TABLESPACE statement or InnoDB must be defined as the default storage engine (default_storage_engine=InnoDB).

Adding Tables to a General Tablespace

After creating an InnoDB general tablespace, you can use CREATE TABLE tbl_name ... TABLESPACE [=] tablespace_name or ALTER TABLE tbl_name TABLESPACE [=] tablespace_name to add tables to the tablespace, as shown in the following examples:

CREATE TABLE:

mysql> CREATE TABLE t1 (c1 INT PRIMARY KEY) TABLESPACE ts1 ROW_FORMAT=COMPACT;

ALTER TABLE:

mysql> ALTER TABLE t2 TABLESPACE ts1;

For detailed syntax information, see CREATE TABLE and ALTER TABLE.

General Tablespace Row Format Support

General tablespaces support all table row formats (REDUNDANT, COMPACT, DYNAMIC, COMPRESSED) with the caveat that compressed and uncompressed tables cannot coexist in the same general tablespace due to different physical page sizes.

For a general tablespace to contain compressed tables (ROW_FORMAT=COMPRESSED), FILE_BLOCK_SIZE must be specified, and the FILE_BLOCK_SIZE value must be a valid compressed page size in relation to the innodb_page_size value. Also, the physical page size of the compressed table (KEY_BLOCK_SIZE) must be equal to FILE_BLOCK_SIZE/1024. For example, if innodb_page_size=16K and FILE_BLOCK_SIZE=8K, the KEY_BLOCK_SIZE of the table must be 8.

The following table shows permitted FILE_BLOCK_SIZE and KEY_BLOCK_SIZE values for each innodb_page_size value. FILE_BLOCK_SIZE values may also be specified in bytes. To determine a valid KEY_BLOCK_SIZE value for a given FILE_BLOCK_SIZE, divide the FILE_BLOCK_SIZE value by 1024. Table compression is not supported for 32K and 64K InnoDB page sizes. For more information about KEY_BLOCK_SIZE, see CREATE TABLE, and Section 15.9.1.2, “Creating Compressed Tables”.

Table 15.5 FILE_BLOCK_SIZE and KEY_BLOCK_SIZE Values for CREATE TABLESPACE

InnoDB Page Size | Permitted FILE_BLOCK_SIZE Values | Permitted KEY_BLOCK_SIZE Values
-----------------+----------------------------------+--------------------------------
64K              | 64K (65536)                      | Compression is not supported
32K              | 32K (32768)                      | Compression is not supported
16K              | 16K (16384)                      | N/A (see note)
16K              | 8K (8192)                        | 8
16K              | 4K (4096)                        | 4
16K              | 2K (2048)                        | 2
16K              | 1K (1024)                        | 1
8K               | 8K (8192)                        | N/A (see note)
8K               | 4K (4096)                        | 4
8K               | 2K (2048)                        | 2
8K               | 1K (1024)                        | 1
4K               | 4K (4096)                        | N/A (see note)
4K               | 2K (2048)                        | 2
4K               | 1K (1024)                        | 1

Note: If innodb_page_size is equal to FILE_BLOCK_SIZE, the tablespace cannot contain a compressed table.

This example demonstrates creating a general tablespace and adding a compressed table. The example assumes a default innodb_page_size of 16K. The FILE_BLOCK_SIZE of 8192 requires that the compressed table have a KEY_BLOCK_SIZE of 8.

mysql> CREATE TABLESPACE `ts2` ADD DATAFILE 'ts2.ibd' FILE_BLOCK_SIZE = 8192 Engine=InnoDB;
Query OK, 0 rows affected (0.01 sec)

mysql> CREATE TABLE t4 (c1 INT PRIMARY KEY) TABLESPACE ts2 ROW_FORMAT=COMPRESSED
KEY_BLOCK_SIZE=8;
Query OK, 0 rows affected (0.00 sec)

If you do not specify FILE_BLOCK_SIZE when creating a general tablespace, FILE_BLOCK_SIZE defaults to innodb_page_size. When FILE_BLOCK_SIZE is equal to innodb_page_size, the tablespace may only contain tables with an uncompressed row format (COMPACT, REDUNDANT, and DYNAMIC row formats).
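
The relationship described above (valid FILE_BLOCK_SIZE values per page size, and KEY_BLOCK_SIZE = FILE_BLOCK_SIZE/1024) can be sketched as a small lookup. This is an illustrative helper reproducing Table 15.5, not MySQL's actual validation code:

```python
# Sketch of the FILE_BLOCK_SIZE / KEY_BLOCK_SIZE rules from Table 15.5.

def permitted_file_block_sizes(innodb_page_size: int) -> list[int]:
    """Permitted FILE_BLOCK_SIZE values (bytes) for a given innodb_page_size."""
    if innodb_page_size in (65536, 32768):
        return [innodb_page_size]          # 32K/64K pages: compression unsupported
    # 16K, 8K, 4K pages: the page size itself plus smaller sizes down to 1K
    sizes, s = [], innodb_page_size
    while s >= 1024:
        sizes.append(s)
        s //= 2
    return sizes

def required_key_block_size(innodb_page_size: int, file_block_size: int):
    """KEY_BLOCK_SIZE (in KB) a compressed table in the tablespace must use,
    or None if the tablespace cannot contain compressed tables."""
    if file_block_size not in permitted_file_block_sizes(innodb_page_size):
        raise ValueError("invalid FILE_BLOCK_SIZE for this page size")
    if file_block_size == innodb_page_size:
        return None                        # only uncompressed row formats allowed
    return file_block_size // 1024

# innodb_page_size=16K with FILE_BLOCK_SIZE=8K requires KEY_BLOCK_SIZE=8
print(required_key_block_size(16384, 8192))   # 8
print(required_key_block_size(16384, 16384))  # None
```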

Moving Non-Partitioned Tables Between Tablespaces Using ALTER TABLE

You can use ALTER TABLE with the TABLESPACE option to move a non-partitioned InnoDB table to an existing general tablespace, to a new file-per-table tablespace, or to the system tablespace.

To move a non-partitioned table from a file-per-table tablespace or from the system tablespace to a general tablespace, specify the name of the general tablespace. The general tablespace must exist. See CREATE TABLESPACE for more information.

ALTER TABLE tbl_name TABLESPACE [=] tablespace_name

To move a non-partitioned table from a general tablespace or file-per-table tablespace to the system tablespace, specify innodb_system as the tablespace name.

ALTER TABLE tbl_name ... TABLESPACE [=] innodb_system

To move a non-partitioned table from the system tablespace or a general tablespace to a file-per-table tablespace, specify innodb_file_per_table as the tablespace name.

ALTER TABLE tbl_name ... TABLESPACE [=] innodb_file_per_table

ALTER TABLE ... TABLESPACE operations always cause a full table rebuild, even if the TABLESPACE attribute has not changed from its previous value.

ALTER TABLE ... TABLESPACE syntax does not support moving a table from a temporary tablespace to a persistent tablespace.

The DATA DIRECTORY clause is permitted with CREATE TABLE ... TABLESPACE=innodb_file_per_table but is otherwise not supported for use in combination with the TABLESPACE option.

General Tablespace Table Partition Support

The TABLESPACE option may be used to assign individual table partitions or subpartitions to a general tablespace, a separate file-per-table tablespace, or the system tablespace. All partitions must belong to the same storage engine. Usage is demonstrated in the following examples.

mysql> CREATE TABLESPACE `ts1` ADD DATAFILE 'ts1.ibd' Engine=InnoDB;
mysql> CREATE TABLESPACE `ts2` ADD DATAFILE 'ts2.ibd' Engine=InnoDB;

mysql> CREATE TABLE t1 (a INT, b INT) ENGINE = InnoDB
    ->  PARTITION BY RANGE(a) SUBPARTITION BY KEY(b) (
    ->    PARTITION p1 VALUES LESS THAN (100) TABLESPACE=`ts1`,
    ->    PARTITION p2 VALUES LESS THAN (1000) TABLESPACE=`ts2`,
    ->    PARTITION p3 VALUES LESS THAN (10000) TABLESPACE `innodb_file_per_table`,
    ->    PARTITION p4 VALUES LESS THAN (100000) TABLESPACE `innodb_system`);

mysql> CREATE TABLE t2 (a INT, b INT) ENGINE = InnoDB
    ->  PARTITION BY RANGE(a) SUBPARTITION BY KEY(b) (
    ->    PARTITION p1 VALUES LESS THAN (100) TABLESPACE=`ts1`
    ->      (SUBPARTITION sp1,
    ->       SUBPARTITION sp2),
    ->    PARTITION p2 VALUES LESS THAN (1000)
    ->      (SUBPARTITION sp3,
    ->       SUBPARTITION sp4 TABLESPACE=`ts2`),
    ->    PARTITION p3 VALUES LESS THAN (10000)
    ->      (SUBPARTITION sp5 TABLESPACE `innodb_system`,
    ->       SUBPARTITION sp6 TABLESPACE `innodb_file_per_table`));

The TABLESPACE option is also supported with ALTER TABLE.

mysql> ALTER TABLE t1 ADD PARTITION (PARTITION p5 VALUES LESS THAN (1000000) TABLESPACE = `ts1`);
Note

If the TABLESPACE = tablespace_name option is not defined, the ALTER TABLE ... ADD PARTITION operation adds the partition to the table's default tablespace, which can be specified at the table level during CREATE TABLE or ALTER TABLE.

An ALTER TABLE tbl_name TABLESPACE [=] tablespace_name operation on a partitioned table only modifies the table's default tablespace. It does not move the table partitions. However, after changing the default tablespace, an operation that rebuilds the table, such as an ALTER TABLE operation that uses ALGORITHM=COPY, moves the partitions to the default tablespace if another tablespace is not defined explicitly using the TABLESPACE [=] tablespace_name clause.

To verify that partitions were placed in the specified tablespaces, you can query INFORMATION_SCHEMA.INNODB_SYS_TABLES:

mysql> SELECT NAME, SPACE, SPACE_TYPE FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES
    -> WHERE NAME LIKE '%t1%';
+-----------------------+-------+------------+
| NAME                  | SPACE | SPACE_TYPE |
+-----------------------+-------+------------+
| test/t1#P#p1#SP#p1sp0 |    57 | General    |
| test/t1#P#p2#SP#p2sp0 |    58 | General    |
| test/t1#P#p3#SP#p3sp0 |    59 | Single     |
| test/t1#P#p4#SP#p4sp0 |     0 | System     |
| test/t1#P#p5#SP#p5sp0 |    57 | General    |
+-----------------------+-------+------------+

mysql> SELECT NAME, SPACE, SPACE_TYPE FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES
    -> WHERE NAME LIKE '%t2%';
+---------------------+-------+------------+
| NAME                | SPACE | SPACE_TYPE |
+---------------------+-------+------------+
| test/t2#P#p1#SP#sp1 |    57 | General    |
| test/t2#P#p1#SP#sp2 |    57 | General    |
| test/t2#P#p2#SP#sp3 |    60 | Single     |
| test/t2#P#p2#SP#sp4 |    58 | General    |
| test/t2#P#p3#SP#sp5 |     0 | System     |
| test/t2#P#p3#SP#sp6 |    61 | Single     |
+---------------------+-------+------------+

Moving Table Partitions Between Tablespaces Using ALTER TABLE

To move table partitions to a different tablespace, you must move each partition using an ALTER TABLE tbl_name REORGANIZE PARTITION statement.

The following example demonstrates how to move table partitions to a different tablespace. INFORMATION_SCHEMA.INNODB_SYS_TABLES and INFORMATION_SCHEMA.INNODB_SYS_TABLESPACES are queried to verify that partitions are placed in the expected tablespace.

Note

If the TABLESPACE = tablespace_name option is not defined in the REORGANIZE PARTITION statement, InnoDB moves the partition to the table's default tablespace. In the example that follows, tablespace ts1, which is defined at the table level, is the default tablespace for table t1. Partition P3 is moved from the system tablespace to tablespace ts1 since no TABLESPACE option is specified in the ALTER TABLE t1 REORGANIZE PARTITION statement for partition P3.

An operation that rebuilds the table, such as an ALTER TABLE operation that uses ALGORITHM=COPY, moves partitions to the default tablespace if partitions reside in a different tablespace that is not defined explicitly using the TABLESPACE [=] tablespace_name clause.

mysql> CREATE TABLESPACE ts1 ADD DATAFILE 'ts1.ibd';
mysql> CREATE TABLESPACE ts2 ADD DATAFILE 'ts2.ibd';

mysql> CREATE TABLE t1 ( a INT NOT NULL, PRIMARY KEY (a))
    ->  ENGINE=InnoDB TABLESPACE ts1                          
    ->  PARTITION BY RANGE (a) PARTITIONS 3 (
    ->    PARTITION P1 VALUES LESS THAN (2),
    ->    PARTITION P2 VALUES LESS THAN (4) TABLESPACE `innodb_file_per_table`,
    ->    PARTITION P3 VALUES LESS THAN (6) TABLESPACE `innodb_system`);


mysql> SELECT A.NAME as partition_name, A.SPACE_TYPE as space_type, B.NAME as space_name
    -> FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES A
    -> LEFT JOIN INFORMATION_SCHEMA.INNODB_SYS_TABLESPACES B
    -> ON A.SPACE = B.SPACE WHERE A.NAME LIKE '%t1%' ORDER BY A.NAME;
+----------------+------------+--------------+
| partition_name | space_type | space_name   |
+----------------+------------+--------------+
| test/t1#P#P1   | General    | ts1          |
| test/t1#P#P2   | Single     | test/t1#P#P2 |
| test/t1#P#P3   | System     | NULL         |
+----------------+------------+--------------+

mysql> ALTER TABLE t1 REORGANIZE PARTITION P1
    -> INTO (PARTITION P1 VALUES LESS THAN (2) TABLESPACE = `ts2`);
  
mysql> ALTER TABLE t1 REORGANIZE PARTITION P2
    -> INTO (PARTITION P2 VALUES LESS THAN (4) TABLESPACE = `ts2`);
  
mysql> ALTER TABLE t1 REORGANIZE PARTITION P3
    -> INTO (PARTITION P3 VALUES LESS THAN (6));

mysql> SELECT A.NAME AS partition_name, A.SPACE_TYPE AS space_type, B.NAME AS space_name
    -> FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES A
    -> LEFT JOIN INFORMATION_SCHEMA.INNODB_SYS_TABLESPACES B
    -> ON A.SPACE = B.SPACE WHERE A.NAME LIKE '%t1%' ORDER BY A.NAME;
+----------------+------------+------------+
| partition_name | space_type | space_name |
+----------------+------------+------------+
| test/t1#P#P1   | General    | ts2        |
| test/t1#P#P2   | General    | ts2        |
| test/t1#P#P3   | General    | ts1        |
+----------------+------------+------------+

Dropping a General Tablespace

The DROP TABLESPACE statement is used to drop an InnoDB general tablespace.

All tables must be dropped from the tablespace prior to a DROP TABLESPACE operation. If the tablespace is not empty, DROP TABLESPACE returns an error.

A general InnoDB tablespace is not deleted automatically when the last table in the tablespace is dropped. The tablespace must be dropped explicitly using DROP TABLESPACE tablespace_name.

A general tablespace does not belong to any particular database. A DROP DATABASE operation can drop tables that belong to a general tablespace but it cannot drop the tablespace, even if the DROP DATABASE operation drops all tables that belong to the tablespace. A general tablespace must be dropped explicitly using DROP TABLESPACE tablespace_name.

Similar to the system tablespace, truncating or dropping tables stored in a general tablespace creates free space internally in the general tablespace .ibd data file which can only be used for new InnoDB data. Space is not released back to the operating system as it is when a file-per-table tablespace is deleted during a DROP TABLE operation.

This example demonstrates how to drop an InnoDB general tablespace. The general tablespace ts1 is created with a single table. The table must be dropped before dropping the tablespace.

mysql> CREATE TABLESPACE `ts1` ADD DATAFILE 'ts1.ibd' Engine=InnoDB;
Query OK, 0 rows affected (0.01 sec)

mysql> CREATE TABLE t1 (c1 INT PRIMARY KEY) TABLESPACE ts1 Engine=InnoDB;
Query OK, 0 rows affected (0.02 sec)

mysql> DROP TABLE t1;
Query OK, 0 rows affected (0.01 sec)

mysql> DROP TABLESPACE ts1;
Query OK, 0 rows affected (0.01 sec)
Note

tablespace_name is a case-sensitive identifier in MySQL.

General Tablespace Limitations

  • A generated or existing tablespace cannot be changed to a general tablespace.

  • Creation of temporary general tablespaces is not supported.

  • General tablespaces do not support temporary tables.

  • Similar to the system tablespace, truncating or dropping tables stored in a general tablespace creates free space internally in the general tablespace .ibd data file which can only be used for new InnoDB data. Space is not released back to the operating system as it is for file-per-table tablespaces.

    Additionally, a table-copying ALTER TABLE operation on a table that resides in a shared tablespace (a general tablespace or the system tablespace) can increase the amount of space used by the tablespace. Such operations require as much additional space as the data in the table plus indexes. The additional space required for the table-copying ALTER TABLE operation is not released back to the operating system as it is for file-per-table tablespaces.

  • ALTER TABLE ... DISCARD TABLESPACE and ALTER TABLE ... IMPORT TABLESPACE are not supported for tables that belong to a general tablespace.

For more information, see Section 13.1.16, “CREATE TABLESPACE Syntax”.

15.7.10 InnoDB Tablespace Encryption

InnoDB supports data encryption for InnoDB tables stored in file-per-table tablespaces. This feature provides at-rest encryption for physical tablespace data files.

InnoDB tablespace encryption uses a two tier encryption key architecture, consisting of a master encryption key and tablespace keys. When an InnoDB table is encrypted, a tablespace key is encrypted and stored in the tablespace header. When an application or authenticated user wants to access encrypted tablespace data, InnoDB uses the master encryption key to decrypt the tablespace key. The master encryption key is stored in a keyring file in the location specified by the keyring_file_data configuration option. The decrypted version of a tablespace key never changes, but the master encryption key may be changed as required. This action is referred to as master key rotation.

The InnoDB tablespace encryption feature relies on a keyring plugin for master encryption key management.

InnoDB tablespace encryption supports the Advanced Encryption Standard (AES) block-based encryption algorithm. It uses Electronic Codebook (ECB) block encryption mode for tablespace key encryption and Cipher Block Chaining (CBC) block encryption mode for data encryption.
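
The two-tier architecture above can be sketched in a few lines. This is a conceptual toy: a repeating-key XOR stands in for AES (it is NOT real encryption), and all key names are illustrative. The point it demonstrates is that rotating the master key re-encrypts only the stored tablespace key, never the data pages:

```python
# Toy envelope-encryption sketch of the two-tier key architecture.
# XOR is a stand-in for AES; do not use this for real encryption.
import os

def xor(data: bytes, key: bytes) -> bytes:
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

tablespace_key = os.urandom(32)     # encrypts the tablespace data
master_key_v1 = os.urandom(32)      # lives in the keyring, outside the tablespace

# The tablespace header stores the tablespace key encrypted under the master key.
header = xor(tablespace_key, master_key_v1)
data_on_disk = xor(b"row data", tablespace_key)

# "Master key rotation": decrypt the tablespace key with the old master key,
# re-encrypt it with the new one. Data pages are untouched.
master_key_v2 = os.urandom(32)
header = xor(xor(header, master_key_v1), master_key_v2)

# Decryption still works via the new master key.
recovered_key = xor(header, master_key_v2)
print(xor(data_on_disk, recovered_key))  # b'row data'
```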

Note

The InnoDB tablespace encryption feature provided with MySQL Community Edition is not intended as a regulatory compliance solution. Security standards such as PCI, FIPS, and others require use of key management systems to secure, manage, and protect keys in key vaults or hardware security modules (HSMs).

For frequently asked questions about the InnoDB tablespace encryption feature, see Section A.15, “MySQL 8.0 FAQ: InnoDB Tablespace Encryption”.

InnoDB Tablespace Encryption Prerequisites

  • The keyring_file plugin must be installed. Keyring plugin installation is performed at startup using the --early-plugin-load option. Early loading ensures that the plugin is available prior to initialization of the InnoDB storage engine. For keyring plugin installation and configuration instructions, see Section 6.5.4, “The MySQL Keyring”.

    Important

    Once encrypted tables are created in a MySQL instance, the keyring plugin must continue to be loaded using the early-plugin-load option, prior to InnoDB initialization. Failing to do so results in errors on startup and during InnoDB recovery.

    To verify that the keyring plugin is active, use the SHOW PLUGINS statement or query the INFORMATION_SCHEMA.PLUGINS table. For example:

    mysql> SELECT PLUGIN_NAME, PLUGIN_STATUS
           FROM INFORMATION_SCHEMA.PLUGINS
           WHERE PLUGIN_NAME LIKE 'keyring%';
    +--------------+---------------+
    | PLUGIN_NAME  | PLUGIN_STATUS |
    +--------------+---------------+
    | keyring_file | ACTIVE        |
    +--------------+---------------+
    
  • The innodb_file_per_table option must be enabled (the default). InnoDB tablespace encryption only supports tables stored in file-per-table tablespaces. Alternatively, when creating an encrypted table or altering an existing table to be encrypted, you can specify the TABLESPACE='innodb_file_per_table' option for the table.

  • Before using the InnoDB tablespace encryption feature, ensure that you have taken steps to prevent loss of the master encryption key. If the master encryption key is lost, data stored in encrypted tablespace files is unrecoverable. It is recommended that you create a backup of the keyring file immediately after creating the first encrypted table and before and after master key rotation. The keyring file location is defined by the keyring_file_data configuration option. For keyring plugin configuration information, see Section 6.5.4, “The MySQL Keyring”.

Enabling and Disabling InnoDB Tablespace Encryption

To enable encryption for a new InnoDB table, specify the ENCRYPTION option in a CREATE TABLE statement.

mysql> CREATE TABLE t1 (c1 INT) ENCRYPTION='Y';

To encrypt an existing InnoDB table, specify the ENCRYPTION option in an ALTER TABLE statement.

mysql> ALTER TABLE t1 ENCRYPTION='Y';

To disable encryption for an InnoDB table, set ENCRYPTION='N' using an ALTER TABLE statement.

mysql> ALTER TABLE t1 ENCRYPTION='N';
Note

Plan the encryption of an existing table for an appropriate time, because an ALTER TABLE ... ENCRYPTION operation rebuilds the table using ALGORITHM=COPY. ALGORITHM=INPLACE is not supported.

Redo Log Data Encryption

Redo log data encryption is enabled using the innodb_redo_log_encrypt configuration option. Redo log encryption is disabled by default.

As with tablespace data, redo log data encryption occurs when redo log data is written to disk, and decryption occurs when redo log data is read from disk. Once redo log data is read into memory, it is in unencrypted form. Redo log data is encrypted and decrypted using the tablespace encryption key.

When innodb_redo_log_encrypt is enabled, unencrypted redo log pages that are present on disk remain unencrypted, and new redo log pages are written to disk in encrypted form. Likewise, when innodb_redo_log_encrypt is disabled, encrypted redo log pages that are present on disk remain encrypted, and new redo log pages are written to disk in unencrypted form.

Redo log encryption metadata, including the tablespace encryption key, is stored in the header of the first redo log file (ib_logfile0). If this file is removed, redo log encryption is disabled.

Once redo log encryption is enabled, a normal restart without the keyring plugin or without the encryption key is not possible, because InnoDB must be able to scan redo pages during startup, which is not possible if the redo log pages are encrypted. Without the keyring plugin or the encryption key, only a forced startup without the redo logs is possible. See Section 15.20.2, “Forcing InnoDB Recovery”.

Undo Log Data Encryption

Undo log data encryption is enabled using the innodb_undo_log_encrypt configuration option. Undo log encryption applies to undo logs that reside in undo tablespaces. It cannot be used for undo log data that resides in the system tablespace. Undo log data encryption is disabled by default.

As with tablespace data, undo log data encryption occurs when undo log data is written to disk, and decryption occurs when undo log data is read from disk. Once undo log data is read into memory, it is in unencrypted form. Undo log data is encrypted and decrypted using the tablespace encryption key.

When innodb_undo_log_encrypt is enabled, unencrypted undo log pages that are present on disk remain unencrypted, and new undo log pages are written to disk in encrypted form. Likewise, when innodb_undo_log_encrypt is disabled, encrypted undo log pages that are present on disk remain encrypted, and new undo log pages are written to disk in unencrypted form.

Undo log encryption metadata, including the tablespace encryption key, is stored in the header of the undo log file (undoN.ibd, where N is the space ID).

InnoDB Tablespace Encryption and Master Key Rotation

The master encryption key should be rotated periodically and whenever you suspect that the key may have been compromised.

Master key rotation is an atomic, instance-level operation. Each time the master encryption key is rotated, all tablespace keys in the MySQL instance are re-encrypted and saved back to their respective tablespace headers. As an atomic operation, re-encryption must succeed for all tablespace keys once a rotation operation is initiated. If master key rotation is interrupted by a server failure, InnoDB rolls the operation forward on server restart. For more information, see InnoDB Tablespace Encryption and Recovery.

Rotating the master encryption key only changes the master encryption key and re-encrypts tablespace keys. It does not decrypt or re-encrypt associated tablespace data.

Rotating the master encryption key requires the ENCRYPTION_KEY_ADMIN or SUPER privilege.

To rotate the master encryption key, run:

mysql> ALTER INSTANCE ROTATE INNODB MASTER KEY;

ALTER INSTANCE ROTATE INNODB MASTER KEY supports concurrent DML. However, it cannot be run concurrently with CREATE TABLE ... ENCRYPTED or ALTER TABLE ... ENCRYPTED operations, and locks are taken to prevent conflicts that could arise from concurrent execution of these statements. If one of the conflicting statements is running, it must complete before another can proceed.

InnoDB Tablespace Encryption and Recovery

If a server failure occurs during master key rotation, InnoDB continues the operation on server restart.

The keyring plugin must be loaded prior to storage engine initialization so that the information necessary to decrypt tablespace data pages can be retrieved from tablespace headers before InnoDB initialization and recovery activities access tablespace data. (See InnoDB Tablespace Encryption Prerequisites.)

When InnoDB initialization and recovery begin, the master key rotation operation resumes. Due to the server failure, some tablespace keys may already be encrypted using the new master encryption key. InnoDB reads the encryption data from each tablespace header, and if the data indicates that the tablespace key is encrypted using the old master encryption key, InnoDB retrieves the old key from the keyring and uses it to decrypt the tablespace key. InnoDB then re-encrypts the tablespace key using the new master encryption key and saves the re-encrypted tablespace key back to the tablespace header.

Exporting Encrypted Tables

When an encrypted table is exported, InnoDB generates a transfer key that is used to encrypt the tablespace key. The encrypted tablespace key and transfer key are stored in a tablespace_name.cfp file. This file, together with the tablespace data file, is required to perform an import operation. On import, InnoDB uses the transfer key to decrypt the tablespace key in the tablespace_name.cfp file. For related information, see Section 15.7.6, “Copying File-Per-Table Tablespaces to Another Instance”.

InnoDB Tablespace Encryption and Replication

  • The ALTER INSTANCE ROTATE INNODB MASTER KEY statement is only supported in replication environments where the master and slave servers run a version of MySQL that supports tablespace encryption.

  • Successful ALTER INSTANCE ROTATE INNODB MASTER KEY statements are written to the binary log for replication on slave servers.

  • If an ALTER INSTANCE ROTATE INNODB MASTER KEY statement fails, it is not logged to the binary log and is not replicated on slave servers.

  • Replication of an ALTER INSTANCE ROTATE INNODB MASTER KEY operation fails if the keyring plugin is installed on the master server but not on the slave.

  • If the keyring plugin is installed on both the master and a slave, but the slave does not have a keyring data file, the replicated ALTER INSTANCE ROTATE INNODB MASTER KEY statement creates the keyring data file on the slave, assuming the keyring file data is not cached in memory. Keyring file data that is cached in memory is used by ALTER INSTANCE ROTATE INNODB MASTER KEY if available.

Identifying Tables that Use InnoDB Tablespace Encryption

当在 CREATE TABLEALTER TABLE语句中指定了ENCRYPTION选项,那么它就会记录在 INFORMATION_SCHEMA.TABLES表的 CREATE_OPTIONS 列。我们可以在MySQL实例中查询这个列。

mysql> SELECT TABLE_SCHEMA, TABLE_NAME, CREATE_OPTIONS FROM INFORMATION_SCHEMA.TABLES
    -> WHERE CREATE_OPTIONS LIKE '%ENCRYPTION="Y"%';
+--------------+------------+----------------+
| TABLE_SCHEMA | TABLE_NAME | CREATE_OPTIONS |
+--------------+------------+----------------+
| test         | t1         | ENCRYPTION="Y" |
+--------------+------------+----------------+

InnoDB Tablespace Encryption Usage Notes

  • If the server exits or is stopped during normal operation, it is recommended to restart the server using the same encryption settings that were configured previously.

  • The first master encryption key is generated when the first new or existing table is encrypted.

  • Master key rotation re-encrypts tablespace keys but does not change the tablespace key itself. To change a tablespace key, you must disable and re-enable encryption for the table using ALTER TABLE tbl_name ENCRYPTION, which is an ALGORITHM=COPY operation that rebuilds the table.

  • If a table is created with both the COMPRESSION and ENCRYPTION options, compression is performed before tablespace data is encrypted.

  • If a keyring file is empty or missing, the first execution of ALTER INSTANCE ROTATE INNODB MASTER KEY creates a master encryption key.

  • Uninstalling the keyring_file plugin does not remove an existing keyring file.

  • It is recommended that you not place the keyring file in the same directory as tablespace data files. The keyring file location is defined by the keyring_file_data option.

  • Modifying the keyring_file_data setting at runtime or when restarting the server can cause previously encrypted tables to become inaccessible, resulting in lost data.

InnoDB Tablespace Encryption Limitations

  • Only the Advanced Encryption Standard (AES) block-based encryption algorithm is supported. InnoDB tablespace encryption uses Electronic Codebook (ECB) block encryption mode for tablespace key encryption and Cipher Block Chaining (CBC) block encryption mode for data encryption.

  • Altering the ENCRYPTION attribute of a table is an ALGORITHM=COPY operation. ALGORITHM=INPLACE is not supported.

  • InnoDB tablespace encryption only supports InnoDB tables stored in file-per-table tablespaces. Tables stored in other InnoDB tablespace types, including general tablespaces, the system tablespace, undo log tablespaces, and temporary tablespaces, are not supported.

  • You cannot move or copy a table from an encrypted file-per-table tablespace to a tablespace type or an instance that does not support InnoDB tablespace encryption.

  • By default, tablespace encryption only applies to data in the tablespace. Redo log and undo log data can be encrypted using the innodb_redo_log_encrypt and innodb_undo_log_encrypt options. See Redo Log Data Encryption and Undo Log Data Encryption. Binary log data is not encrypted.

15.8 InnoDB Tables and Indexes

This section covers topics related to InnoDB tables and indexes.

15.8.1 InnoDB Tables

Topics in this section relate to InnoDB tables.

15.8.1.1 Creating InnoDB Tables

InnoDB tables are created using the CREATE TABLE statement.

CREATE TABLE t1 (a INT, b CHAR (20), PRIMARY KEY (a)) ENGINE=InnoDB;

You do not need to specify the ENGINE=InnoDB clause if InnoDB is defined as the default storage engine, which it is by default. To check the default storage engine, issue the following statement:

mysql> SELECT @@default_storage_engine;
+--------------------------+
| @@default_storage_engine |
+--------------------------+
| InnoDB                   |
+--------------------------+

You might still use the ENGINE=InnoDB clause if you plan to use mysqldump or replication to replay the CREATE TABLE statement on a server where the default storage engine is not InnoDB.

An InnoDB table and its indexes can be created in the system tablespace, in a file-per-table tablespace, or in a general tablespace. When innodb_file_per_table is enabled, which is the default, an InnoDB table is implicitly created in an individual file-per-table tablespace. Conversely, when innodb_file_per_table is disabled, an InnoDB table is implicitly created in the InnoDB system tablespace. To create a table in a general tablespace, use CREATE TABLE ... TABLESPACE syntax. For more information, see 15.7.9 节, “InnoDB General Tablespaces”.

When you create a table in a file-per-table tablespace, MySQL creates an .ibd tablespace file in a database directory under the MySQL data directory, by default. A table created in the InnoDB system tablespace is created in an existing ibdata file, which resides in the MySQL data directory. A table created in a general tablespace is created in an existing general tablespace .ibd file. General tablespace files can be created inside or outside of the MySQL data directory. For more information, see 15.7.9 节, “InnoDB General Tablespaces”.

Internally, InnoDB adds an entry for each table to the data dictionary. The entry includes the database name. For example, if table t1 is created in the test database, the data dictionary entry for the database name is 'test/t1'. This means you can create a table of the same name (t1) in a different database, and the table names do not collide inside InnoDB.
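The 'database/table' entry naming described above is what keeps same-named tables apart inside InnoDB. A minimal sketch of the idea (illustrative only, not InnoDB's actual data dictionary code):

```python
# Data dictionary entries are keyed by 'database/table', so a table named t1
# in two different databases produces two distinct, non-colliding entries.
data_dictionary = {}
for db, tbl in (("test", "t1"), ("sales", "t1")):
    data_dictionary[f"{db}/{tbl}"] = {"engine": "InnoDB"}

print(sorted(data_dictionary))  # ['sales/t1', 'test/t1']
```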

InnoDB Tables and Row Formats

The default row format for InnoDB tables is defined by the innodb_default_row_format configuration option, which has a default value of DYNAMIC. Dynamic and Compressed row formats allow you to take advantage of InnoDB features such as table compression and efficient off-page storage of long column values. To use these row formats, innodb_file_per_table must be enabled (the default).

SET GLOBAL innodb_file_per_table=1;
CREATE TABLE t3 (a INT, b CHAR (20), PRIMARY KEY (a)) ROW_FORMAT=DYNAMIC;
CREATE TABLE t4 (a INT, b CHAR (20), PRIMARY KEY (a)) ROW_FORMAT=COMPRESSED;

Alternatively, you can use CREATE TABLE ... TABLESPACE syntax to create an InnoDB table in a general tablespace. General tablespaces support all row formats. For more information, see Section 15.7.9, “InnoDB General Tablespaces”.

CREATE TABLE t1 (c1 INT PRIMARY KEY) TABLESPACE ts1 ROW_FORMAT=DYNAMIC;

CREATE TABLE ... TABLESPACE syntax can also be used to create InnoDB tables with a Dynamic row format in the system tablespace, alongside tables with a Compact or Redundant row format.

CREATE TABLE t1 (c1 INT PRIMARY KEY) TABLESPACE = innodb_system ROW_FORMAT=DYNAMIC;

For more information about InnoDB row formats, see Section 15.10, “InnoDB Row Storage and Row Formats”. For how to determine the row format of an InnoDB table and the physical characteristics of InnoDB row formats, see Section 15.8.1.2, “The Physical Row Structure of an InnoDB Table”.

InnoDB Tables and Primary Keys

Always define a primary key for an InnoDB table, specifying the column or columns that:

  • Are referenced by the most important queries.

  • Are never left blank.

  • Never have duplicate values.

  • Rarely if ever change value once inserted.

For example, in a table containing information about people, you would not create a primary key on (firstname, lastname) because more than one person can have the same name, some people have blank last names, and sometimes people change their names. With so many constraints, often there is not an obvious set of columns to use as a primary key, so you create a new column with a numeric ID to serve as all or part of the primary key. You can declare an auto-increment column so that ascending values are filled in automatically as rows are inserted:

-- The value of ID can act like a pointer between related items in different tables.
CREATE TABLE t5 (id INT AUTO_INCREMENT, b CHAR (20), PRIMARY KEY (id));

-- The primary key can consist of more than one column. Any autoinc column must come first.
CREATE TABLE t6 (id INT AUTO_INCREMENT, a INT, b CHAR (20), PRIMARY KEY (id,a));

Although a table works correctly without defining a primary key, the primary key is involved with many aspects of performance and is a crucial design aspect for any large or frequently used table. It is recommended that you always specify a primary key in the CREATE TABLE statement. If you create the table, load data, and then run ALTER TABLE to add a primary key later, that operation is much slower than defining the primary key when creating the table.

Viewing InnoDB Table Properties

To view the properties of an InnoDB table, issue a SHOW TABLE STATUS statement:

mysql> SHOW TABLE STATUS FROM test LIKE 't%' \G
*************************** 1. row ***************************
           Name: t1
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 0
 Avg_row_length: 0
    Data_length: 16384
Max_data_length: 0
   Index_length: 0
      Data_free: 0
 Auto_increment: NULL
    Create_time: 2015-03-16 15:13:31
    Update_time: NULL
     Check_time: NULL
      Collation: utf8mb4_0900_ai_ci
       Checksum: NULL
 Create_options:
        Comment:

InnoDB table properties may also be queried using the InnoDB Information Schema system tables:

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES WHERE NAME='test/t1' \G
*************************** 1. row ***************************
     TABLE_ID: 45
         NAME: test/t1
         FLAG: 1
       N_COLS: 5
        SPACE: 35
   ROW_FORMAT: Compact
ZIP_PAGE_SIZE: 0
   SPACE_TYPE: Single

15.8.1.2 The Physical Row Structure of an InnoDB Table

The physical row structure of an InnoDB table depends on the row format specified when the table is created. If a row format is not specified, the default row format is used, as defined by the innodb_default_row_format configuration option, which has a default value of DYNAMIC.

The following sections describe the characteristics of InnoDB row formats.

For more information about InnoDB row formats, see Section 15.10, “InnoDB Row Storage and Row Formats”.

Determining the Row Format of an InnoDB Table

To determine the row format of an InnoDB table, use SHOW TABLE STATUS. For example:

mysql> SHOW TABLE STATUS IN test1\G
*************************** 1. row ***************************
           Name: t1
         Engine: InnoDB
        Version: 10
     Row_format: Dynamic
           Rows: 0
 Avg_row_length: 0
    Data_length: 16384
Max_data_length: 0
   Index_length: 16384
      Data_free: 0
 Auto_increment: 1
    Create_time: 2016-09-14 16:29:38
    Update_time: NULL
     Check_time: NULL
      Collation: utf8mb4_0900_ai_ci
       Checksum: NULL
 Create_options: 
        Comment: 

You can also determine the row format of an InnoDB table by querying INFORMATION_SCHEMA.INNODB_SYS_TABLES.

mysql> SELECT NAME, ROW_FORMAT FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES WHERE NAME='test1/t1';
+----------+------------+
| NAME     | ROW_FORMAT |
+----------+------------+
| test1/t1 | Dynamic    |
+----------+------------+
Redundant Row Format Characteristics

The REDUNDANT format is available to retain compatibility with older versions of MySQL.

Tables that use the REDUNDANT row format have the following characteristics:

  • Each index record contains a 6-byte header. The header is used to link together consecutive records, and also for row-level locking.

  • Records in the clustered index contain fields for all user-defined columns. In addition, there is a 6-byte transaction ID field and a 7-byte roll pointer field.

  • If no primary key was defined for a table, each clustered index record also contains a 6-byte row ID field.

  • Each secondary index record contains all the primary key columns defined for the clustered index key that are not in the secondary index.

  • A record contains a pointer to each field of the record. If the total length of the fields in a record is less than 128 bytes, the pointer is one byte; otherwise, two bytes. The array of these pointers is called the record directory. The area where the pointers point is called the data part of the record.

  • Internally, InnoDB stores fixed-length character columns such as CHAR(10) in a fixed-length format. InnoDB does not truncate trailing spaces from VARCHAR columns.

  • InnoDB encodes fixed-length fields greater than or equal to 768 bytes in length as variable-length fields, which can be stored off-page. For example, a CHAR(255) column can exceed 768 bytes if the maximum byte length of the character set is greater than 3, as it is with utf8mb4.

  • An SQL NULL value reserves one or two bytes in the record directory. An SQL NULL value reserves zero bytes in the data part of the record if stored in a variable-length column. In a fixed-length column, it reserves the fixed length of the column in the data part of the record. Reserving the fixed space for NULL values enables an update of the column from NULL to a non-NULL value in place without causing fragmentation of the index page.
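
The record directory rule described above (1-byte pointers when the total field length is under 128 bytes, 2-byte pointers otherwise) can be sketched as follows. This is an illustrative calculation, not InnoDB's on-disk format code:

```python
# Sketch of the REDUNDANT-format record directory sizing rule.

def record_directory_size(field_lengths: list[int]) -> int:
    """Bytes used by the record directory: one pointer per field,
    1 byte each if the total field length is < 128 bytes, else 2 bytes each."""
    pointer_size = 1 if sum(field_lengths) < 128 else 2
    return pointer_size * len(field_lengths)

print(record_directory_size([4, 20, 50]))  # 74 bytes of data -> 3 x 1-byte pointers
print(record_directory_size([4, 200]))     # 204 bytes of data -> 2 x 2-byte pointers
```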

COMPACT Row Format Characteristics

The COMPACT row format decreases row storage space by about 20% compared to the REDUNDANT format, at the cost of increasing CPU use for some operations. If your workload is a typical one that is limited by cache hit rates and disk speed, the COMPACT format is likely to be faster. If the workload is limited by CPU speed, the COMPACT format might be slower.

Tables that use the COMPACT row format have the following characteristics:

  • 每个索引记录都含有一个5字节的头部,这个头部可能是变长的。投部用于将连续的记录链接在一起,同样还用于行级锁。

  • 记录头的可变长部分含有一个位向量来表示NULL列。 如果索引中可以被设置为NULL的列的数量是N,那么这个位向量占用 最大(N/8)字节。(举个例子,如果有9 到 15个列可以设置为NULL,那么,这个位向量使用2字节。) 为 NULL的列不占用空间,除了这个位向量。记录头的可变长部分还包含可变长度列的长度。每个长度占用1或2个字节。具体的是取决于列的最大长度。如果索引中的所有列都是 NOT NULL ,并且具有固定的长度,那么记录头就没有可变长部分。

  • 对于每个非-NULL 的变长字段。记录头用1或2个字节来包含列的长度。只有当列的部分被储存在溢出页的外部,或者最大长度超过了255个字节,并且实际长度超过了127个字节时,才需要2个字节。对于一个外部储存的列,2个字节表示内部储存部分的长度,加上指向外部储存部分的20个字节的指针。内部是768字节,因此长度就是768+20。20字节的指针储存你额的真实长度。

  • 记录头后面的是非-NULL 列的数据内容。

  • Records in the clustered index contain fields for all user-defined columns. In addition, there is a 6-byte transaction ID field and a 7-byte roll pointer field.

  • If no primary key is defined for a table, each clustered index record also contains a 6-byte row ID field.

  • Each secondary index record contains all the primary key columns defined for the clustered index that are not in the secondary index. If any of these primary key columns are variable length, the record header for each secondary index has a variable-length part to record their lengths, even if the secondary index is defined on fixed-length columns.

  • Internally, for nonvariable-length character sets, InnoDB stores fixed-length character columns such as CHAR(10) in a fixed-length format.

    InnoDB does not truncate trailing spaces from VARCHAR columns.

  • Internally, for variable-length character sets such as utf8mb3 and utf8mb4, InnoDB attempts to store CHAR(N) in N bytes by trimming trailing spaces. If the byte length of a CHAR(N) column value exceeds N bytes, trailing spaces are trimmed to a minimum of the column value byte length. The maximum length of a CHAR(N) column is the maximum character byte length × N.

    InnoDB reserves a minimum of N bytes for CHAR(N). Reserving the minimum space N in many cases enables column updates to be done in place, without causing fragmentation of the index page. By comparison, CHAR(N) columns occupy the maximum character byte length × N when using the REDUNDANT row format.

    Fixed-length columns greater than or equal to 768 bytes in length are encoded as variable-length fields, which can be stored off-page. For example, a CHAR(255) column can exceed 768 bytes if the maximum byte length of the character set is greater than 3, as it is with utf8mb4.

    ROW_FORMAT=DYNAMIC and ROW_FORMAT=COMPRESSED handle CHAR storage in the same way as ROW_FORMAT=COMPACT.
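The trailing-space behavior of CHAR versus VARCHAR described above can be observed directly. A minimal sketch, assuming a default sql_mode (CHAR values have trailing spaces removed on retrieval, while VARCHAR values keep them; the table name is hypothetical):

```sql
CREATE TABLE t_char (
  fixed_col CHAR(10) CHARACTER SET utf8mb4,
  var_col   VARCHAR(10) CHARACTER SET utf8mb4
) ENGINE=InnoDB ROW_FORMAT=COMPACT;

INSERT INTO t_char VALUES ('abc   ', 'abc   ');

-- With default sql_mode, CHAR retrieval strips trailing spaces
-- (typically 3 here), while VARCHAR preserves them (typically 6).
SELECT CHAR_LENGTH(fixed_col), CHAR_LENGTH(var_col) FROM t_char;
```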

15.8.1.3 Moving or Copying InnoDB Tables

This section describes techniques for moving or copying some or all InnoDB tables to a different server or instance. For example, you might move an entire MySQL instance to a larger, faster server; you might clone an entire MySQL instance to create a slave server; you might copy individual tables to another instance to develop and test an application, or to a data warehouse server to produce summary reports.

On Windows, InnoDB always stores database and table names internally in lowercase. To move databases in a binary format from Unix to Windows, or from Windows to Unix, create all databases and tables using lowercase names. A convenient way to accomplish this is to add the following line to the [mysqld] section of your my.cnf or my.ini file before creating any databases or tables:

[mysqld]
lower_case_table_names=1

Techniques for moving or copying InnoDB tables include:

Transportable Tablespaces

The transportable tablespaces feature uses FLUSH TABLES ... FOR EXPORT to ready InnoDB tables for copying from one server instance to another. To use this feature, InnoDB tables must be created in file-per-table tablespaces.

MySQL Enterprise Backup

The MySQL Enterprise Backup product lets you back up a running MySQL database, including InnoDB and MyISAM tables, with minimal disruption to operations while producing a consistent snapshot of the database. When MySQL Enterprise Backup is copying InnoDB tables, reads and writes to both InnoDB and MyISAM tables can continue. During the copying of MyISAM and other non-InnoDB tables, reads (but not writes) to those tables are permitted. In addition, MySQL Enterprise Backup can create compressed backup files, and back up subsets of InnoDB tables. In conjunction with the MySQL binary log, you can perform point-in-time recovery.

For more information about MySQL Enterprise Backup, see Section 29.2, "MySQL Enterprise Backup Overview".

Copying Data Files (Cold Backup Method)

You can move an InnoDB database simply by copying all the relevant files listed under "Cold Backups" in Section 15.17.1, "InnoDB Backup".

Like MyISAM data files, InnoDB data and log files are binary-compatible on all platforms having the same floating-point number format. If the floating-point formats differ but you have not used FLOAT or DOUBLE data types in your tables, then the procedure is the same: simply copy the relevant files.

When you move or copy .ibd files (for tables created in file-per-table tablespaces), the database directory name must be the same on the source and destination systems. The table definition stored in the InnoDB shared tablespace includes the database name. The transaction IDs and log sequence numbers stored in the tablespace files also differ between databases.

To move an .ibd file and the associated table from one database to another, use a RENAME TABLE statement:

RENAME TABLE db1.tbl_name TO db2.tbl_name;

If you have a "clean" backup of an .ibd file, you can restore it to the MySQL installation from which it originated as follows:

  1. The table must not have been dropped or truncated since you copied the .ibd file, because doing so changes the table ID stored inside the tablespace.

  2. Issue this ALTER TABLE statement to delete the current .ibd file:

    ALTER TABLE tbl_name DISCARD TABLESPACE;
    
  3. Copy the backup .ibd file to the proper database directory.

  4. Issue this ALTER TABLE statement to tell InnoDB to use the new .ibd file for the table:

    ALTER TABLE tbl_name IMPORT TABLESPACE;
    
    Note

    The ALTER TABLE ... IMPORT TABLESPACE feature does not enforce foreign key constraints on imported data.

In this context, a "clean" .ibd file backup is one for which the following requirements are satisfied:

  • There are no uncommitted modifications by transactions in the .ibd file.

  • There are no unmerged insert buffer entries in the .ibd file.

  • Purge has removed all delete-marked index records from the .ibd file.

  • mysqld has flushed all modified pages of the .ibd file from the buffer pool to the file.

You can make a clean backup .ibd file using the following method:

  1. Stop all activity from the mysqld server and commit all transactions.

  2. Wait until SHOW ENGINE INNODB STATUS shows that there are no active transactions in the database, and the main thread status of InnoDB is Waiting for server activity. Then you can make a copy of the .ibd file.

Another method for making a clean copy of an .ibd file is to use the MySQL Enterprise Backup product:

  1. Use MySQL Enterprise Backup to back up the InnoDB installation.

  2. Start a second mysqld server on the backup and let it clean up the .ibd files in the backup.

Export and Import (mysqldump)

You can use mysqldump to dump your tables on one machine and then import the dump files on the other machine. Using this method, it does not matter whether the data formats differ between the two machines.

One way to increase the performance of this method is to switch off autocommit mode when importing data, assuming that the tablespace has enough space for the big rollback segments that the import transactions generate. Do the commit only after importing a whole table or a segment of a table.
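The import pattern above can be sketched as follows from the mysql client, assuming dump.sql is a dump file produced by mysqldump on the source machine:

```sql
-- Switch off autocommit so the whole import runs as one transaction,
-- then commit once at the end (requires rollback segment space).
SET autocommit=0;
SOURCE dump.sql;   -- mysql client command that replays the dump file
COMMIT;
SET autocommit=1;
```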

15.8.1.4 Converting Tables from MyISAM to InnoDB

If you have MyISAM tables that you want to convert to InnoDB for better reliability and scalability, review the following guidelines and tips before converting.

Note

Partitioned MyISAM tables created in previous versions of MySQL are not compatible with MySQL 8.0. Such tables must be prepared prior to upgrade, either by removing the partitioning, or by converting them to InnoDB. See Section 22.6.2, “Partitioning Limitations Relating to Storage Engines”, for more information.

Adjusting Memory Usage for MyISAM and InnoDB

As you transition away from MyISAM tables, lower the value of the key_buffer_size configuration option to free memory no longer needed for caching results. Increase the value of the innodb_buffer_pool_size configuration option, which performs a similar role of allocating cache memory for InnoDB tables. The InnoDB buffer pool caches both table data and index data, speeding up lookups for queries and keeping query results in memory for reuse. For guidance regarding buffer pool size configuration, see Section 8.12.3.1, “How MySQL Uses Memory”.

On a busy server, run benchmarks with the query cache turned off. The InnoDB buffer pool provides similar benefits, so the query cache might be tying up memory unnecessarily. For information about the query cache, see Section 8.10.3, “The MySQL Query Cache”.

Handling Too-Long Or Too-Short Transactions

Because MyISAM tables do not support transactions, you might not have paid much attention to the autocommit configuration option and the COMMIT and ROLLBACK statements. These keywords are important to allow multiple sessions to read and write InnoDB tables concurrently, providing substantial scalability benefits in write-heavy workloads.

While a transaction is open, the system keeps a snapshot of the data as seen at the beginning of the transaction, which can cause substantial overhead if the system inserts, updates, and deletes millions of rows while a stray transaction keeps running. Thus, take care to avoid transactions that run for too long:

  • If you are using a mysql session for interactive experiments, always COMMIT (to finalize the changes) or ROLLBACK (to undo the changes) when finished. Close down interactive sessions rather than leave them open for long periods, to avoid keeping transactions open for long periods by accident.

  • Make sure that any error handlers in your application also ROLLBACK incomplete changes or COMMIT completed changes.

  • ROLLBACK is a relatively expensive operation, because INSERT, UPDATE, and DELETE operations are written to InnoDB tables prior to the COMMIT, with the expectation that most changes are committed successfully and rollbacks are rare. When experimenting with large volumes of data, avoid making changes to large numbers of rows and then rolling back those changes.

  • When loading large volumes of data with a sequence of INSERT statements, periodically COMMIT the results to avoid having transactions that last for hours. In typical load operations for data warehousing, if something goes wrong, you truncate the table (using TRUNCATE TABLE) and start over from the beginning rather than doing a ROLLBACK.

The preceding tips save memory and disk space that can be wasted during too-long transactions. When transactions are shorter than they should be, the problem is excessive I/O. With each COMMIT, MySQL makes sure each change is safely recorded to disk, which involves some I/O.

  • For most operations on InnoDB tables, you should use the setting autocommit=0. From an efficiency perspective, this avoids unnecessary I/O when you issue large numbers of consecutive INSERT, UPDATE, or DELETE statements. From a safety perspective, this allows you to issue a ROLLBACK statement to recover lost or garbled data if you make a mistake on the mysql command line, or in an exception handler in your application.

  • The time when autocommit=1 is suitable for InnoDB tables is when running a sequence of queries for generating reports or analyzing statistics. In this situation, there is no I/O penalty related to COMMIT or ROLLBACK, and InnoDB can automatically optimize the read-only workload.

  • If you make a series of related changes, finalize all the changes at once with a single COMMIT at the end. For example, if you insert related pieces of information into several tables, do a single COMMIT after making all the changes. Or if you run many consecutive INSERT statements, do a single COMMIT after all the data is loaded; if you are doing millions of INSERT statements, perhaps split up the huge transaction by issuing a COMMIT every ten thousand or hundred thousand records, so the transaction does not grow too large.

  • Remember that even a SELECT statement opens a transaction, so after running some report or debugging queries in an interactive mysql session, either issue a COMMIT or close the mysql session.
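The batched-commit advice above can be sketched as follows (table and column names hypothetical):

```sql
-- Commit each batch of a long load so the transaction
-- does not grow too large.
SET autocommit=0;
INSERT INTO big_table SELECT * FROM staging_table
  WHERE id BETWEEN 1 AND 100000;
COMMIT;
INSERT INTO big_table SELECT * FROM staging_table
  WHERE id BETWEEN 100001 AND 200000;
COMMIT;
SET autocommit=1;
```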

Handling Deadlocks

You might see warning messages referring to deadlocks in the MySQL error log, or the output of SHOW ENGINE INNODB STATUS. Despite the scary-sounding name, a deadlock is not a serious issue for InnoDB tables, and often does not require any corrective action. When two transactions start modifying multiple tables, accessing the tables in a different order, they can reach a state where each transaction is waiting for the other and neither can proceed. When deadlock detection is enabled (the default), MySQL immediately detects this condition and cancels (rolls back) the smaller transaction, allowing the other to proceed. If deadlock detection is disabled using the innodb_deadlock_detect configuration option, InnoDB relies on the innodb_lock_wait_timeout setting to roll back transactions in case of a deadlock.

Either way, your applications need error-handling logic to restart a transaction that is forcibly cancelled due to a deadlock. When you re-issue the same SQL statements as before, the original timing issue no longer applies. Either the other transaction has already finished and yours can proceed, or the other transaction is still in progress and your transaction waits until it finishes.

If deadlock warnings occur constantly, you might review the application code to reorder the SQL operations in a consistent way, or to shorten the transactions. You can test with the innodb_print_all_deadlocks option enabled to see all deadlock warnings in the MySQL error log, rather than only the last warning in the SHOW ENGINE INNODB STATUS output.
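To collect the diagnostics mentioned above, you can enable deadlock logging and inspect the engine status; both statements below are standard MySQL syntax:

```sql
-- Log every deadlock to the MySQL error log, not just the latest one.
SET GLOBAL innodb_print_all_deadlocks = ON;

-- Inspect the most recent deadlock
-- (see the LATEST DETECTED DEADLOCK section of the output).
SHOW ENGINE INNODB STATUS\G
```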

For more information, see Section 15.5.5, “Deadlocks in InnoDB”.

Planning the Storage Layout

To get the best performance from InnoDB tables, you can adjust a number of parameters related to storage layout.

When you convert MyISAM tables that are large, frequently accessed, and hold vital data, investigate and consider the innodb_file_per_table and innodb_page_size configuration options, and the ROW_FORMAT and KEY_BLOCK_SIZE clauses of the CREATE TABLE statement.

During your initial experiments, the most important setting is innodb_file_per_table. When this setting is enabled, which is the default, new InnoDB tables are implicitly created in file-per-table tablespaces. In contrast with the InnoDB system tablespace, file-per-table tablespaces allow disk space to be reclaimed by the operating system when a table is truncated or dropped. File-per-table tablespaces also support DYNAMIC and COMPRESSED row formats and associated features such as table compression, efficient off-page storage for long variable-length columns, and large index prefixes. For more information, see Section 15.7.4, “InnoDB File-Per-Table Tablespaces”.

You can also store InnoDB tables in a shared general tablespace, which supports multiple tables and all row formats. For more information, see Section 15.7.9, “InnoDB General Tablespaces”.
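As a sketch of the general-tablespace option, the following statements create a shared tablespace and place a table in it (file and table names hypothetical):

```sql
-- Create a shared general tablespace backed by one data file.
CREATE TABLESPACE ts1 ADD DATAFILE 'ts1.ibd' ENGINE=InnoDB;

-- Place a new table in that tablespace.
CREATE TABLE t_in_ts1 (c1 INT PRIMARY KEY) TABLESPACE ts1;
```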

Converting an Existing Table

To convert a non-InnoDB table to use InnoDB, use ALTER TABLE:

ALTER TABLE table_name ENGINE=InnoDB;

Cloning the Structure of a Table

You might make an InnoDB table that is a clone of a MyISAM table, rather than using ALTER TABLE to perform conversion, to test the old and new table side-by-side before switching.

Create an empty InnoDB table with identical column and index definitions. Use SHOW CREATE TABLE table_name\G to see the full CREATE TABLE statement to use. Change the ENGINE clause to ENGINE=INNODB.
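The cloning procedure above can be sketched as follows (table names and columns hypothetical; copy the real column list from the SHOW CREATE TABLE output):

```sql
-- 1. Capture the MyISAM table's definition.
SHOW CREATE TABLE myisam_table\G

-- 2. Re-issue the statement with a new name and ENGINE=InnoDB
--    (column list abbreviated; use yours from step 1).
CREATE TABLE innodb_table (
  id INT NOT NULL,
  payload VARCHAR(255),
  PRIMARY KEY (id)
) ENGINE=InnoDB;
```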

Transferring Existing Data

To transfer a large volume of data into an empty InnoDB table created as shown in the previous section, insert the rows with INSERT INTO innodb_table SELECT * FROM myisam_table ORDER BY primary_key_columns.

You can also create the indexes for the InnoDB table after inserting the data. Historically, creating new secondary indexes was a slow operation for InnoDB, but now you can create the indexes after the data is loaded with relatively little overhead from the index creation step.

If you have UNIQUE constraints on secondary keys, you can speed up a table import by turning off the uniqueness checks temporarily during the import operation:

SET unique_checks=0;
... import operation ...
SET unique_checks=1;

For big tables, this saves disk I/O because InnoDB can use its change buffer to write secondary index records as a batch. Be certain that the data contains no duplicate keys. unique_checks permits but does not require storage engines to ignore duplicate keys.

For better control over the insertion process, you can insert big tables in pieces:

INSERT INTO newtable SELECT * FROM oldtable
   WHERE yourkey > something AND yourkey <= somethingelse;

After all records are inserted, you can rename the tables.

During the conversion of big tables, increase the size of the InnoDB buffer pool to reduce disk I/O, to a maximum of 80% of physical memory. You can also increase the size of InnoDB log files.

Storage Requirements

If you intend to make several temporary copies of your data in InnoDB tables during the conversion process, it is recommended that you create the tables in file-per-table tablespaces so that you can reclaim the disk space when you drop the tables. When the innodb_file_per_table configuration option is enabled (the default), newly created InnoDB tables are implicitly created in file-per-table tablespaces.

Whether you convert the MyISAM table directly or create a cloned InnoDB table, make sure that you have sufficient disk space to hold both the old and new tables during the process. InnoDB tables require more disk space than MyISAM tables. If an ALTER TABLE operation runs out of space, it starts a rollback, and that can take hours if it is disk-bound. For inserts, InnoDB uses the insert buffer to merge secondary index records to indexes in batches. That saves a lot of disk I/O. For rollback, no such mechanism is used, and the rollback can take 30 times longer than the insertion.

In the case of a runaway rollback, if you do not have valuable data in your database, it may be advisable to kill the database process rather than wait for millions of disk I/O operations to complete. For the complete procedure, see Section 15.20.2, “Forcing InnoDB Recovery”.

Defining a PRIMARY KEY for Each Table

The PRIMARY KEY clause is a critical factor affecting the performance of MySQL queries and the space usage for tables and indexes. The primary key uniquely identifies a row in a table. Every row in the table must have a primary key value, and no two rows can have the same primary key value.

These are guidelines for the primary key, followed by more detailed explanations.

  • Declare a PRIMARY KEY for each table. Typically, it is the most important column that you refer to in WHERE clauses when looking up a single row.

  • Declare the PRIMARY KEY clause in the original CREATE TABLE statement, rather than adding it later through an ALTER TABLE statement.

  • Choose the column with its data type in mind: prefer numeric columns over character or string ones.

  • Consider using an auto-increment column if there is not another stable, unique, non-null, numeric column to use.

  • An auto-increment column is also a good choice if there is any doubt whether the value of the primary key column could ever change. Changing the value of a primary key column is an expensive operation, possibly involving rearranging data within the table and within each secondary index.

Consider adding a primary key to any table that does not already have one. Use the smallest practical numeric type based on the maximum projected size of the table. This can make each row slightly more compact, which can yield substantial space savings for large tables. The space savings are multiplied if the table has any secondary indexes, because the primary key value is repeated in each secondary index entry. In addition to reducing data size on disk, a small primary key also lets more data fit into the buffer pool, speeding up all kinds of operations and improving concurrency.

If the table already has a primary key on some longer column, such as a VARCHAR, consider adding a new unsigned AUTO_INCREMENT column and switching the primary key to that, even if that column is not referenced in queries. This design change can produce substantial space savings in the secondary indexes. You can designate the former primary key columns as UNIQUE NOT NULL to enforce the same constraints as the PRIMARY KEY clause, that is, to prevent duplicate or null values across all those columns.

If you spread related information across multiple tables, typically each table uses the same column for its primary key. For example, a personnel database might have several tables, each with a primary key of numeric employee number. A sales database might have some tables with a primary key of customer number, and other tables with a primary key of order number. Because lookups using the primary key are very fast, you can construct efficient join queries for such tables.

If you leave the PRIMARY KEY clause out entirely, MySQL creates an invisible one for you. It is a 6-byte value that might be longer than you need, thus wasting space. Because it is hidden, you cannot refer to it in queries.
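The guidelines above can be sketched in DDL (table and column names hypothetical): a small unsigned auto-increment primary key, with a former long string key demoted to a UNIQUE NOT NULL constraint:

```sql
-- Smallest practical numeric primary key, declared in the
-- original CREATE TABLE statement.
CREATE TABLE employees (
  emp_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  name   VARCHAR(100) NOT NULL,
  PRIMARY KEY (emp_id)
) ENGINE=InnoDB;

-- Long string key kept only as a UNIQUE NOT NULL constraint;
-- the compact numeric column is the primary key repeated in
-- each secondary index entry.
CREATE TABLE products (
  product_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  sku        VARCHAR(64)  NOT NULL,
  PRIMARY KEY (product_id),
  UNIQUE KEY (sku)
) ENGINE=InnoDB;
```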

Application Performance Considerations

The reliability and scalability features of InnoDB require more disk storage than equivalent MyISAM tables. You might change the column and index definitions slightly, for better space utilization, reduced I/O and memory consumption when processing result sets, and better query optimization plans making efficient use of index lookups.

If you do set up a numeric ID column for the primary key, use that value to cross-reference with related values in any other tables, particularly for join queries. For example, rather than accepting a country name as input and doing queries searching for the same name, do one lookup to determine the country ID, then do other queries (or a single join query) to look up relevant information across several tables. Rather than storing a customer or catalog item number as a string of digits, potentially using up several bytes, convert it to a numeric ID for storing and querying. A 4-byte unsigned INT column can index over 4 billion items (with the US meaning of billion: 1000 million). For the ranges of the different integer types, see Section 11.2.1, “Integer Types (Exact Value) - INTEGER, INT, SMALLINT, TINYINT, MEDIUMINT, BIGINT”.
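The lookup-then-join pattern described above can be sketched as follows (schema hypothetical):

```sql
-- Look up the numeric ID once, then join on it, rather than
-- repeating string comparisons in every query.
SELECT o.order_id, c.customer_name
  FROM orders o
  JOIN customers c USING (customer_id)
 WHERE c.country_id =
       (SELECT country_id FROM countries WHERE name = 'Finland');
```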

Understanding Files Associated with InnoDB Tables

InnoDB files require more care and planning than MyISAM files do.

15.8.1.5 AUTO_INCREMENT Handling in InnoDB

InnoDB provides a configurable locking mechanism that can significantly improve scalability and performance of SQL statements that add rows to tables with AUTO_INCREMENT columns. To use the AUTO_INCREMENT mechanism with an InnoDB table, an AUTO_INCREMENT column must be defined as part of an index such that it is possible to perform the equivalent of an indexed SELECT MAX(ai_col) lookup on the table to obtain the maximum column value. Typically, this is achieved by making the column the first column of some table index.

This section describes the behavior of AUTO_INCREMENT lock modes, usage implications for different AUTO_INCREMENT lock mode settings, and how InnoDB initializes the AUTO_INCREMENT counter.

InnoDB AUTO_INCREMENT Lock Modes

This section describes the behavior of AUTO_INCREMENT lock modes used to generate auto-increment values, and how each lock mode affects replication. Auto-increment lock modes are configured at startup using the innodb_autoinc_lock_mode configuration parameter.

The following terms are used in describing innodb_autoinc_lock_mode settings:

  • INSERT-like statements

    All statements that generate new rows in a table, including INSERT, INSERT ... SELECT, REPLACE, REPLACE ... SELECT, and LOAD DATA. Includes simple-inserts, bulk-inserts, and mixed-mode inserts.

  • Simple inserts

    Statements for which the number of rows to be inserted can be determined in advance (when the statement is initially processed). This includes single-row and multiple-row INSERT and REPLACE statements that do not have a nested subquery, but not INSERT ... ON DUPLICATE KEY UPDATE.

  • Bulk inserts

    Statements for which the number of rows to be inserted (and the number of required auto-increment values) is not known in advance. This includes INSERT ... SELECT, REPLACE ... SELECT, and LOAD DATA statements, but not plain INSERT. InnoDB assigns new values for the AUTO_INCREMENT column one at a time as each row is processed.

  • Mixed-mode inserts

    These are simple insert statements that specify the auto-increment value for some (but not all) of the new rows. An example follows, where c1 is an AUTO_INCREMENT column of table t1:

    INSERT INTO t1 (c1,c2) VALUES (1,'a'), (NULL,'b'), (5,'c'), (NULL,'d');
    

    Another type of mixed-mode insert is INSERT ... ON DUPLICATE KEY UPDATE, which in the worst case is in effect an INSERT followed by an UPDATE, where the allocated value for the AUTO_INCREMENT column may or may not be used during the update phase.

There are three possible settings for the innodb_autoinc_lock_mode configuration parameter. The settings are 0, 1, or 2, for traditional, consecutive, or interleaved lock mode, respectively.

  • innodb_autoinc_lock_mode = 0 (traditional lock mode)

    The traditional lock mode provides the same behavior that existed before the innodb_autoinc_lock_mode configuration parameter was introduced in MySQL 5.1. The traditional lock mode option is provided for backward compatibility, performance testing, and working around issues with “mixed-mode inserts”, due to possible differences in semantics.

    In this lock mode, all INSERT-like statements obtain a special table-level AUTO-INC lock for inserts into tables with AUTO_INCREMENT columns. This lock is normally held to the end of the statement (not to the end of the transaction) to ensure that auto-increment values are assigned in a predictable and repeatable order for a given sequence of INSERT statements, and to ensure that auto-increment values assigned by any given statement are consecutive.

    In the case of statement-based replication, this means that when an SQL statement is replicated on a slave server, the same values are used for the auto-increment column as on the master server. The result of execution of multiple INSERT statements is deterministic, and the slave reproduces the same data as on the master. If auto-increment values generated by multiple INSERT statements were interleaved, the result of two concurrent INSERT statements would be nondeterministic, and could not reliably be propagated to a slave server using statement-based replication.

    To make this clear, consider an example that uses this table:

    CREATE TABLE t1 (
      c1 INT(11) NOT NULL AUTO_INCREMENT,
      c2 VARCHAR(10) DEFAULT NULL,
      PRIMARY KEY (c1)
    ) ENGINE=InnoDB;
    

    Suppose that there are two transactions running, each inserting rows into a table with an AUTO_INCREMENT column. One transaction is using an INSERT ... SELECT statement that inserts 1000 rows, and another is using a simple INSERT statement that inserts one row:

    Tx1: INSERT INTO t1 (c2) SELECT 1000 rows from another table ...
    Tx2: INSERT INTO t1 (c2) VALUES ('xxx');
    

    InnoDB cannot tell in advance how many rows are retrieved from the SELECT in the INSERT statement in Tx1, and it assigns the auto-increment values one at a time as the statement proceeds. With a table-level lock, held to the end of the statement, only one INSERT statement referring to table t1 can execute at a time, and the generation of auto-increment numbers by different statements is not interleaved. The auto-increment values generated by the Tx1 INSERT ... SELECT statement are consecutive, and the (single) auto-increment value used by the INSERT statement in Tx2 is either smaller or larger than all those used for Tx1, depending on which statement executes first.

    As long as the SQL statements execute in the same order when replayed from the binary log (when using statement-based replication, or in recovery scenarios), the results are the same as they were when Tx1 and Tx2 first ran. Thus, table-level locks held until the end of a statement make INSERT statements using auto-increment safe for use with statement-based replication. However, those table-level locks limit concurrency and scalability when multiple transactions are executing insert statements at the same time.

    In the preceding example, if there were no table-level lock, the value of the auto-increment column used for the INSERT in Tx2 depends on precisely when the statement executes. If the INSERT of Tx2 executes while the INSERT of Tx1 is running (rather than before it starts or after it completes), the specific auto-increment values assigned by the two INSERT statements are nondeterministic, and may vary from run to run.

    Under the consecutive lock mode, InnoDB can avoid using table-level AUTO-INC locks for simple insert statements where the number of rows is known in advance, and still preserve deterministic execution and safety for statement-based replication.

    If you are not using the binary log to replay SQL statements as part of recovery or replication, the interleaved lock mode can be used to eliminate all use of table-level AUTO-INC locks for even greater concurrency and performance, at the cost of permitting gaps in auto-increment numbers assigned by a statement and potentially having the numbers assigned by concurrently executing statements interleaved.

  • innodb_autoinc_lock_mode = 1 (consecutive lock mode)

    This is the default lock mode. In this mode, bulk inserts use the special AUTO-INC table-level lock and hold it until the end of the statement. This applies to all INSERT ... SELECT, REPLACE ... SELECT, and LOAD DATA statements. Only one statement holding the AUTO-INC lock can execute at a time. If the source table of the bulk insert operation is different from the target table, the AUTO-INC lock on the target table is taken after a shared lock is taken on the first row selected from the source table. If the source and target of the bulk insert operation are the same table, the AUTO-INC lock is taken after shared locks are taken on all selected rows.

    Simple inserts (for which the number of rows to be inserted is known in advance) avoid table-level AUTO-INC locks by obtaining the required number of auto-increment values under the control of a mutex (a light-weight lock) that is only held for the duration of the allocation process, not until the statement completes. No table-level AUTO-INC lock is used unless an AUTO-INC lock is held by another transaction. If another transaction holds an AUTO-INC lock, a simple insert waits for the AUTO-INC lock, as if it were a bulk insert.

    This lock mode ensures that, in the presence of INSERT statements where the number of rows is not known in advance (and where auto-increment numbers are assigned as the statement progresses), all auto-increment values assigned by any INSERT-like statement are consecutive, and operations are safe for statement-based replication.

    Simply put, this lock mode significantly improves scalability while being safe for use with statement-based replication. Further, as with traditional lock mode, auto-increment numbers assigned by any given statement are consecutive. There is no change in semantics compared to traditional mode for any statement that uses auto-increment, with one important exception.

    The exception is for mixed-mode inserts, where the user provides explicit values for an AUTO_INCREMENT column for some, but not all, rows in a multiple-row simple insert. For such inserts, InnoDB allocates more auto-increment values than the number of rows to be inserted. However, all values automatically assigned are consecutively generated (and thus higher than) the auto-increment value generated by the most recently executed previous statement. Excess numbers are lost.

  • innodb_autoinc_lock_mode = 2 (interleaved lock mode)

    In this lock mode, no INSERT-like statements use the table-level AUTO-INC lock, and multiple statements can execute at the same time. This is the fastest and most scalable lock mode, but it is not safe when using statement-based replication or recovery scenarios when SQL statements are replayed from the binary log.

    In this lock mode, auto-increment values are guaranteed to be unique and monotonically increasing across all concurrently executing INSERT-like statements. However, because multiple statements can be generating numbers at the same time (that is, allocation of numbers is interleaved across statements), the values generated for the rows inserted by any given statement may not be consecutive.

    If the only statements executing are simple inserts where the number of rows to be inserted is known ahead of time, there are no gaps in the numbers generated for a single statement, except for mixed-mode inserts. However, when bulk inserts are executed, there may be gaps in the auto-increment values assigned by any given statement.
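The lock mode is configured at server startup and can be inspected at runtime; a minimal sketch:

```sql
-- innodb_autoinc_lock_mode is not dynamic; set it at startup,
-- for example in my.cnf:
--   [mysqld]
--   innodb_autoinc_lock_mode = 2
--
-- Inspect the mode currently in effect
-- (this section documents 1, consecutive, as the default):
SELECT @@innodb_autoinc_lock_mode;
```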

InnoDB AUTO_INCREMENT Lock Mode Usage Implications

  • Using auto-increment with replication

    If you are using statement-based replication, set innodb_autoinc_lock_mode to 0 or 1 and use the same value on the master and its slaves. Auto-increment values are not ensured to be the same on the slaves as on the master if you use innodb_autoinc_lock_mode = 2 (interleaved) or configurations where the master and slaves do not use the same lock mode.

    If you are using row-based or mixed-format replication, all of the auto-increment lock modes are safe, since row-based replication is not sensitive to the order of execution of the SQL statements (and the mixed format uses row-based replication for any statements that are unsafe for statement-based replication).

  • Lost auto-increment values and sequence gaps

    In all lock modes (0, 1, and 2), if a transaction that generated auto-increment values rolls back, those auto-increment values are lost. Once a value is generated for an auto-increment column, it cannot be rolled back, whether or not the INSERT-like statement is completed, and whether or not the containing transaction is rolled back. Such lost values are not reused. Thus, there may be gaps in the values stored in an AUTO_INCREMENT column of a table.

  • Specifying NULL or 0 for the AUTO_INCREMENT column

    In all lock modes (0, 1, and 2), if a user specifies NULL or 0 for the AUTO_INCREMENT column in an INSERT, InnoDB treats the row as if the value was not specified and generates a new value for it.

  • Assigning a negative value to the AUTO_INCREMENT column

    In all lock modes (0, 1, and 2), the behavior of the auto-increment mechanism is not defined if you assign a negative value to the AUTO_INCREMENT column.

  • If the AUTO_INCREMENT value becomes larger than the maximum integer for the specified integer type

    In all lock modes (0, 1, and 2), the behavior of the auto-increment mechanism is not defined if the value becomes larger than the maximum integer that can be stored in the specified integer type.

  • Gaps in auto-increment values for bulk inserts

    With innodb_autoinc_lock_mode set to 0 (traditional) or 1 (consecutive), the auto-increment values generated by any given statement are consecutive, without gaps, because the table-level AUTO-INC lock is held until the end of the statement, and only one such statement can execute at a time.

    With innodb_autoinc_lock_mode set to 2 (interleaved), there may be gaps in the auto-increment values generated by bulk inserts, but only if there are concurrently executing INSERT-like statements.

    For lock modes 1 or 2, gaps may occur between successive statements because for bulk inserts the exact number of auto-increment values required by each statement may not be known and overestimation is possible.

  • Auto-increment values assigned by mixed-mode inserts

    Consider a mixed-mode insert, where a simple insert specifies the auto-increment value for some (but not all) resulting rows. Such a statement behaves differently in lock modes 0, 1, and 2. For example, assume c1 is an AUTO_INCREMENT column of table t1, and that the most recent automatically generated sequence number is 100.

    mysql> CREATE TABLE t1 (
        -> c1 INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, 
        -> c2 CHAR(1)
        -> ) ENGINE = INNODB;

    Now, consider the following mixed-mode insert statement:

    mysql> INSERT INTO t1 (c1,c2) VALUES (1,'a'), (NULL,'b'), (5,'c'), (NULL,'d');
    

    With innodb_autoinc_lock_mode set to 0 (traditional), the four new rows are:

    mysql> SELECT c1, c2 FROM t1 ORDER BY c2;
    +-----+------+
    | c1  | c2   |
    +-----+------+
    |   1 | a    |
    | 101 | b    |
    |   5 | c    |
    | 102 | d    |
    +-----+------+
    

    The next available auto-increment value is 103 because the auto-increment values are allocated one at a time, not all at once at the beginning of statement execution. This result is true whether or not there are concurrently executing INSERT-like statements (of any type).

    With innodb_autoinc_lock_mode set to 1 (consecutive), the four new rows are also:

    mysql> SELECT c1, c2 FROM t1 ORDER BY c2;
    +-----+------+
    | c1  | c2   |
    +-----+------+
    |   1 | a    |
    | 101 | b    |
    |   5 | c    |
    | 102 | d    |
    +-----+------+
    

    However, in this case, the next available auto-increment value is 105, not 103 because four auto-increment values are allocated at the time the statement is processed, but only two are used. This result is true whether or not there are concurrently executing INSERT-like statements (of any type).

    With innodb_autoinc_lock_mode set to mode 2 (interleaved), the four new rows are:

    mysql> SELECT c1, c2 FROM t1 ORDER BY c2;
    +-----+------+
    | c1  | c2   |
    +-----+------+
    |   1 | a    |
    |   x | b    |
    |   5 | c    |
    |   y | d    |
    +-----+------+
    

    The values of x and y are unique and larger than any previously generated rows. However, the specific values of x and y depend on the number of auto-increment values generated by concurrently executing statements.

    Finally, consider the following statement, issued when the most-recently generated sequence number was the value 4:

    mysql> INSERT INTO t1 (c1,c2) VALUES (1,'a'), (NULL,'b'), (5,'c'), (NULL,'d');
    

    With any innodb_autoinc_lock_mode setting, this statement generates a duplicate-key error 23000 (Can't write; duplicate key in table) because 5 is allocated for the row (NULL, 'b') and insertion of the row (5, 'c') fails.

  • Modifying AUTO_INCREMENT column values in the middle of a sequence of INSERT statements

    In MySQL 5.7 and earlier, modifying an AUTO_INCREMENT column value in the middle of a sequence of INSERT statements could lead to Duplicate entry errors. For example, if you performed an UPDATE operation that changed an AUTO_INCREMENT column value to a value larger than the current maximum auto-increment value, subsequent INSERT operations that did not specify an unused auto-increment value could encounter Duplicate entry errors. In MySQL 8.0 and later, if you modify an AUTO_INCREMENT column value to a value larger than the current maximum auto-increment value, the new value is persisted, and subsequent INSERT operations allocate auto-increment values starting from the new, larger value. This behavior is demonstrated in the following example.

    mysql> CREATE TABLE t1 (
        -> c1 INT NOT NULL AUTO_INCREMENT,
        -> PRIMARY KEY (c1)
        ->  ) ENGINE = InnoDB;
    
    mysql> INSERT INTO t1 VALUES(0), (0), (3);
    
    mysql> SELECT c1 FROM t1;
    +----+
    | c1 |
    +----+
    |  1 |
    |  2 |
    |  3 |
    +----+
    
    mysql> UPDATE t1 SET c1 = 4 WHERE c1 = 1;
    
    mysql> SELECT c1 FROM t1;
    +----+
    | c1 |
    +----+
    |  2 |
    |  3 |
    |  4 |
    +----+
    
    mysql> INSERT INTO t1 VALUES(0);
    
    mysql> SELECT c1 FROM t1;
    +----+
    | c1 |
    +----+
    |  2 |
    |  3 |
    |  4 |
    |  5 |
    +----+ 
InnoDB AUTO_INCREMENT Counter Initialization

This section describes how InnoDB initializes AUTO_INCREMENT counters.

If you specify an AUTO_INCREMENT column for an InnoDB table, the in-memory table object contains a special counter called the auto-increment counter that is used when assigning new values for the column.

In MySQL 5.7 and earlier, the auto-increment counter is stored only in main memory, not on disk. To initialize an auto-increment counter after a server restart, InnoDB would execute the equivalent of the following statement on the first insert into a table containing an AUTO_INCREMENT column.

SELECT MAX(ai_col) FROM table_name FOR UPDATE;

In MySQL 8.0, this behavior is changed. The current maximum auto-increment counter value is written to the redo log each time it changes and is saved to an engine-private system table on each checkpoint. These changes make the current maximum auto-increment counter value persistent across server restarts.

On a server restart following a normal shutdown, InnoDB initializes the in-memory auto-increment counter using the current maximum auto-increment value stored in the data dictionary system table.

On a server restart during crash recovery, InnoDB initializes the in-memory auto-increment counter using the current maximum auto-increment value stored in the data dictionary system table and scans the redo log for auto-increment counter values written since the last checkpoint. If a redo-logged value is greater than the in-memory counter value, the redo-logged value is applied. However, in the case of a server crash, reuse of a previously allocated auto-increment value cannot be guaranteed. Each time the current maximum auto-increment value is changed due to an INSERT or UPDATE operation, the new value is written to the redo log, but if the crash occurs before the redo log is flushed to disk, the previously allocated value could be reused when the auto-increment counter is initialized after the server is restarted.

The only circumstance in which InnoDB uses the equivalent of a SELECT MAX(ai_col) FROM table_name FOR UPDATE statement in MySQL 8.0 and later to initialize an auto-increment counter is when importing a tablespace without a .cfg metadata file. Otherwise, the current maximum auto-increment counter value is read from the .cfg metadata file.

In MySQL 5.7 and earlier, a server restart cancels the effect of the AUTO_INCREMENT = N table option, which may be used in a CREATE TABLE or ALTER TABLE statement to set an initial counter value or alter the existing counter value, respectively. In MySQL 8.0, a server restart does not cancel the effect of the AUTO_INCREMENT = N table option. If you initialize the auto-increment counter to a specific value, or if you alter the auto-increment counter value to a larger value, the new value is persisted across server restarts.

Note

ALTER TABLE ... AUTO_INCREMENT = N can only change the auto-increment counter value to a value larger than the current maximum.
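As a minimal sketch of the persisted-counter behavior (the table name t1 is illustrative, and MySQL 8.0 or later is assumed):

mysql> CREATE TABLE t1 (
       c1 INT NOT NULL AUTO_INCREMENT,
       PRIMARY KEY (c1)
       ) ENGINE = InnoDB;

mysql> ALTER TABLE t1 AUTO_INCREMENT = 100;

-- In MySQL 8.0, the value 100 survives a server restart;
-- the next insert is assigned c1 = 100.
mysql> INSERT INTO t1 VALUES (NULL);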

In MySQL 5.7 and earlier, a server restart immediately following a ROLLBACK operation could result in the reuse of auto-increment values that were previously allocated to the rolled-back transaction, effectively rolling back the current maximum auto-increment value. In MySQL 8.0, the current maximum auto-increment value is persisted, preventing the reuse of previously allocated values.

If a SHOW TABLE STATUS statement examines a table before the auto-increment counter is initialized, InnoDB opens the table and initializes the counter value using the current maximum auto-increment value that is stored in the data dictionary system table. The value is stored in memory for use by later inserts or updates. Initialization of the counter value uses a normal exclusive-locking read on the table which lasts to the end of the transaction. InnoDB follows the same procedure when initializing the auto-increment counter for a newly created table that has a user-specified auto-increment value that is greater than 0.

After the auto-increment counter is initialized, if you do not explicitly specify an auto-increment value when inserting a row, InnoDB implicitly increments the counter and assigns the new value to the column. If you insert a row that explicitly specifies an auto-increment column value, and the value is greater than the current maximum counter value, the counter is set to the specified value.

InnoDB uses the in-memory auto-increment counter as long as the server runs. When the server is stopped and restarted, InnoDB reinitializes the auto-increment counter, as described earlier.

The auto_increment_offset variable determines the starting point for the AUTO_INCREMENT column value. The default setting is 1.

The auto_increment_increment variable controls the interval between successive column values. The default setting is 1.
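For example (a hypothetical session shown only to illustrate the two variables; the table name seq_demo is illustrative), generated values start at the offset and step by the increment:

mysql> SET auto_increment_offset = 5, auto_increment_increment = 10;

mysql> CREATE TABLE seq_demo (c1 INT AUTO_INCREMENT PRIMARY KEY) ENGINE=InnoDB;

mysql> INSERT INTO seq_demo VALUES (NULL), (NULL), (NULL);

mysql> SELECT c1 FROM seq_demo;   -- generates 5, 15, 25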

15.8.1.6 InnoDB and FOREIGN KEY Constraints

How the InnoDB storage engine handles foreign key constraints is described under the following topics in this section:

For foreign key usage information and examples, see 13.1.15.6 节, “Using FOREIGN KEY Constraints”.

Foreign Key Definitions

Foreign key definitions for InnoDB tables are subject to the following conditions:

  • InnoDB permits a foreign key to reference any index column or group of columns. However, the referenced table must have an index where the referenced columns are listed as the first columns in the same order.

  • InnoDB does not currently support foreign keys for tables with user-defined partitioning. This means that no user-partitioned InnoDB table may contain foreign key references or columns referenced by foreign keys.

  • InnoDB allows a foreign key constraint to reference a non-unique key. This is an InnoDB extension to standard SQL.

Referential Actions

Referential actions for foreign keys of InnoDB tables are subject to the following conditions:

  • While SET DEFAULT is allowed by the MySQL Server, it is rejected as invalid by InnoDB. CREATE TABLE and ALTER TABLE statements using this clause are not allowed for InnoDB tables.

  • If there are several rows in the parent table with the same referenced key value, InnoDB performs foreign key checks as if the other parent rows with the same key value do not exist. For example, if you have defined a RESTRICT type constraint, and there is a child row with several parent rows, InnoDB does not permit the deletion of any of those parent rows.

  • InnoDB performs cascading operations through a depth-first algorithm, based on records in the indexes corresponding to the foreign key constraints.

  • If ON UPDATE CASCADE or ON UPDATE SET NULL recurses to update the same table it has previously updated during the cascade, it acts like RESTRICT. This means that you cannot use self-referential ON UPDATE CASCADE or ON UPDATE SET NULL operations. This is to prevent infinite loops resulting from cascaded updates. A self-referential ON DELETE SET NULL, on the other hand, is possible, as is a self-referential ON DELETE CASCADE. Cascading operations may not be nested more than 15 levels deep.

  • Like MySQL in general, in an SQL statement that inserts, deletes, or updates many rows, InnoDB checks UNIQUE and FOREIGN KEY constraints row-by-row. When performing foreign key checks, InnoDB sets shared row-level locks on child or parent records it has to look at. InnoDB checks foreign key constraints immediately; the check is not deferred to transaction commit. According to the SQL standard, the default behavior should be deferred checking. That is, constraints are only checked after the entire SQL statement has been processed. Until InnoDB implements deferred constraint checking, some things are impossible, such as deleting a record that refers to itself using a foreign key.
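The referential actions described above can be sketched with a simple parent/child pair (table names are illustrative):

mysql> CREATE TABLE parent (
       id INT NOT NULL,
       PRIMARY KEY (id)
       ) ENGINE=InnoDB;

mysql> CREATE TABLE child (
       id INT,
       parent_id INT,
       INDEX par_ind (parent_id),
       FOREIGN KEY (parent_id)
           REFERENCES parent(id)
           ON DELETE CASCADE
       ) ENGINE=InnoDB;

-- Deleting a parent row cascades the delete to matching child rows;
-- note that cascaded actions do not activate triggers on child.
mysql> DELETE FROM parent WHERE id = 1;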

Foreign Key Restrictions for Generated Columns and Virtual Indexes
  • A foreign key constraint on a stored generated column cannot use ON UPDATE CASCADE, ON DELETE SET NULL, ON UPDATE SET NULL, ON DELETE SET DEFAULT, or ON UPDATE SET DEFAULT.

  • A foreign key constraint cannot reference a virtual generated column.

  • Prior to MySQL 8.0, a foreign key constraint cannot reference a secondary index defined on a virtual generated column.

Foreign Key Usage and Error Information

You can obtain general information about foreign keys and their usage by querying the INFORMATION_SCHEMA.KEY_COLUMN_USAGE table. More information specific to InnoDB tables can be found in the INNODB_SYS_FOREIGN and INNODB_SYS_FOREIGN_COLS tables, also in the INFORMATION_SCHEMA database.

In addition to SHOW ERRORS, in the event of a foreign key error involving InnoDB tables (usually Error 150 in the MySQL Server), you can obtain a detailed explanation of the most recent InnoDB foreign key error by checking the output of SHOW ENGINE INNODB STATUS.
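For example, foreign key relationships that reference tables in a given schema can be listed with a query along these lines (the schema name 'test' is a placeholder):

mysql> SELECT TABLE_NAME, COLUMN_NAME, CONSTRAINT_NAME,
              REFERENCED_TABLE_NAME, REFERENCED_COLUMN_NAME
       FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE
       WHERE REFERENCED_TABLE_SCHEMA = 'test'
             AND REFERENCED_TABLE_NAME IS NOT NULL;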

15.8.1.7 Limits on InnoDB Tables

The following topics describe limits on InnoDB tables:

Warning

Before using NFS with InnoDB, review potential issues outlined in Using NFS with MySQL.

Maximums and Minimums
  • A table can contain a maximum of 1017 columns. Virtual generated columns are included in this limit.

  • A table can contain a maximum of 64 secondary indexes.

  • The index key prefix length limit is 3072 bytes for InnoDB tables that use the DYNAMIC or COMPRESSED row format.

    The index key prefix length limit is 767 bytes for InnoDB tables that use the REDUNDANT or COMPACT row format. For example, you might hit this limit with a column prefix index of more than 191 characters on a TEXT or VARCHAR column, assuming a utf8mb4 character set and a maximum of 4 bytes for each character.

    Attempting to use an index key prefix length that exceeds the limit returns an error.

    The limits apply not only to index key prefixes but also to full-column index keys.

  • If you reduce the InnoDB page size to 8KB or 4KB by specifying the innodb_page_size option when creating the MySQL instance, the maximum length of the index key is lowered proportionally, based on the limit of 3072 bytes for a 16KB page size. That is, the maximum index key length is 1536 bytes when the page size is 8KB, and 768 bytes when the page size is 4KB.

  • A maximum of 16 columns is permitted for multicolumn indexes. Exceeding the limit returns an error.

    ERROR 1070 (42000): Too many key parts specified; max 16 parts allowed
    
  • The maximum row length, except for variable-length columns (VARBINARY, VARCHAR, BLOB, and TEXT), is slightly less than half of a page. For example, the maximum row length is about 8000 bytes for the default innodb_page_size of 16KB; for an InnoDB page size of 64KB, the maximum row length is about 16000 bytes. LONGBLOB and LONGTEXT columns must be less than 4GB, and the total row length, including BLOB and TEXT columns, must be less than 4GB.

    If a row is less than half a page long, all of it is stored locally within the page. If it exceeds half a page, variable-length columns are chosen for external off-page storage until the row fits within half a page, as described in 15.11.2 节, “File Space Management”.

  • Although InnoDB supports row sizes larger than 65,535 bytes internally, MySQL itself imposes a row-size limit of 65,535 bytes on the combined size of all columns:

    mysql> CREATE TABLE t (a VARCHAR(8000), b VARCHAR(10000),
        -> c VARCHAR(10000), d VARCHAR(10000), e VARCHAR(10000),
        -> f VARCHAR(10000), g VARCHAR(10000)) ENGINE=InnoDB;
    ERROR 1118 (42000): Row size too large. The maximum row size for the
    used table type, not counting BLOBs, is 65535. You have to change some
    columns to TEXT or BLOBs
    
  • On some older operating systems, files must be less than 2GB. This is not an InnoDB limitation itself, but if you require a large tablespace, configure it using several smaller data files rather than one large data file.

  • The combined size of the InnoDB log files can be up to 512GB.

  • The minimum tablespace size is slightly larger than 10MB. The maximum tablespace size depends on the InnoDB page size.

    Table 15.6 InnoDB Maximum Tablespace Size

    InnoDB Page Size | Maximum Tablespace Size
    4KB              | 16TB
    8KB              | 32TB
    16KB             | 64TB
    32KB             | 128TB
    64KB             | 256TB

    The maximum tablespace size is also the maximum size for a table.

  • The default page size in InnoDB is 16KB. You can increase or decrease the page size by configuring the innodb_page_size option when creating the MySQL instance.

    32KB and 64KB page sizes are supported, but ROW_FORMAT=COMPRESSED is unsupported for page sizes greater than 16KB. For both 32KB and 64KB page sizes, the maximum record size is 16KB. For innodb_page_size=32k, the extent size is 2MB. For innodb_page_size=64k, the extent size is 4MB.

    A MySQL instance using a particular InnoDB page size cannot use data files or log files from an instance that uses a different page size.
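As a sketch of the index key prefix limits described above (assuming the default 16KB page size and a utf8mb4 character set; table names are illustrative), a prefix of 255 characters occupies up to 1020 bytes, which exceeds the 767-byte limit of the COMPACT row format but fits within the 3072-byte limit of DYNAMIC:

mysql> CREATE TABLE t_compact (c1 VARCHAR(500)) ENGINE=InnoDB
       ROW_FORMAT=COMPACT CHARACTER SET utf8mb4;

mysql> CREATE INDEX idx1 ON t_compact (c1(255));   -- 255 * 4 = 1020 bytes
ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes

mysql> CREATE TABLE t_dynamic (c1 VARCHAR(500)) ENGINE=InnoDB
       ROW_FORMAT=DYNAMIC CHARACTER SET utf8mb4;

mysql> CREATE INDEX idx2 ON t_dynamic (c1(255));   -- 1020 bytes < 3072, succeeds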

Restrictions on InnoDB Tables
  • ANALYZE TABLE determines index cardinality (as displayed in the Cardinality column of SHOW INDEX output) by performing random dives on each of the index trees and updating index cardinality estimates accordingly. Because these are only estimates, repeated runs of ANALYZE TABLE could produce different numbers. This makes ANALYZE TABLE fast on InnoDB tables but not 100% accurate because it does not take all rows into account.

    You can make the statistics collected by ANALYZE TABLE more precise and more stable by enabling the innodb_stats_persistent configuration option, as described in 15.6.11.1 节, “Configuring Persistent Optimizer Statistics Parameters”. When that setting is enabled, it is important to run ANALYZE TABLE after major changes to indexed column data, because the statistics are not recalculated periodically (such as after a server restart).

    If the persistent statistics setting is enabled, you can change the number of random dives by modifying the innodb_stats_persistent_sample_pages system variable. If the setting is disabled, modify the innodb_stats_transient_sample_pages system variable instead.

    MySQL uses index cardinality estimates in join optimization. If a join is not optimized in the right way, try using ANALYZE TABLE. In the few cases where ANALYZE TABLE does not produce values good enough for your particular tables, you can use FORCE INDEX with your queries to force the use of a particular index, or set the max_seeks_for_key system variable to ensure that MySQL prefers index lookups over table scans. See B.5.5 节, “Optimizer-Related Issues”.

  • If statements or transactions are running on a table when an ANALYZE TABLE operation is run on it, followed by a second ANALYZE TABLE operation, the second ANALYZE TABLE is blocked until the statements or transactions are completed. This behavior occurs because ANALYZE TABLE marks the currently loaded table definition as obsolete when it finishes. New statements or transactions (including a second ANALYZE TABLE statement) must load the new table definition into the table cache, which cannot occur until currently running statements or transactions are completed and the old table definition is purged. Loading multiple concurrent table definitions is not supported.

  • SHOW TABLE STATUS does not give accurate statistics on InnoDB tables except for the physical size reserved by the table. The row count is only a rough estimate used in SQL optimization.

  • InnoDB does not keep an internal count of rows in a table, because concurrent transactions might “see” different numbers of rows at the same time. Consequently, SELECT COUNT(*) statements only count rows visible to the current transaction.

    InnoDB processes SELECT COUNT(*) statements by scanning the clustered index.

    Processing a SELECT COUNT(*) statement takes some time if index records are not entirely in the buffer pool. For a faster count, you can create a counter table and let your application update it according to the inserts and deletes it does. However, this method may not scale well if thousands of concurrent transactions are initiating updates to the same counter table. If an approximate row count is sufficient, you can use SHOW TABLE STATUS.

    InnoDB handles SELECT COUNT(*) and SELECT COUNT(1) operations in the same way; there is no performance difference between them.

  • On Windows, InnoDB always stores database and table names internally in lowercase. To move databases in a binary format from Unix to Windows or from Windows to Unix, create all databases and tables using lowercase names.

  • An AUTO_INCREMENT column ai_col must be defined as part of an index such that it is possible to perform the equivalent of an indexed SELECT MAX(ai_col) lookup on the table to obtain the maximum column value. Typically, this is achieved by making the column the first column of some table index.

  • InnoDB sets an exclusive lock on the end of the index associated with the AUTO_INCREMENT column while initializing a previously specified AUTO_INCREMENT column on a table.

    With innodb_autoinc_lock_mode=0, InnoDB uses a special AUTO-INC table lock mode where the lock is obtained and held to the end of the current SQL statement while accessing the auto-increment counter. Other clients cannot insert into the table while the AUTO-INC table lock is held. The same behavior occurs for “bulk inserts” with innodb_autoinc_lock_mode=1. Table-level AUTO-INC locks are not used with innodb_autoinc_lock_mode=2. For more information, see 15.8.1.5 节, “AUTO_INCREMENT Handling in InnoDB”.

  • When an AUTO_INCREMENT integer column runs out of values, a subsequent INSERT operation returns a duplicate-key error. This is general MySQL behavior, similar to how MyISAM works.

  • DELETE FROM tbl_name does not regenerate the table but instead deletes all rows, one by one.

  • Cascaded foreign key actions do not activate triggers.

  • You cannot create a table with a column name that matches the name of an internal InnoDB column (including DB_ROW_ID, DB_TRX_ID, DB_ROLL_PTR, and DB_MIX_ID). This restriction applies to use of the names in any letter case.

    mysql> CREATE TABLE t1 (c1 INT, db_row_id INT) ENGINE=INNODB;
    ERROR 1166 (42000): Incorrect column name 'db_row_id'
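The counter-table approach mentioned above for faster row counts can be sketched as follows (table and column names are illustrative, and the single-row design does not scale under heavy concurrent updates):

mysql> CREATE TABLE row_counter (cnt BIGINT NOT NULL) ENGINE=InnoDB;

mysql> INSERT INTO row_counter VALUES (0);

-- The application wraps each data change and the counter
-- update in one transaction:
mysql> START TRANSACTION;
mysql> INSERT INTO t1 VALUES (...);
mysql> UPDATE row_counter SET cnt = cnt + 1;
mysql> COMMIT;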
    
Locking and Transactions
  • LOCK TABLES acquires two locks on each table if innodb_table_locks=1 (the default). In addition to a table lock on the MySQL layer, it also acquires an InnoDB table lock. Versions of MySQL before 4.1.2 did not acquire InnoDB table locks; the old behavior can be selected by setting innodb_table_locks=0. If no InnoDB table lock is acquired, LOCK TABLES completes even if some records of the tables are being locked by other transactions.

    In MySQL 8.0, innodb_table_locks=0 has no effect for tables locked explicitly with LOCK TABLES ... WRITE. It does have an effect for tables locked for read or write by LOCK TABLES ... WRITE implicitly (for example, through triggers) or by LOCK TABLES ... READ.

  • All InnoDB locks held by a transaction are released when the transaction is committed or aborted. Thus, it does not make much sense to invoke LOCK TABLES on InnoDB tables in autocommit=1 mode, because the acquired InnoDB table locks would be released immediately.

  • You cannot lock additional tables in the middle of a transaction, because LOCK TABLES performs an implicit COMMIT and UNLOCK TABLES.

  • The limit on data-modifying transactions is 96 * 1023 concurrent transactions that generate undo records. 32 of the 128 rollback segments are allocated to non-redo logs for transactions that modify temporary tables and related objects, which means that the maximum number of concurrent data-modifying transactions is 96K. The 96K limit assumes that transactions do not modify temporary tables. If all data-modifying transactions also modify temporary tables, the limit is 32K.

15.8.2 InnoDB Indexes

This section covers topics related to InnoDB indexes.

15.8.2.1 Clustered and Secondary Indexes

Each InnoDB table has a special index called the clustered index that stores row data. Typically, the clustered index is synonymous with the primary key. To get the best performance from queries, inserts, and other database operations, it is important to understand how InnoDB uses the clustered index to optimize the most common lookup and DML operations on each table.

  • When you define a PRIMARY KEY on a table, InnoDB uses it as the clustered index. Define a primary key for each table that you create. If there is no logical unique and non-null column or set of columns, add a new auto-increment column, whose values are filled in automatically.

  • If you do not define a PRIMARY KEY for a table, InnoDB uses the first UNIQUE index with all key columns defined as NOT NULL as the clustered index.

  • If a table has no PRIMARY KEY or suitable UNIQUE index, InnoDB internally generates a hidden clustered index on a synthetic column containing row ID values. The rows are ordered by the ID that InnoDB assigns. The row ID is a 6-byte field that increases monotonically as new rows are inserted. Thus, the rows ordered by the row ID are physically in insertion order.
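The three rules above can be illustrated as follows (table names are illustrative):

-- Explicit primary key: used as the clustered index.
mysql> CREATE TABLE t_pk (id INT NOT NULL PRIMARY KEY, v INT) ENGINE=InnoDB;

-- No primary key, but a UNIQUE index on a NOT NULL column:
-- InnoDB uses uk as the clustered index.
mysql> CREATE TABLE t_uk (id INT NOT NULL, v INT, UNIQUE KEY uk (id)) ENGINE=InnoDB;

-- Neither: InnoDB generates a hidden clustered index
-- on a synthetic 6-byte row ID column.
mysql> CREATE TABLE t_none (v INT) ENGINE=InnoDB;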

How the Clustered Index Speeds Up Queries

Accessing a row through the clustered index is fast because the index search leads directly to the page with all the row data. If a table is large, the clustered index architecture often saves a disk I/O operation when compared to storage organizations that store row data using a different page from the index record. (For example, MyISAM uses one file for data rows and another for index records.)

How Secondary Indexes Relate to the Clustered Index

All indexes other than the clustered index are known as secondary indexes. In InnoDB, each record in a secondary index contains the primary key columns for the row, as well as the columns specified for the secondary index. InnoDB uses this primary key value to search for the row in the clustered index.

If the primary key is long, the secondary indexes use more space, so it is advantageous to have a short primary key.

For guidelines on taking advantage of InnoDB clustered and secondary indexes, see 8.3 节, “Optimization and Indexes”.

15.8.2.2 The Physical Structure of an InnoDB Index

With the exception of spatial indexes, InnoDB indexes are B-tree data structures. Spatial indexes use R-trees, which are specialized data structures for indexing multi-dimensional data. Index records are stored in the leaf pages of their B-tree or R-tree data structure. The default size of an index page is 16KB.

When new records are inserted into an InnoDB clustered index, InnoDB tries to leave 1/16 of the page free for future insertions and updates of the index records. If index records are inserted in a sequential order (ascending or descending), the resulting index pages are about 15/16 full. If records are inserted in a random order, the pages are from 1/2 to 15/16 full.

InnoDB performs a bulk load when creating or rebuilding B-tree indexes. This method of index creation is known as a sorted index build. The innodb_fill_factor configuration option defines the percentage of space on each B-tree page that is filled during a sorted index build, with the remaining space reserved for future index growth. Sorted index builds are not supported for spatial indexes. For more information, see 15.8.2.3 节, “Sorted Index Builds”. An innodb_fill_factor setting of 100 leaves 1/16 of the space in clustered index pages free for future index growth.

If the fill factor of an InnoDB index page falls below the MERGE_THRESHOLD, which is 50% by default if not specified, InnoDB tries to contract the index tree to free the page. The MERGE_THRESHOLD setting applies to both B-tree and R-tree indexes. For more information, see 15.6.12 节, “Configuring the Merge Threshold for Index Pages”.
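For example (a sketch; the table name is illustrative, and the referenced section covers index-level settings as well), MERGE_THRESHOLD can be set for an entire table through the table COMMENT:

mysql> CREATE TABLE t_merge (id INT, KEY id_index (id))
       COMMENT='MERGE_THRESHOLD=45' ENGINE=InnoDB;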

You can define the page size for all InnoDB tablespaces in a MySQL instance by setting the innodb_page_size configuration option prior to initializing the MySQL instance. Once the page size for an instance is defined, you cannot change it without reinitializing the instance. Supported sizes are 64KB, 32KB, 16KB (the default), 8KB, and 4KB, corresponding to the option values 64k, 32k, 16k, 8k, and 4k.

A MySQL instance using a particular InnoDB page size cannot use data files or log files from an instance that uses a different page size.

15.8.2.3 Sorted Index Builds

InnoDB performs a bulk load instead of inserting one index record at a time when creating or rebuilding indexes. This method of index creation is also known as a sorted index build. Sorted index builds are not supported for spatial indexes.

There are three phases to an index build. In the first phase, the clustered index is scanned, and index entries are generated and added to the sort buffer. When the sort buffer becomes full, the entries are sorted and written out to a temporary intermediate file. This process is also known as a “run”. In the second phase, with one or more runs written to the temporary intermediate file, a merge sort is performed on all entries in the file. In the third and final phase, the sorted entries are inserted into the B-tree.

Prior to the introduction of sorted index builds, index entries were inserted into the B-tree one record at a time using insert APIs. This method involved opening a B-tree cursor to find the insert position and then inserting entries into a B-tree page using an optimistic insert. If an insert failed due to a page being full, a pessimistic insert would be performed, which involves opening a B-tree cursor and splitting and merging B-tree nodes as necessary to find space for the entry. The drawbacks of this top-down method of building an index are the cost of searching for an insert position and the constant splitting and merging of B-tree nodes.

Sorted index builds use a bottom-up approach to building an index. With this approach, a reference to the right-most leaf page is held at all levels of the B-tree. The right-most leaf page at the necessary B-tree depth is allocated and entries are inserted according to their sorted order. Once a leaf page is full, a node pointer is appended to the parent page and a sibling leaf page is allocated for the next insert. This process continues until all entries are inserted, which may result in inserts up to the root level. When a sibling page is allocated, the reference to the previously pinned leaf page is released, and the newly allocated leaf page becomes the right-most leaf page and new default insert location.

Reserving B-tree Page Space for Future Index Growth

To set aside space for future index growth, you can use the innodb_fill_factor configuration option to reserve a percentage of B-tree page space. For example, setting innodb_fill_factor to 80 reserves 20 percent of the space in B-tree pages during a sorted index build. This setting applies to both B-tree leaf and non-leaf pages. It does not apply to external pages used for TEXT or BLOB entries. The amount of space that is reserved may not be exactly as configured, as the innodb_fill_factor value is interpreted as a hint rather than a hard limit.
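For example (a sketch; the table and index names are illustrative), innodb_fill_factor is a dynamic global variable and can be set before an index is built or rebuilt:

mysql> SET GLOBAL innodb_fill_factor = 80;

-- A subsequent sorted index build, such as ADD INDEX,
-- leaves about 20 percent of each B-tree page free:
mysql> ALTER TABLE t1 ADD INDEX idx_c2 (c2);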

Sorted Index Builds and Full-Text Index Support

Sorted index builds are supported for fulltext indexes. Previously, SQL was used to insert entries into a fulltext index.

Sorted Index Builds and Compressed Tables

For compressed tables, the previous index creation method appended entries to both compressed and uncompressed pages. When the modification log (representing free space on the compressed page) became full, the compressed page would be recompressed. If compression failed due to a lack of space, the page would be split. With sorted index builds, entries are only appended to uncompressed pages. When an uncompressed page becomes full, it is compressed. Adaptive padding is used to ensure that compression succeeds in most cases, but if compression fails, the page is split and compression is attempted again. This process continues until compression is successful. For more information about compression of B-Tree pages, see 15.9.1.5 节, “How Compression Works for InnoDB Tables”.

Sorted Index Builds and Redo Logging

Redo logging is disabled during a sorted index build. Instead, there is a checkpoint to ensure that the index build can withstand a crash or failure. The checkpoint forces a write of all dirty pages to disk. During a sorted index build, the page cleaner thread is signaled periodically to flush dirty pages to ensure that the checkpoint operation can be processed quickly. Normally, the page cleaner thread flushes dirty pages when the number of clean pages falls below a set threshold. For sorted index builds, dirty pages are flushed promptly to reduce checkpoint overhead and to parallelize I/O and CPU activity.

Sorted Index Builds and Optimizer Statistics

Sorted index builds may result in optimizer statistics that differ from those generated by the previous method of index creation. The difference in statistics, which is not expected to affect workload performance, is due to the different algorithm used to populate the index.

15.8.2.4 InnoDB FULLTEXT Indexes

Full-text indexes are created on text-based columns (CHAR, VARCHAR, or TEXT columns) to help speed up queries and DML operations on data contained within those columns, omitting any words that are defined as stopwords.

A full-text index is defined as part of a CREATE TABLE statement or added to an existing table using ALTER TABLE or CREATE INDEX.

Full-text searching is performed using MATCH() ... AGAINST syntax. For usage information, see 12.9 节, “Full-Text Search Functions”.

InnoDB FULLTEXT indexes are described under the following topics in this section:

InnoDB Full-Text Index Design

InnoDB FULLTEXT indexes have an inverted index design. Inverted indexes store a list of words, and for each word, a list of documents that the word appears in. To support proximity search, position information for each word is also stored, as a byte offset.

InnoDB Full-Text Index Tables

When creating an InnoDB FULLTEXT index, a set of index tables is created, as shown in the following example:

mysql> CREATE TABLE opening_lines (
       id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
       opening_line TEXT(500),
       author VARCHAR(200),
       title VARCHAR(200),
       FULLTEXT idx (opening_line)
       ) ENGINE=InnoDB;

mysql> SELECT table_id, name, space from INFORMATION_SCHEMA.INNODB_SYS_TABLES
       WHERE name LIKE 'test/%';
+----------+----------------------------------------------------+-------+
| table_id | name                                               | space |
+----------+----------------------------------------------------+-------+
|      333 | test/fts_0000000000000147_00000000000001c9_index_1 |   289 |
|      334 | test/fts_0000000000000147_00000000000001c9_index_2 |   290 |
|      335 | test/fts_0000000000000147_00000000000001c9_index_3 |   291 |
|      336 | test/fts_0000000000000147_00000000000001c9_index_4 |   292 |
|      337 | test/fts_0000000000000147_00000000000001c9_index_5 |   293 |
|      338 | test/fts_0000000000000147_00000000000001c9_index_6 |   294 |
|      330 | test/fts_0000000000000147_being_deleted            |   286 |
|      331 | test/fts_0000000000000147_being_deleted_cache      |   287 |
|      332 | test/fts_0000000000000147_config                   |   288 |
|      328 | test/fts_0000000000000147_deleted                  |   284 |
|      329 | test/fts_0000000000000147_deleted_cache            |   285 |
|      327 | test/opening_lines                                 |   283 |
+----------+----------------------------------------------------+-------+ 

The first six tables represent the inverted index and are referred to as auxiliary index tables. When incoming documents are tokenized, the individual words (also referred to as tokens) are inserted into the index tables along with position information and the associated Document ID (DOC_ID). The words are fully sorted and partitioned among the six index tables based on the character set sort weight of the word's first character.

The inverted index is partitioned into six auxiliary index tables to support parallel index creation. By default, two threads tokenize, sort, and insert words and associated data into the index tables. The number of threads is configurable using the innodb_ft_sort_pll_degree option. Consider increasing the number of threads when creating FULLTEXT indexes on large tables.

Auxiliary index table names are prefixed with fts_ and postfixed with index_*. Each index table is associated with the indexed table by a hex value in the index table name that matches the table_id of the indexed table. For example, the table_id of the test/opening_lines table is 327, for which the hex value is 0x147. As shown in the preceding example, the 147 hex value appears in the names of index tables that are associated with the test/opening_lines table.

A hex value representing the index_id of the FULLTEXT index also appears in auxiliary index table names. For example, in the auxiliary table name test/fts_0000000000000147_00000000000001c9_index_1, the hex value 1c9 has a decimal value of 457. The index defined on the opening_lines table (idx) can be identified by querying the INFORMATION_SCHEMA.INNODB_SYS_INDEXES table for this value (457).

mysql> SELECT index_id, name, table_id, space from INFORMATION_SCHEMA.INNODB_SYS_INDEXES
       WHERE index_id=457;
+----------+------+----------+-------+
| index_id | name | table_id | space |
+----------+------+----------+-------+
|      457 | idx  |      327 |   283 |
+----------+------+----------+-------+

Index tables are stored in their own tablespace if the primary table is created in a file-per-table tablespace.

The other index tables shown in the preceding example are used for deletion handling and for storing the internal state of the FULLTEXT index.

  • fts_*_deleted and fts_*_deleted_cache

    Contain the document IDs (DOC_ID) for documents that are deleted but whose data is not yet removed from the full-text index. The fts_*_deleted_cache is the in-memory version of the fts_*_deleted table.

  • fts_*_being_deleted and fts_*_being_deleted_cache

    Contain the document IDs (DOC_ID) for documents that are deleted and whose data is currently in the process of being removed from the full-text index. The fts_*_being_deleted_cache table is the in-memory version of the fts_*_being_deleted table.

  • fts_*_config

    Stores information about the internal state of the FULLTEXT index. Most importantly, it stores the FTS_SYNCED_DOC_ID, which identifies documents that have been parsed and flushed to disk. In case of crash recovery, FTS_SYNCED_DOC_ID values are used to identify documents that have not been flushed to disk so that the documents can be re-parsed and added back to the FULLTEXT index cache. To view the data in this table, query the INFORMATION_SCHEMA.INNODB_FT_CONFIG table.

InnoDB Full-Text Index Cache

When a document is inserted, it is tokenized, and the individual words and associated data are inserted into the FULLTEXT index. This process, even for small documents, could result in numerous small insertions into the auxiliary index tables, making concurrent access to these tables a point of contention. To avoid this problem, InnoDB uses a FULLTEXT index cache to temporarily cache index table insertions for recently inserted rows. This in-memory cache structure holds insertions until the cache is full and then batch flushes them to disk (to the auxiliary index tables). You can query the INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE table to view tokenized data for recently inserted rows.

The caching and batch flushing behavior avoids frequent updates to auxiliary index tables, which could result in concurrent access issues during busy insert and update times. The batching technique also avoids multiple insertions for the same word, and minimizes duplicate entries. Instead of flushing each word individually, insertions for the same word are merged and flushed to disk as a single entry, improving insertion efficiency while keeping auxiliary index tables as small as possible.

The innodb_ft_cache_size variable is used to configure the full-text index cache size (on a per-table basis), which affects how often the full-text index cache is flushed. You can also define a global full-text index cache size limit for all tables in a given instance using the innodb_ft_total_cache_size option.

The full-text index cache stores the same information as auxiliary index tables. However, the full-text index cache only caches tokenized data for recently inserted rows. The data that is already flushed to disk (to the full-text auxiliary tables) is not brought back into the full-text index cache when queried. The data in auxiliary index tables is queried directly, and results from the auxiliary index tables are merged with results from the full-text index cache before being returned.
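As a sketch of inspecting the cache (the table name follows the earlier opening_lines example), set innodb_ft_aux_table to the indexed table and query the cache table:

mysql> SET GLOBAL innodb_ft_aux_table = 'test/opening_lines';

mysql> SELECT word, doc_count, doc_id, position
       FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE LIMIT 5;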

InnoDB Full-Text Index Document ID and FTS_DOC_ID Column

InnoDB uses a unique document identifier referred to as a Document ID (DOC_ID) to map words in the full-text index to document records where the word appears. The mapping requires an FTS_DOC_ID column on the indexed table. If an FTS_DOC_ID column is not defined, InnoDB automatically adds a hidden FTS_DOC_ID column when the full-text index is created. The following example demonstrates this behavior.

The following table definition does not include an FTS_DOC_ID column:

mysql> CREATE TABLE opening_lines (
       id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
       opening_line TEXT(500),
       author VARCHAR(200),
       title VARCHAR(200)
       ) ENGINE=InnoDB;   

When you create a full-text index on the table using CREATE FULLTEXT INDEX syntax, a warning is returned which reports that InnoDB is rebuilding the table to add the FTS_DOC_ID column.

mysql> CREATE FULLTEXT INDEX idx ON opening_lines(opening_line);
Query OK, 0 rows affected, 1 warning (0.19 sec)
Records: 0  Duplicates: 0  Warnings: 1

mysql> SHOW WARNINGS;
+---------+------+--------------------------------------------------+
| Level   | Code | Message                                          |
+---------+------+--------------------------------------------------+
| Warning |  124 | InnoDB rebuilding table to add column FTS_DOC_ID |
+---------+------+--------------------------------------------------+

The same warning is returned when using ALTER TABLE to add a full-text index to a table that does not have an FTS_DOC_ID column. If you create a full-text index at CREATE TABLE time and do not specify an FTS_DOC_ID column, InnoDB adds a hidden FTS_DOC_ID column, without warning.

Defining an FTS_DOC_ID column at CREATE TABLE time is less expensive than creating a full-text index on a table that is already loaded with data. If an FTS_DOC_ID column is defined on a table prior to loading data, the table and its indexes do not have to be rebuilt to add the new column. If you are not concerned with CREATE FULLTEXT INDEX performance, leave out the FTS_DOC_ID column to have InnoDB create it for you. InnoDB creates a hidden FTS_DOC_ID column along with a unique index (FTS_DOC_ID_INDEX) on the FTS_DOC_ID column. If you want to create your own FTS_DOC_ID column, the column must be defined as BIGINT UNSIGNED NOT NULL and named FTS_DOC_ID (all upper case), as in the following example:

Note

The FTS_DOC_ID column does not need to be defined as an AUTO_INCREMENT column, but AUTO_INCREMENT could make loading data easier.

mysql> CREATE TABLE opening_lines (
       FTS_DOC_ID BIGINT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
       opening_line TEXT(500),
       author VARCHAR(200),
       title VARCHAR(200)
       ) ENGINE=InnoDB;    

If you choose to define the FTS_DOC_ID column yourself, you are responsible for managing the column to avoid empty or duplicate values. FTS_DOC_ID values cannot be reused, which means FTS_DOC_ID values must be ever increasing.

Optionally, you can create the required unique FTS_DOC_ID_INDEX (all upper case) on the FTS_DOC_ID column.

mysql> CREATE UNIQUE INDEX FTS_DOC_ID_INDEX on opening_lines(FTS_DOC_ID);

If you do not create the FTS_DOC_ID_INDEX, InnoDB creates it automatically.

Note

FTS_DOC_ID_INDEX cannot be defined as a descending index because the InnoDB SQL parser does not use descending indexes.

The permitted gap between the largest used FTS_DOC_ID value and new FTS_DOC_ID value is 65535.

InnoDB Full-Text Index Deletion Handling

Deleting a record that has a full-text index column could result in numerous small deletions in the auxiliary index tables, making concurrent access to these tables a point of contention. To avoid this problem, the Document ID (DOC_ID) of a deleted document is logged in a special FTS_*_DELETED table whenever a record is deleted from an indexed table, and the indexed record remains in the full-text index. Before returning query results, information in the FTS_*_DELETED table is used to filter out deleted Document IDs. The benefit of this design is that deletions are fast and inexpensive. The drawback is that the size of the index is not immediately reduced after deleting records. To remove full-text index entries for deleted records, run OPTIMIZE TABLE on the indexed table with innodb_optimize_fulltext_only=ON to rebuild the full-text index. For more information, see Optimizing InnoDB Full-Text Indexes.
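The rebuild described above can be sketched as follows (using the opening_lines table from the earlier examples):

mysql> SET GLOBAL innodb_optimize_fulltext_only = ON;

mysql> OPTIMIZE TABLE opening_lines;

-- Restore the default afterward so that OPTIMIZE TABLE
-- performs a regular rebuild for other tables:
mysql> SET GLOBAL innodb_optimize_fulltext_only = OFF;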

InnoDB Full-Text Index Transaction Handling

InnoDB FULLTEXT indexes have special transaction handling characteristics due to their caching and batch processing behavior. Specifically, updates and insertions on a full-text index are processed at transaction commit time, which means that a full-text search can only see committed data. The following example demonstrates this behavior: the full-text search returns a result only after the inserted rows are committed.

mysql> CREATE TABLE opening_lines (
       id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
       opening_line TEXT(500),
       author VARCHAR(200),
       title VARCHAR(200),
       FULLTEXT idx (opening_line)
       ) ENGINE=InnoDB;

mysql> BEGIN;

mysql> INSERT INTO opening_lines(opening_line,author,title) VALUES
       ('Call me Ishmael.','Herman Melville','Moby-Dick'),
       ('A screaming comes across the sky.','Thomas Pynchon','Gravity\'s Rainbow'),
       ('I am an invisible man.','Ralph Ellison','Invisible Man'),
       ('Where now? Who now? When now?','Samuel Beckett','The Unnamable'),
       ('It was love at first sight.','Joseph Heller','Catch-22'),
       ('All this happened, more or less.','Kurt Vonnegut','Slaughterhouse-Five'),
       ('Mrs. Dalloway said she would buy the flowers herself.','Virginia Woolf','Mrs. Dalloway'),
       ('It was a pleasure to burn.','Ray Bradbury','Fahrenheit 451');

mysql> SELECT COUNT(*) FROM opening_lines WHERE MATCH(opening_line) AGAINST('Ishmael');
+----------+
| COUNT(*) |
+----------+
|        0 |
+----------+

mysql> COMMIT;

mysql> SELECT COUNT(*) FROM opening_lines WHERE MATCH(opening_line) AGAINST('Ishmael');
+----------+
| COUNT(*) |
+----------+
|        1 |
+----------+
Monitoring InnoDB Full-Text Indexes

You can monitor and examine the special text-processing aspects of InnoDB FULLTEXT indexes by querying the InnoDB full-text INFORMATION_SCHEMA tables.

You can also view basic information for FULLTEXT indexes and tables by querying INNODB_SYS_INDEXES and INNODB_SYS_TABLES.

For more information, see 15.14.4 节, “InnoDB INFORMATION_SCHEMA FULLTEXT Index Tables”.

15.9 InnoDB Table and Page Compression

This section provides information about the InnoDB table compression and InnoDB page compression features. The page compression feature is also referred to as transparent page compression.

Using the compression features of InnoDB, you can create tables where the data is stored in compressed form. Compression can help to improve both raw performance and scalability. The compression means less data is transferred between disk and memory, and takes up less space on disk and in memory. The benefits are amplified for tables with secondary indexes, because index data is compressed also. Compression can be especially important for SSD storage devices, because they tend to have lower capacity than HDD devices.

15.9.1 InnoDB Table Compression

This section describes InnoDB table compression, which is supported with InnoDB tables that reside in file-per-table tablespaces or general tablespaces. Table compression is enabled using the ROW_FORMAT=COMPRESSED attribute with CREATE TABLE or ALTER TABLE.

15.9.1.1 Overview of Table Compression

Because processors and cache memories have increased in speed more than disk storage devices, many workloads are disk-bound. Data compression enables smaller database size, reduced I/O, and improved throughput, at the small cost of increased CPU utilization. Compression is especially valuable for read-intensive applications, on systems with enough RAM to keep frequently used data in memory.

An InnoDB table created with ROW_FORMAT=COMPRESSED can use a smaller page size on disk than the configured innodb_page_size value. Smaller pages require less I/O to read from and write to disk, which is especially valuable for SSD devices.

The compressed page size is specified through the CREATE TABLE or ALTER TABLE KEY_BLOCK_SIZE parameter. The different page size requires that the table be placed in a file-per-table tablespace or general tablespace rather than in the system tablespace, as the system tablespace cannot store compressed tables. For more information, see 15.7.4 节, “InnoDB File-Per-Table Tablespaces”, and 15.7.9 节, “InnoDB General Tablespaces”.

The level of compression is the same regardless of the KEY_BLOCK_SIZE value. As you specify smaller values for KEY_BLOCK_SIZE, you get the I/O benefits of increasingly smaller pages. But if you specify a value that is too small, there is additional overhead to reorganize the pages when data values cannot be compressed enough to fit multiple rows in each page. There is a hard limit on how small KEY_BLOCK_SIZE can be for a table, based on the lengths of the key columns for each of its indexes. Specify a value that is too small, and the CREATE TABLE or ALTER TABLE statement fails.

In the buffer pool, the compressed data is held in small pages, with a page size based on the KEY_BLOCK_SIZE value. For extracting or updating the column values, MySQL also creates an uncompressed page in the buffer pool with the uncompressed data. Within the buffer pool, any updates to the uncompressed page are also re-written back to the equivalent compressed page. You might need to size your buffer pool to accommodate the additional data of both compressed and uncompressed pages, although the uncompressed pages are evicted from the buffer pool when space is needed, and then uncompressed again on the next access.

15.9.1.2 Creating Compressed Tables

Compressed tables can be created in file-per-table tablespaces or in general tablespaces. Table compression is not available for the InnoDB system tablespace. The system tablespace (space 0, the .ibdata files) can contain user-created tables, but it also contains internal system data, which is never compressed. Thus, compression applies only to tables (and indexes) stored in file-per-table or general tablespaces.

Creating a Compressed Table in File-Per-Table Tablespace

To create a compressed table in a file-per-table tablespace, innodb_file_per_table must be enabled (the default). You can set this parameter in the MySQL configuration file (my.cnf or my.ini) or dynamically, using a SET statement.

After the innodb_file_per_table option is configured, specify the ROW_FORMAT=COMPRESSED clause or KEY_BLOCK_SIZE clause, or both, in a CREATE TABLE or ALTER TABLE statement to create a compressed table in a file-per-table tablespace.

For example, you might use the following statements:

SET GLOBAL innodb_file_per_table=1;
CREATE TABLE t1
 (c1 INT PRIMARY KEY)
 ROW_FORMAT=COMPRESSED
 KEY_BLOCK_SIZE=8;
Creating a Compressed Table in a General Tablespace

To create a compressed table in a general tablespace, FILE_BLOCK_SIZE must be defined for the general tablespace, which is specified when the tablespace is created. The FILE_BLOCK_SIZE value must be a valid compressed page size in relation to the innodb_page_size value, and the page size of the compressed table, defined by the CREATE TABLE or ALTER TABLE KEY_BLOCK_SIZE clause, must be equal to FILE_BLOCK_SIZE/1024. For example, if innodb_page_size=16384 and FILE_BLOCK_SIZE=8192, the KEY_BLOCK_SIZE of the table must be 8. For more information, see 15.7.9 节, “InnoDB General Tablespaces”.

The following example demonstrates creating a general tablespace and adding a compressed table. The example assumes a default innodb_page_size of 16K. The FILE_BLOCK_SIZE of 8192 requires that the compressed table have a KEY_BLOCK_SIZE of 8.

mysql> CREATE TABLESPACE `ts2` ADD DATAFILE 'ts2.ibd' FILE_BLOCK_SIZE = 8192 Engine=InnoDB;
Query OK, 0 rows affected (0.01 sec)

mysql> CREATE TABLE t4 (c1 INT PRIMARY KEY) TABLESPACE ts2 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
Query OK, 0 rows affected (0.00 sec)
Notes
  • If you specify ROW_FORMAT=COMPRESSED, you can omit KEY_BLOCK_SIZE; the KEY_BLOCK_SIZE setting defaults to half the innodb_page_size value.

  • If you specify a valid KEY_BLOCK_SIZE value, you can omit ROW_FORMAT=COMPRESSED; compression is enabled automatically.

  • To determine the best value for KEY_BLOCK_SIZE, typically you create several copies of the same table with different values for this clause, then measure the size of the resulting .ibd files and see how well each performs with a realistic workload. For general tablespaces, keep in mind that dropping a table does not reduce the size of the general tablespace .ibd file, nor does it return disk space to the operating system. For more information, see 15.7.9 节, “InnoDB General Tablespaces”.

  • The KEY_BLOCK_SIZE value is treated as a hint; a different size could be used by InnoDB if necessary. For file-per-table tablespaces, the KEY_BLOCK_SIZE can only be less than or equal to the innodb_page_size value. If you specify a value greater than the innodb_page_size value, the specified value is ignored, a warning is issued, and KEY_BLOCK_SIZE is set to half of the innodb_page_size value. If innodb_strict_mode=ON, specifying an invalid KEY_BLOCK_SIZE value returns an error. For general tablespaces, valid KEY_BLOCK_SIZE values depend on the FILE_BLOCK_SIZE setting of the tablespace. For more information, see 15.7.9 节, “InnoDB General Tablespaces”.

  • InnoDB supports 32k and 64k page sizes but these page sizes do not support compression. For more information, refer to the innodb_page_size documentation.

  • The default uncompressed size of InnoDB data pages is 16KB. Depending on the combination of option values, MySQL uses a page size of 1KB, 2KB, 4KB, 8KB, or 16KB for the tablespace data file (.ibd file). The actual compression algorithm is not affected by the KEY_BLOCK_SIZE value; the value determines how large each compressed chunk is, which in turn affects how many rows can be packed into each compressed page.

  • When creating a compressed table in a file-per-table tablespace, setting KEY_BLOCK_SIZE equal to the InnoDB page size does not typically result in much compression. For example, setting KEY_BLOCK_SIZE=16 typically would not result in much compression, since the normal InnoDB page size is 16KB. This setting may still be useful for tables with many long BLOB, VARCHAR or TEXT columns, because such values often do compress well, and might therefore require fewer overflow pages as described in 15.9.1.5 节, “How Compression Works for InnoDB Tables”. For general tablespaces, a KEY_BLOCK_SIZE value equal to the InnoDB page size is not permitted. For more information, see 15.7.9 节, “InnoDB General Tablespaces”.

  • All indexes of a table (including the clustered index) are compressed using the same page size, as specified in the CREATE TABLE or ALTER TABLE statement. Table attributes such as ROW_FORMAT and KEY_BLOCK_SIZE are not part of the CREATE INDEX syntax for InnoDB tables, and are ignored if they are specified (although, if specified, they will appear in the output of the SHOW CREATE TABLE statement).

  • For performance-related configuration options, see 15.9.1.3 节, “Tuning Compression for InnoDB Tables”.
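
As a quick illustration of the first two notes, either clause alone is sufficient to enable compression. The following is a sketch assuming innodb_file_per_table is enabled and the default 16K page size; the table names are illustrative:

CREATE TABLE t2 (c1 INT PRIMARY KEY) ROW_FORMAT=COMPRESSED; -- KEY_BLOCK_SIZE defaults to 8
CREATE TABLE t3 (c1 INT PRIMARY KEY) KEY_BLOCK_SIZE=4;      -- compression enabled implicitly

In both cases, SHOW CREATE TABLE reports the resulting ROW_FORMAT and KEY_BLOCK_SIZE settings.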

Restrictions on Compressed Tables
  • Compressed tables cannot be stored in the InnoDB system tablespace.

  • General tablespaces can contain multiple tables, but compressed and uncompressed tables cannot coexist within the same general tablespace.

  • Compression applies to an entire table and all its associated indexes, not to individual rows, despite the clause name ROW_FORMAT.

  • InnoDB does not support compressed temporary tables. When innodb_strict_mode is enabled (the default), CREATE TEMPORARY TABLE returns errors if ROW_FORMAT=COMPRESSED or KEY_BLOCK_SIZE is specified. If innodb_strict_mode is disabled, warnings are issued and the temporary table is created using a non-compressed row format. The same restrictions apply to ALTER TABLE operations on temporary tables.

15.9.1.3 Tuning Compression for InnoDB Tables

Most often, the internal optimizations described in InnoDB Data Storage and Compression ensure that the system runs well with compressed data. However, because the efficiency of compression depends on the nature of your data, you can make decisions that affect the performance of compressed tables:

  • Which tables to compress.

  • What compressed page size to use.

  • Whether to adjust the size of the buffer pool based on run-time performance characteristics, such as the amount of time the system spends compressing and uncompressing data.

  • Whether the workload is more like a data warehouse (primarily queries) or an OLTP system (mix of queries and DML).

  • If the system performs DML operations on compressed tables, and the way the data is distributed leads to expensive compression failures at runtime, you might adjust additional advanced configuration options.

Use the guidelines in this section to help make those architectural and configuration choices. When you are ready to conduct long-term testing and put compressed tables into production, see 15.9.1.4 节, “Monitoring InnoDB Table Compression at Runtime” for ways to verify the effectiveness of those choices under real-world conditions.

When to Use Compression

In general, compression works best on tables that include a reasonable number of character string columns and where the data is read far more often than it is written. Because there are no guaranteed ways to predict whether or not compression benefits a particular situation, always test with a specific workload and data set running on a representative configuration. Consider the following factors when deciding which tables to compress.

Data Characteristics and Compression

A key determinant of the efficiency of compression in reducing the size of data files is the nature of the data itself. Recall that compression works by identifying repeated strings of bytes in a block of data. Completely randomized data is the worst case. Typical data often has repeated values, and so compresses effectively. Character strings often compress well, whether defined in CHAR, VARCHAR, TEXT, or BLOB columns. On the other hand, tables containing mostly binary data (integers or floating-point numbers) or data that is previously compressed (for example, JPEG or PNG images) generally do not compress well, significantly or at all.

You choose whether to turn on compression for each InnoDB table. A table and all of its indexes use the same (compressed) page size. It might be that the primary key (clustered) index, which contains the data for all columns of a table, compresses more effectively than the secondary indexes. For those cases where there are long rows, the use of compression might result in long column values being stored off-page, as discussed in 15.10.3 节, “DYNAMIC and COMPRESSED Row Formats”. Those overflow pages may compress well. Given these considerations, for many applications, some tables compress more effectively than others, and you might find that your workload performs best only with a subset of tables compressed.

To determine whether or not to compress a particular table, conduct experiments. You can get a rough estimate of how efficiently your data can be compressed by using a utility that implements LZ77 compression (such as gzip or WinZip) on a copy of the .ibd file for an uncompressed table. You can expect less compression from a MySQL compressed table than from file-based compression tools, because MySQL compresses data in chunks based on the page size, 16KB by default. In addition to user data, the page format includes some internal system data that is not compressed. File-based compression utilities can examine much larger chunks of data, and so might find more repeated strings in a huge file than MySQL can find in an individual page.
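
The rough estimate described above can be scripted from the mysql client using the \! shell escape. A sketch with illustrative paths (compress a copy, never the live data file):

-- Copy the .ibd file and compress the copy with gzip (an LZ77 implementation).
\! cp data/test/big_table.ibd /tmp/big_table_copy.ibd
\! gzip -v /tmp/big_table_copy.ibd
\! ls -l /tmp/big_table_copy.ibd.gz

Expect the MySQL compressed table to end up somewhat larger than the .gz file, for the page-based reasons described above.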

Another way to test compression on a specific table is to copy some data from your uncompressed table to a similar, compressed table (having all the same indexes) in a file-per-table tablespace and look at the size of the resulting .ibd file. For example:

use test;
set global innodb_file_per_table=1;
set autocommit=0;

-- Create an uncompressed table with a million or two rows.
create table big_table as select * from information_schema.columns;
insert into big_table select * from big_table;
insert into big_table select * from big_table;
insert into big_table select * from big_table;
insert into big_table select * from big_table;
insert into big_table select * from big_table;
insert into big_table select * from big_table;
insert into big_table select * from big_table;
insert into big_table select * from big_table;
insert into big_table select * from big_table;
insert into big_table select * from big_table;
commit;
alter table big_table add id int unsigned not null primary key auto_increment;

show create table big_table\G

select count(id) from big_table;

-- Check how much space is needed for the uncompressed table.
\! ls -l data/test/big_table.ibd

create table key_block_size_4 like big_table;
alter table key_block_size_4 key_block_size=4 row_format=compressed;

insert into key_block_size_4 select * from big_table;
commit;

-- Check how much space is needed for a compressed table
-- with particular compression settings.
\! ls -l data/test/key_block_size_4.ibd

This experiment produced the following results, which of course could vary considerably depending on your table structure and data:

-rw-rw----  1 cirrus  staff  310378496 Jan  9 13:44 data/test/big_table.ibd
-rw-rw----  1 cirrus  staff  83886080 Jan  9 15:10 data/test/key_block_size_4.ibd

To see whether compression is efficient for your particular workload, monitor the runtime statistics described in 15.9.1.4 节, “Monitoring InnoDB Table Compression at Runtime”.

Database Compression versus Application Compression

Decide whether to compress data in your application or in the table; do not use both types of compression for the same data. When you compress the data in the application and store the results in a compressed table, extra space savings are extremely unlikely, and the double compression just wastes CPU cycles.

Compressing in the Database

When enabled, MySQL table compression is automatic and applies to all columns and index values. The columns can still be tested with operators such as LIKE, and sort operations can still use indexes even when the index values are compressed. Because indexes are often a significant fraction of the total size of a database, compression could result in significant savings in storage, I/O or processor time. The compression and decompression operations happen on the database server, which likely is a powerful system that is sized to handle the expected load.

Compressing in the Application

If you compress data such as text in your application before it is inserted into the database, you might save overhead for data that does not compress well by compressing some columns and not others. This approach uses CPU cycles for compression and uncompression on the client machine rather than the database server, which might be appropriate for a distributed application with many clients, or where the client machine has spare CPU cycles.

Hybrid Approach

Of course, it is possible to combine these approaches. For some applications, it may be appropriate to use some compressed tables and some uncompressed tables. It may be best to externally compress some data (and store it in uncompressed tables) and allow MySQL to compress (some of) the other tables in the application. As always, up-front design and real-life testing are valuable in reaching the right decision.

Workload Characteristics and Compression

In addition to choosing which tables to compress (and the page size), the workload is another key determinant of performance. If the application is dominated by reads, rather than updates, fewer pages need to be reorganized and recompressed after the index page runs out of room for the per-page modification log that MySQL maintains for compressed data. If the updates predominantly change non-indexed columns or those containing BLOBs or large strings that happen to be stored off-page, the overhead of compression may be acceptable. If the only changes to a table are INSERTs that use a monotonically increasing primary key, and there are few secondary indexes, there is little need to reorganize and recompress index pages. Since MySQL can delete-mark and delete rows on compressed pages in place by modifying uncompressed data, DELETE operations on a table are relatively efficient.

For some environments, the time it takes to load data can be as important as run-time retrieval. Especially in data warehouse environments, many tables may be read-only or read-mostly. In those cases, it might or might not be acceptable to pay the price of compression in terms of increased load time, unless the resulting savings in fewer disk reads or in storage cost is significant.

Fundamentally, compression works best when CPU time is available for compressing and uncompressing data. Thus, if your workload is I/O-bound rather than CPU-bound, you might find that compression can improve overall performance. When you test your application performance with different compression configurations, test on a platform similar to the planned configuration of the production system.

Configuration Characteristics and Compression

Reading and writing database pages from and to disk is the slowest aspect of system performance. Compression attempts to reduce I/O by using CPU time to compress and uncompress data, and is most effective when I/O is a relatively scarce resource compared to processor cycles.

This is often especially the case when running in a multi-user environment with fast, multi-core CPUs. When a page of a compressed table is in memory, MySQL often uses additional memory, typically 16KB, in the buffer pool for an uncompressed copy of the page. The adaptive LRU algorithm attempts to balance the use of memory between compressed and uncompressed pages to take into account whether the workload is running in an I/O-bound or CPU-bound manner. Still, a configuration with more memory dedicated to the buffer pool tends to run better when using compressed tables than a configuration where memory is highly constrained.

Choosing the Compressed Page Size

The optimal setting of the compressed page size depends on the type and distribution of data that the table and its indexes contain. The compressed page size should always be bigger than the maximum record size, or operations may fail as noted in Compression of B-Tree Pages.

Setting the compressed page size too large wastes some space, but the pages do not have to be compressed as often. If the compressed page size is set too small, inserts or updates may require time-consuming recompression, and the B-tree nodes may have to be split more frequently, leading to bigger data files and less efficient indexing.

Typically, you set the compressed page size to 8K or 4K bytes. Given that the maximum row size for an InnoDB table is around 8K, KEY_BLOCK_SIZE=8 is usually a safe choice.

15.9.1.4 Monitoring InnoDB Table Compression at Runtime

Overall application performance, CPU and I/O utilization and the size of disk files are good indicators of how effective compression is for your application. This section builds on the performance tuning advice from 15.9.1.3 节, “Tuning Compression for InnoDB Tables”, and shows how to find problems that might not turn up during initial testing.

To dig deeper into performance considerations for compressed tables, you can monitor compression performance at runtime using the Information Schema tables described in Example 15.1, “Using the Compression Information Schema Tables”. These tables reflect the internal use of memory and the rates of compression used overall.

The INNODB_CMP table reports information about compression activity for each compressed page size (KEY_BLOCK_SIZE) in use. The information in these tables is system-wide: it summarizes the compression statistics across all compressed tables in your database. You can use this data to help decide whether or not to compress a table by examining these tables when no other compressed tables are being accessed. Querying these tables involves relatively low overhead on the server, so you might do so periodically on a production server to check the overall efficiency of the compression feature.

The INNODB_CMP_PER_INDEX table reports information about compression activity for individual tables and indexes. This information is more targeted and more useful for evaluating compression efficiency and diagnosing performance issues one table or index at a time. (Because each InnoDB table is represented as a clustered index, MySQL does not make a big distinction between tables and indexes in this context.) The INNODB_CMP_PER_INDEX table does involve substantial overhead, so it is more suitable for development servers, where you can compare the effects of different workloads, data, and compression settings in isolation. To guard against imposing this monitoring overhead by accident, you must enable the innodb_cmp_per_index_enabled configuration option before you can query the INNODB_CMP_PER_INDEX table.
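
For example, a per-index look at compression activity might be sketched as follows (column names as documented for INFORMATION_SCHEMA.INNODB_CMP_PER_INDEX):

mysql> SET GLOBAL innodb_cmp_per_index_enabled=ON;

mysql> SELECT database_name, table_name, index_name,
    ->        compress_ops, compress_ops_ok, compress_time
    ->   FROM INFORMATION_SCHEMA.INNODB_CMP_PER_INDEX
    ->  ORDER BY compress_ops DESC;

Remember to set innodb_cmp_per_index_enabled back to OFF when the investigation is finished, to avoid paying the monitoring overhead indefinitely.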

The key statistics to consider are the number of, and amount of time spent performing, compression and uncompression operations. Since MySQL splits B-tree nodes when they are too full to contain the compressed data following a modification, compare the number of successful compression operations with the number of such operations overall. Based on the information in the INNODB_CMP and INNODB_CMP_PER_INDEX tables and overall application performance and hardware resource utilization, you might make changes in your hardware configuration, adjust the size of the buffer pool, choose a different page size, or select a different set of tables to compress.

If the amount of CPU time required for compressing and uncompressing is high, changing to faster or multi-core CPUs can help improve performance with the same data, application workload and set of compressed tables. Increasing the size of the buffer pool might also help performance, so that more uncompressed pages can stay in memory, reducing the need to uncompress pages that exist in memory only in compressed form.

A large number of compression operations overall (compared to the number of INSERT, UPDATE, and DELETE operations in your application and the size of the database) could indicate that some of your compressed tables are being updated too heavily for effective compression. If so, choose a larger page size, or be more selective about which tables you compress.

If the number of successful compression operations (COMPRESS_OPS_OK) is a high percentage of the total number of compression operations (COMPRESS_OPS), then the system is likely performing well. If the ratio is low, then MySQL is reorganizing, recompressing, and splitting B-tree nodes more often than is desirable. In this case, avoid compressing some tables, or increase KEY_BLOCK_SIZE for some of the compressed tables. You might turn off compression for tables that cause the number of compression failures in your application to be more than 1% or 2% of the total. (Such a failure ratio might be acceptable during a temporary operation such as a data load.)
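
The success ratio described above can be computed directly from the INNODB_CMP table; a sketch (column names as documented for INFORMATION_SCHEMA.INNODB_CMP):

mysql> SELECT page_size, compress_ops, compress_ops_ok,
    ->        compress_ops_ok * 100.0 / NULLIF(compress_ops, 0) AS ok_pct
    ->   FROM INFORMATION_SCHEMA.INNODB_CMP;

A value of ok_pct that stays below roughly 98% for a given page size suggests compressing fewer tables or increasing KEY_BLOCK_SIZE, per the guidance above.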

15.9.1.5 How Compression Works for InnoDB Tables

This section describes some internal implementation details about compression for InnoDB tables. The information presented here may be helpful in tuning for performance, but is not necessary to know for basic use of compression.

Compression Algorithms

Some operating systems implement compression at the file system level. Files are typically divided into fixed-size blocks that are compressed into variable-size blocks, which easily leads to fragmentation. Every time something inside a block is modified, the whole block is recompressed before it is written to disk. These properties make this compression technique unsuitable for use in an update-intensive database system.

MySQL implements compression with the help of the well-known zlib library, which implements the LZ77 compression algorithm. This compression algorithm is mature, robust, and efficient in both CPU utilization and in reduction of data size. The algorithm is lossless, so that the original uncompressed data can always be reconstructed from the compressed form. LZ77 compression works by finding sequences of data that are repeated within the data to be compressed. The patterns of values in your data determine how well it compresses, but typical user data often compresses by 50% or more.

Unlike compression performed by an application, or compression features of some other database management systems, InnoDB compression applies both to user data and to indexes. In many cases, indexes can constitute 40-50% or more of the total database size, so this difference is significant. When compression is working well for a data set, the size of the InnoDB data files (the file-per-table tablespace or general tablespace .ibd files) is 25% to 50% of the uncompressed size or possibly smaller. Depending on the workload, this smaller database can in turn lead to a reduction in I/O, and an increase in throughput, at a modest cost in terms of increased CPU utilization. You can adjust the balance between compression level and CPU overhead by modifying the innodb_compression_level configuration option.

InnoDB Data Storage and Compression

All user data in InnoDB tables is stored in pages comprising a B-tree index (the clustered index). In some other database systems, this type of index is called an index-organized table. Each row in the index node contains the values of the (user-specified or system-generated) primary key and all the other columns of the table.

Secondary indexes in InnoDB tables are also B-trees, containing pairs of values: the index key and a pointer to a row in the clustered index. The pointer is in fact the value of the primary key of the table, which is used to access the clustered index if columns other than the index key and primary key are required. Secondary index records must always fit on a single B-tree page.

The compression of B-tree nodes (of both clustered and secondary indexes) is handled differently from compression of overflow pages used to store long VARCHAR, BLOB, or TEXT columns, as explained in the following sections.

Compression of B-Tree Pages

Because they are frequently updated, B-tree pages require special treatment. It is important to minimize the number of times B-tree nodes are split, as well as to minimize the need to uncompress and recompress their content.

One technique MySQL uses is to maintain some system information in the B-tree node in uncompressed form, thus facilitating certain in-place updates. For example, this allows rows to be delete-marked and deleted without any compression operation.

In addition, MySQL attempts to avoid unnecessary uncompression and recompression of index pages when they are changed. Within each B-tree page, the system keeps an uncompressed modification log to record changes made to the page. Updates and inserts of small records may be written to this modification log without requiring the entire page to be completely reconstructed.

When the space for the modification log runs out, InnoDB uncompresses the page, applies the changes and recompresses the page. If recompression fails (a situation known as a compression failure), the B-tree nodes are split and the process is repeated until the update or insert succeeds.

To avoid frequent compression failures in write-intensive workloads, such as for OLTP applications, MySQL sometimes reserves some empty space (padding) in the page, so that the modification log fills up sooner and the page is recompressed while there is still enough room to avoid splitting it. The amount of padding space left in each page varies as the system keeps track of the frequency of page splits. On a busy server doing frequent writes to compressed tables, you can adjust the innodb_compression_failure_threshold_pct and innodb_compression_pad_pct_max configuration options to fine-tune this mechanism.

Generally, MySQL requires that each B-tree page in an InnoDB table can accommodate at least two records. For compressed tables, this requirement has been relaxed. Leaf pages of B-tree nodes (whether of the primary key or secondary indexes) only need to accommodate one record, but that record must fit, in uncompressed form, in the per-page modification log. If innodb_strict_mode is ON, MySQL checks the maximum row size during CREATE TABLE or CREATE INDEX. If the row does not fit, the following error message is issued: ERROR HY000: Too big row.

If you create a table when innodb_strict_mode is OFF, and a subsequent INSERT or UPDATE statement attempts to create an index entry that does not fit in the size of the compressed page, the operation fails with ERROR 42000: Row size too large. (This error message does not name the index for which the record is too large, or mention the length of the index record or the maximum record size on that particular index page.) To solve this problem, rebuild the table with ALTER TABLE and select a larger compressed page size (KEY_BLOCK_SIZE), shorten any column prefix indexes, or disable compression entirely with ROW_FORMAT=DYNAMIC or ROW_FORMAT=COMPACT.
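
The remedies listed above might be applied with ALTER TABLE; a sketch (the table name is illustrative, and each statement rebuilds the table):

ALTER TABLE t1 KEY_BLOCK_SIZE=8;    -- choose a larger compressed page size
ALTER TABLE t1 ROW_FORMAT=DYNAMIC;  -- or disable compression entirely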

innodb_strict_mode is not applicable to general tablespaces, which also support compressed tables. Tablespace management rules for general tablespaces are strictly enforced independently of innodb_strict_mode. For more information, see 13.1.16, “CREATE TABLESPACE Syntax”.

Compressing BLOB, VARCHAR, and TEXT Columns

In an InnoDB table, BLOB, VARCHAR, and TEXT columns that are not part of the primary key may be stored on separately allocated overflow pages. We refer to these columns as off-page columns. Their values are stored on singly-linked lists of overflow pages.

For tables created in ROW_FORMAT=DYNAMIC or ROW_FORMAT=COMPRESSED, the values of BLOB, TEXT, or VARCHAR columns may be stored fully off-page, depending on their length and the length of the entire row. For columns that are stored off-page, the clustered index record only contains 20-byte pointers to the overflow pages, one per column. Whether any columns are stored off-page depends on the page size and the total size of the row. When the row is too long to fit entirely within the page of the clustered index, MySQL chooses the longest columns for off-page storage until the row fits on the clustered index page. As noted above, if a row does not fit by itself on a compressed page, an error occurs.

Note

For tables created in ROW_FORMAT=DYNAMIC or ROW_FORMAT=COMPRESSED, TEXT and BLOB columns that are less than or equal to 40 bytes are always stored in-line.

Tables that use ROW_FORMAT=REDUNDANT and ROW_FORMAT=COMPACT store the first 768 bytes of BLOB, VARCHAR, and TEXT columns in the clustered index record along with the primary key. The 768-byte prefix is followed by a 20-byte pointer to the overflow pages that contain the rest of the column value.

When a table is in COMPRESSED format, all data written to overflow pages is compressed as is; that is, MySQL applies the zlib compression algorithm to the entire data item. Other than the data, compressed overflow pages contain an uncompressed header and trailer comprising a page checksum and a link to the next overflow page, among other things. Therefore, very significant storage savings can be obtained for longer BLOB, TEXT, or VARCHAR columns if the data is highly compressible, as is often the case with text data. Image data, such as JPEG, is typically already compressed and so does not benefit much from being stored in a compressed table; the double compression can waste CPU cycles for little or no space savings.

The overflow pages are of the same size as other pages. A row containing ten columns stored off-page occupies ten overflow pages, even if the total length of the columns is only 8K bytes. In an uncompressed table, ten uncompressed overflow pages occupy 160K bytes. In a compressed table with an 8K page size, they occupy only 80K bytes. Thus, it is often more efficient to use compressed table format for tables with long column values.
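The arithmetic above can be sketched as follows; the page sizes and column count are illustrative values, not values read from a server:

```python
# Illustrative sketch: each off-page column occupies at least one whole
# overflow page, so overflow cost scales with the page size.

def overflow_bytes(num_off_page_columns: int, page_size: int) -> int:
    """Minimum overflow storage for a row with the given off-page columns."""
    return num_off_page_columns * page_size

# Ten off-page columns: uncompressed 16K pages vs. an 8K compressed page size.
uncompressed = overflow_bytes(10, 16 * 1024)  # 163840 bytes (160K)
compressed = overflow_bytes(10, 8 * 1024)     # 81920 bytes (80K)
print(uncompressed, compressed)
```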

For file-per-table tablespaces, using a 16K compressed page size can reduce storage and I/O costs for BLOB, VARCHAR, or TEXT columns, because such data often compresses well, and might therefore require fewer overflow pages, even though the B-tree nodes themselves take as many pages as in the uncompressed form. General tablespaces do not support a 16K compressed page size (KEY_BLOCK_SIZE). For more information, see 15.7.9 节, “InnoDB General Tablespaces”.

Compression and the InnoDB Buffer Pool

In a compressed InnoDB table, every compressed page (whether 1K, 2K, 4K or 8K) corresponds to an uncompressed page of 16K bytes (or a smaller size if innodb_page_size is set). To access the data in a page, MySQL reads the compressed page from disk if it is not already in the buffer pool, then uncompresses the page to its original form. This section describes how InnoDB manages the buffer pool with respect to pages of compressed tables.

To minimize I/O and to reduce the need to uncompress a page, at times the buffer pool contains both the compressed and uncompressed form of a database page. To make room for other required database pages, MySQL can evict from the buffer pool an uncompressed page, while leaving the compressed page in memory. Or, if a page has not been accessed in a while, the compressed form of the page might be written to disk, to free space for other data. Thus, at any given time, the buffer pool might contain both the compressed and uncompressed forms of the page, or only the compressed form of the page, or neither.

MySQL keeps track of which pages to keep in memory and which to evict using a least-recently-used (LRU) list, so that hot (frequently accessed) data tends to stay in memory. When compressed tables are accessed, MySQL uses an adaptive LRU algorithm to achieve an appropriate balance of compressed and uncompressed pages in memory. This adaptive algorithm is sensitive to whether the system is running in an I/O-bound or CPU-bound manner. The goal is to avoid spending too much processing time uncompressing pages when the CPU is busy, and to avoid doing excess I/O when the CPU has spare cycles that can be used for uncompressing compressed pages (that may already be in memory). When the system is I/O-bound, the algorithm prefers to evict the uncompressed copy of a page rather than both copies, to make more room for other disk pages to become memory resident. When the system is CPU-bound, MySQL prefers to evict both the compressed and uncompressed page, so that more memory can be used for hot pages, reducing the need to uncompress data that is held in memory only in compressed form.

Compression and the InnoDB Redo Log Files

Before a compressed page is written to a data file, MySQL writes a copy of the page to the redo log (if it has been recompressed since the last time it was written to the database). This is done to ensure that redo logs are usable for crash recovery, even in the unlikely case that the zlib library is upgraded and that change introduces a compatibility problem with the compressed data. Therefore, some increase in the size of log files, or a need for more frequent checkpoints, can be expected when using compression. The amount of increase in the log file size or checkpoint frequency depends on the number of times compressed pages are modified in a way that requires reorganization and recompression.

To create a compressed table in a file-per-table tablespace, innodb_file_per_table must be enabled. There is no dependence on the innodb_file_per_table setting when creating a compressed table in a general tablespace. For more information, see 15.7.9 节, “InnoDB General Tablespaces”.

15.9.1.6 Compression for OLTP Workloads

Traditionally, the InnoDB compression feature was recommended primarily for read-only or read-mostly workloads, such as in a data warehouse configuration. The rise of SSD storage devices, which are fast but relatively small and expensive, makes compression attractive also for OLTP workloads: high-traffic, interactive web sites can reduce their storage requirements and their I/O operations per second (IOPS) by using compressed tables with applications that do frequent INSERT, UPDATE, and DELETE operations.

These configuration options let you adjust the way compression works for a particular MySQL instance, with an emphasis on performance and scalability for write-intensive operations:

  • innodb_compression_level lets you turn the degree of compression up or down. A higher value lets you fit more data onto a storage device, at the expense of more CPU overhead during compression. A lower value lets you reduce CPU overhead when storage space is not critical, or you expect the data is not especially compressible.

  • innodb_compression_failure_threshold_pct specifies a cutoff point for compression failures during updates to a compressed table. When this threshold is passed, MySQL begins to leave additional free space within each new compressed page, dynamically adjusting the amount of free space up to the percentage of page size specified by innodb_compression_pad_pct_max.

  • innodb_compression_pad_pct_max lets you adjust the maximum amount of space reserved within each page to record changes to compressed rows, without needing to compress the entire page again. The higher the value, the more changes can be recorded without recompressing the page. MySQL uses a variable amount of free space for the pages within each compressed table, only when a designated percentage of compression operations fail at runtime, requiring an expensive operation to split the compressed page.

  • innodb_log_compressed_pages lets you disable writing of images of re-compressed pages to the redo log. Re-compression may occur when changes are made to compressed data. This option is enabled by default to prevent corruption that could occur if a different version of the zlib compression algorithm is used during recovery. If you are certain that the zlib version will not change, disable innodb_log_compressed_pages to reduce redo log generation for workloads that modify compressed data.

Because working with compressed data sometimes involves keeping both compressed and uncompressed versions of a page in memory at the same time, when using compression with an OLTP-style workload, be prepared to increase the value of the innodb_buffer_pool_size configuration option.
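As a sketch, the options above can be adjusted at runtime with SET GLOBAL; the values shown here are illustrative, not recommendations:

```sql
-- Illustrative values only; tune for your workload.
SET GLOBAL innodb_compression_level = 4;                  -- trade compression ratio for CPU
SET GLOBAL innodb_compression_failure_threshold_pct = 10; -- start padding after 10% failures
SET GLOBAL innodb_compression_pad_pct_max = 50;           -- reserve at most 50% of each page
SET GLOBAL innodb_log_compressed_pages = OFF;             -- only if the zlib version is fixed
```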

15.9.1.7 SQL Compression Syntax Warnings and Errors

This section describes syntax warnings and errors that you may encounter when using the table compression feature with file-per-table tablespaces and general tablespaces.

SQL Compression Syntax Warnings and Errors for File-Per-Table Tablespaces

When innodb_strict_mode is enabled (the default), specifying ROW_FORMAT=COMPRESSED or KEY_BLOCK_SIZE in CREATE TABLE or ALTER TABLE statements produces the following error if innodb_file_per_table is disabled.

ERROR 1031 (HY000): Table storage engine for 't1' doesn't have this option
Note

The table is not created if the current configuration does not permit using compressed tables.

When innodb_strict_mode is disabled, specifying ROW_FORMAT=COMPRESSED or KEY_BLOCK_SIZE in CREATE TABLE or ALTER TABLE statements produces the following warnings if innodb_file_per_table is disabled.

mysql> SHOW WARNINGS;
+---------+------+---------------------------------------------------------------+
| Level   | Code | Message                                                       |
+---------+------+---------------------------------------------------------------+
| Warning | 1478 | InnoDB: KEY_BLOCK_SIZE requires innodb_file_per_table.        |
| Warning | 1478 | InnoDB: ignoring KEY_BLOCK_SIZE=4.                            |
| Warning | 1478 | InnoDB: ROW_FORMAT=COMPRESSED requires innodb_file_per_table. |
| Warning | 1478 | InnoDB: assuming ROW_FORMAT=DYNAMIC.                          |
+---------+------+---------------------------------------------------------------+
Note

These messages are only warnings, not errors, and the table is created without compression, as if the options were not specified.

The non-strict behavior lets you import a mysqldump file into a database that does not support compressed tables, even if the source database contained compressed tables. In that case, MySQL creates the table in ROW_FORMAT=DYNAMIC instead of preventing the operation.

To import the dump file into a new database, and have the tables re-created as they exist in the original database, ensure the server has the proper setting for the innodb_file_per_table configuration parameter.

The attribute KEY_BLOCK_SIZE is permitted only when ROW_FORMAT is specified as COMPRESSED or is omitted. Specifying a KEY_BLOCK_SIZE with any other ROW_FORMAT generates a warning that you can view with SHOW WARNINGS. However, the table is not compressed; the specified KEY_BLOCK_SIZE is ignored.

+---------+------+-----------------------------------------------------------------+
| Level   | Code | Message                                                         |
+---------+------+-----------------------------------------------------------------+
| Warning | 1478 | InnoDB: ignoring KEY_BLOCK_SIZE=n unless ROW_FORMAT=COMPRESSED. |
+---------+------+-----------------------------------------------------------------+

If you are running with innodb_strict_mode enabled, the combination of a KEY_BLOCK_SIZE with any ROW_FORMAT other than COMPRESSED generates an error, not a warning, and the table is not created.

Table 15.7 节, “ROW_FORMAT and KEY_BLOCK_SIZE Options” provides an overview of the ROW_FORMAT and KEY_BLOCK_SIZE options that are used with CREATE TABLE or ALTER TABLE.

Table 15.7 ROW_FORMAT and KEY_BLOCK_SIZE Options

Option: ROW_FORMAT=REDUNDANT
  Usage Notes: Storage format used prior to MySQL 5.0.3
  Description: Less efficient than ROW_FORMAT=COMPACT; for backward compatibility

Option: ROW_FORMAT=COMPACT
  Usage Notes: Default storage format since MySQL 5.0.3
  Description: Stores a prefix of 768 bytes of long column values in the clustered index page, with the remaining bytes stored in an overflow page

Option: ROW_FORMAT=DYNAMIC
  Description: Stores values within the clustered index page if they fit; if not, stores only a 20-byte pointer to an overflow page (no prefix)

Option: ROW_FORMAT=COMPRESSED
  Description: Compresses the table and indexes using zlib

Option: KEY_BLOCK_SIZE=n
  Description: Specifies a compressed page size of 1, 2, 4, 8 or 16 kilobytes; implies ROW_FORMAT=COMPRESSED. For general tablespaces, a KEY_BLOCK_SIZE value equal to the InnoDB page size is not permitted.

Table 15.8 节, “CREATE/ALTER TABLE Warnings and Errors when InnoDB Strict Mode is OFF” summarizes error conditions that occur with certain combinations of configuration parameters and options on the CREATE TABLE or ALTER TABLE statements, and how the options appear in the output of SHOW TABLE STATUS.

When innodb_strict_mode is OFF, MySQL creates or alters the table, but ignores certain settings as shown below. You can see the warning messages in the MySQL error log. When innodb_strict_mode is ON, these specified combinations of options generate errors, and the table is not created or altered. To see the full description of the error condition, issue the SHOW ERRORS statement. For example:

mysql> CREATE TABLE x (id INT PRIMARY KEY, c INT)
    -> ENGINE=INNODB KEY_BLOCK_SIZE=33333;
ERROR 1005 (HY000): Can't create table 'test.x' (errno: 1478)

mysql> SHOW ERRORS;
+-------+------+-------------------------------------------+
| Level | Code | Message                                   |
+-------+------+-------------------------------------------+
| Error | 1478 | InnoDB: invalid KEY_BLOCK_SIZE=33333.     |
| Error | 1005 | Can't create table 'test.x' (errno: 1478) |
+-------+------+-------------------------------------------+

Table 15.8 CREATE/ALTER TABLE Warnings and Errors when InnoDB Strict Mode is OFF

Syntax: ROW_FORMAT=REDUNDANT
  Warning or Error Condition: None
  Resulting ROW_FORMAT (as shown in SHOW TABLE STATUS): REDUNDANT

Syntax: ROW_FORMAT=COMPACT
  Warning or Error Condition: None
  Resulting ROW_FORMAT (as shown in SHOW TABLE STATUS): COMPACT

Syntax: ROW_FORMAT=COMPRESSED or ROW_FORMAT=DYNAMIC or KEY_BLOCK_SIZE is specified
  Warning or Error Condition: Ignored for file-per-table tablespaces unless innodb_file_per_table is enabled. General tablespaces support all row formats. See 15.7.9 节, “InnoDB General Tablespaces”.
  Resulting ROW_FORMAT (as shown in SHOW TABLE STATUS): the default row format for file-per-table tablespaces; the specified row format for general tablespaces

Syntax: Invalid KEY_BLOCK_SIZE is specified (not 1, 2, 4, 8 or 16)
  Warning or Error Condition: KEY_BLOCK_SIZE is ignored
  Resulting ROW_FORMAT (as shown in SHOW TABLE STATUS): the specified row format, or the default row format

Syntax: ROW_FORMAT=COMPRESSED and valid KEY_BLOCK_SIZE are specified
  Warning or Error Condition: None; the specified KEY_BLOCK_SIZE is used
  Resulting ROW_FORMAT (as shown in SHOW TABLE STATUS): COMPRESSED

Syntax: KEY_BLOCK_SIZE is specified with REDUNDANT, COMPACT or DYNAMIC row format
  Warning or Error Condition: KEY_BLOCK_SIZE is ignored
  Resulting ROW_FORMAT (as shown in SHOW TABLE STATUS): REDUNDANT, COMPACT or DYNAMIC

Syntax: ROW_FORMAT is not one of REDUNDANT, COMPACT, DYNAMIC or COMPRESSED
  Warning or Error Condition: Ignored if recognized by the MySQL parser. Otherwise, an error is issued.
  Resulting ROW_FORMAT (as shown in SHOW TABLE STATUS): the default row format or N/A

When innodb_strict_mode is ON, MySQL rejects invalid ROW_FORMAT or KEY_BLOCK_SIZE parameters and issues errors. Strict mode is ON by default. When innodb_strict_mode is OFF, MySQL issues warnings instead of errors for ignored invalid parameters.

It is not possible to see the chosen KEY_BLOCK_SIZE using SHOW TABLE STATUS. The statement SHOW CREATE TABLE displays the KEY_BLOCK_SIZE (even if it was ignored when creating the table). The real compressed page size of the table cannot be displayed by MySQL.

SQL Compression Syntax Warnings and Errors for General Tablespaces
  • If FILE_BLOCK_SIZE was not defined for the general tablespace when the tablespace was created, the tablespace cannot contain compressed tables. If you attempt to add a compressed table, an error is returned, as shown in the following example:

    mysql> CREATE TABLESPACE `ts1` ADD DATAFILE 'ts1.ibd' Engine=InnoDB;
    Query OK, 0 rows affected (0.01 sec)
    
    mysql> CREATE TABLE t1 (c1 INT PRIMARY KEY) TABLESPACE ts1 ROW_FORMAT=COMPRESSED
    KEY_BLOCK_SIZE=8;
    ERROR 1478 (HY000): InnoDB: Tablespace `ts1` cannot contain a COMPRESSED table
  • Attempting to add a table with an invalid KEY_BLOCK_SIZE to a general tablespace returns an error, as shown in the following example:

    mysql> CREATE TABLESPACE `ts2` ADD DATAFILE 'ts2.ibd' FILE_BLOCK_SIZE = 8192 Engine=InnoDB;
    Query OK, 0 rows affected (0.01 sec)
      
    mysql> CREATE TABLE t2 (c1 INT PRIMARY KEY) TABLESPACE ts2 ROW_FORMAT=COMPRESSED
    KEY_BLOCK_SIZE=4;
    ERROR 1478 (HY000): InnoDB: Tablespace `ts2` uses block size 8192 and cannot
    contain a table with physical page size 4096

    For general tablespaces, the KEY_BLOCK_SIZE of the table must be equal to the FILE_BLOCK_SIZE of the tablespace divided by 1024. For example, if the FILE_BLOCK_SIZE of the tablespace is 8192, the KEY_BLOCK_SIZE of the table must be 8.

  • Attempting to add a table with an uncompressed row format to a general tablespace configured to store compressed tables returns an error, as shown in the following example:

    mysql> CREATE TABLESPACE `ts3` ADD DATAFILE 'ts3.ibd' FILE_BLOCK_SIZE = 8192 Engine=InnoDB;
    Query OK, 0 rows affected (0.01 sec)
    
    mysql> CREATE TABLE t3 (c1 INT PRIMARY KEY) TABLESPACE ts3 ROW_FORMAT=COMPACT;
    ERROR 1478 (HY000): InnoDB: Tablespace `ts3` uses block size 8192 and cannot
    contain a table with physical page size 16384
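The size rules behind these errors can be sketched as follows; the helper functions are illustrative, not part of MySQL:

```python
# Illustrative sketch of the general-tablespace size rules shown above.

def required_key_block_size(file_block_size: int) -> int:
    """The only KEY_BLOCK_SIZE (in KB) a general tablespace accepts."""
    return file_block_size // 1024

def page_size_matches(file_block_size: int, key_block_size: int) -> bool:
    """A table's physical page size (KEY_BLOCK_SIZE * 1024) must equal FILE_BLOCK_SIZE."""
    return key_block_size * 1024 == file_block_size

# ts2 above was created with FILE_BLOCK_SIZE = 8192, so only KEY_BLOCK_SIZE=8 fits.
print(required_key_block_size(8192))  # 8
print(page_size_matches(8192, 4))     # False: the mismatch reported in ERROR 1478
```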

innodb_strict_mode is not applicable to general tablespaces. Tablespace management rules for general tablespaces are strictly enforced independently of innodb_strict_mode. For more information, see 13.1.16, “CREATE TABLESPACE Syntax”.

For more information about using compressed tables with general tablespaces, see 15.7.9 节, “InnoDB General Tablespaces”.

15.9.2 InnoDB Page Compression

InnoDB supports page-level compression for tables that reside in file-per-table tablespaces. This feature is referred to as Transparent Page Compression. Page compression is enabled by specifying the COMPRESSION attribute with CREATE TABLE or ALTER TABLE. Supported compression algorithms include Zlib and LZ4.

Supported Platforms

Page compression requires sparse file and hole punching support. Page compression is supported on Windows with NTFS, and on the following subset of MySQL-supported Linux platforms where the kernel level provides hole punching support:

  • RHEL 7 and derived distributions that use kernel version 3.10.0-123 or higher

  • OEL 5.10 (UEK2) kernel version 2.6.39 or higher

  • OEL 6.5 (UEK3) kernel version 3.8.13 or higher

  • OEL 7.0 kernel version 3.8.13 or higher

  • SLE11 kernel version 3.0-x

  • SLE12 kernel version 3.12-x

  • OES11 kernel version 3.0-x

  • Ubuntu 14.04 LTS kernel version 3.13 or higher

  • Ubuntu 12.04 LTS kernel version 3.2 or higher

  • Debian 7 kernel version 3.2 or higher

Note

Not all file systems available for a given Linux distribution support hole punching.

How Page Compression Works

When a page is written, it is compressed using the specified compression algorithm. The compressed data is written to disk, where the hole punching mechanism releases empty blocks from the end of the page. If compression fails, data is written out as-is.

Hole Punch Size on Linux

On Linux systems, the file system block size is the unit size used for hole punching. Therefore, page compression only works if page data can be compressed to a size that is less than or equal to the InnoDB page size minus the file system block size. For example, if innodb_page_size=16K and the file system block size is 4K, page data must compress to less than or equal to 12K to make hole punching possible.
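This threshold, together with the limitation (noted later under "Page Compression Limitations and Usage Notes") that page compression is disabled when the block size times two exceeds the page size, can be sketched as follows; the values are illustrative:

```python
# Illustrative sketch of the Linux hole-punching arithmetic.

def max_compressible_size(innodb_page_size: int, fs_block_size: int) -> int:
    """Largest post-compression size for which hole punching can release space."""
    return innodb_page_size - fs_block_size

def page_compression_possible(innodb_page_size: int, fs_block_size: int) -> bool:
    """Page compression is disabled when fs_block_size * 2 > innodb_page_size."""
    return fs_block_size * 2 <= innodb_page_size

print(max_compressible_size(16 * 1024, 4 * 1024))    # 12288 (12K)
print(page_compression_possible(16 * 1024, 4 * 1024))  # True
```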

Hole Punch Size on Windows

On Windows systems, the underlying infrastructure for sparse files is based on NTFS compression. Hole punching size is the NTFS compression unit, which is 16 times the NTFS cluster size. Cluster sizes and their compression units are shown in the following table:

Table 15.9 Windows NTFS Cluster Size and Compression Units

Cluster SizeCompression Unit
512 Bytes8 KB
1 KB16 KB
2 KB32 KB
4 KB64 KB

Page compression on Windows systems only works if page data can be compressed to a size that is less than or equal to the InnoDB page size minus the compression unit size.

The default NTFS cluster size is 4K, for which the compression unit size is 64K. This means that page compression has no benefit for an out-of-the-box Windows NTFS configuration, as the maximum innodb_page_size is also 64K.

For page compression to work on Windows, the file system must be created with a cluster size smaller than 4K, and the innodb_page_size must be at least twice the size of the compression unit. For example, for page compression to work on Windows, you could build the file system with a cluster size of 512 Bytes (which has a compression unit of 8KB) and initialize InnoDB with an innodb_page_size value of 16K or greater.

Enabling Page Compression

To enable page compression, specify the COMPRESSION attribute in the CREATE TABLE statement. For example:

CREATE TABLE t1 (c1 INT) COMPRESSION="zlib";

You can also enable page compression in an ALTER TABLE statement. However, ALTER TABLE ... COMPRESSION only updates the tablespace compression attribute. Writes to the tablespace that occur after setting the new compression algorithm use the new setting, but to apply the new compression algorithm to existing pages, you must rebuild the table using OPTIMIZE TABLE.

ALTER TABLE t1 COMPRESSION="zlib";
OPTIMIZE TABLE t1;

Disabling Page Compression

To disable page compression, set COMPRESSION=None using ALTER TABLE. Writes to the tablespace that occur after setting COMPRESSION=None no longer use page compression. To uncompress existing pages, you must rebuild the table using OPTIMIZE TABLE after setting COMPRESSION=None.

ALTER TABLE t1 COMPRESSION="None";
OPTIMIZE TABLE t1;

Page Compression Metadata

Page compression metadata is found in the INFORMATION_SCHEMA.INNODB_SYS_TABLESPACES table, in the following columns:

  • FS_BLOCK_SIZE: The file system block size, which is the unit size used for hole punching.

  • FILE_SIZE: The apparent size of the file, which represents the maximum size of the file, uncompressed.

  • ALLOCATED_SIZE: The actual size of the file, which is the amount of space allocated on disk.

Note

On Unix-like systems, ls -l tablespace_name.ibd shows the apparent file size (equivalent to FILE_SIZE) in bytes. To view the actual amount of space allocated on disk (equivalent to ALLOCATED_SIZE), use du --block-size=1 tablespace_name.ibd. The --block-size=1 option prints the allocated space in bytes instead of blocks, so that it can be compared to ls -l output.
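A minimal demonstration of the apparent-versus-allocated distinction, using an ordinary sparse scratch file rather than a real tablespace (GNU coreutils assumed; the file name is illustrative):

```shell
# Create a 1 MiB sparse file: it has an apparent size of 1 MiB,
# but no data blocks are allocated for the hole.
truncate -s 1M demo_sparse.ibd
apparent=$(stat -c %s demo_sparse.ibd)                   # apparent size, like ls -l
allocated=$(du --block-size=1 demo_sparse.ibd | cut -f1) # allocated size, in bytes
echo "apparent=$apparent allocated=$allocated"
rm -f demo_sparse.ibd
```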

Use SHOW CREATE TABLE to view the current page compression setting (Zlib, Lz4, or None). A table may contain a mix of pages with different compression settings.

In the following example, page compression metadata for the employees table is retrieved from the INFORMATION_SCHEMA.INNODB_SYS_TABLESPACES table.

# Create the employees table with Zlib page compression

CREATE TABLE employees (
    emp_no      INT             NOT NULL,
    birth_date  DATE            NOT NULL,
    first_name  VARCHAR(14)     NOT NULL,
    last_name   VARCHAR(16)     NOT NULL,
    gender      ENUM ('M','F')  NOT NULL,  
    hire_date   DATE            NOT NULL,
    PRIMARY KEY (emp_no)
) COMPRESSION="zlib";

# Insert data (not shown)
  
# Query page compression metadata in INFORMATION_SCHEMA.INNODB_SYS_TABLESPACES
  
mysql> SELECT SPACE, NAME, FS_BLOCK_SIZE, FILE_SIZE, ALLOCATED_SIZE FROM
INFORMATION_SCHEMA.INNODB_SYS_TABLESPACES WHERE NAME='employees/employees'\G
*************************** 1. row ***************************
SPACE: 45
NAME: employees/employees
FS_BLOCK_SIZE: 4096
FILE_SIZE: 23068672
ALLOCATED_SIZE: 19415040

Page compression metadata for the employees table shows that the apparent file size is 23068672 bytes while the actual file size (with page compression) is 19415040 bytes. The file system block size is 4096 bytes, which is the block size used for hole punching.

Page Compression Limitations and Usage Notes

  • Page compression is disabled if the file system block size (or compression unit size on Windows) * 2 > innodb_page_size.

  • Page compression is not supported for tables that reside in shared tablespaces, which include the system tablespace, the temporary tablespace, and general tablespaces.

  • Page compression is not supported for undo log tablespaces.

  • Page compression is not supported for redo log pages.

  • R-tree pages, which are used for spatial indexes, are not compressed.

  • Pages that belong to compressed tables (ROW_FORMAT=COMPRESSED) are left as-is.

  • During recovery, updated pages are written out in an uncompressed form.

  • Loading a page-compressed tablespace on a server that does not support the compression algorithm that was used causes an I/O error.

  • Before downgrading to an earlier version of MySQL that does not support page compression, uncompress the tables that use the page compression feature. To uncompress a table, run ALTER TABLE ... COMPRESSION=None and OPTIMIZE TABLE.

  • Page-compressed tablespaces can be copied between Linux and Windows servers if the compression algorithm that was used is available on both servers.

  • Preserving page compression when moving a page-compressed tablespace file from one host to another requires a utility that preserves sparse files.

  • Better page compression may be achieved on Fusion-io hardware with NVMFS than on other platforms, as NVMFS is designed to take advantage of punch hole functionality.

  • Using the page compression feature with a large InnoDB page size and relatively small file system block size could result in write amplification. For example, a maximum InnoDB page size of 64KB with a 4KB file system block size may improve compression but may also increase demand on the buffer pool, leading to increased I/O and potential write amplification.

15.10 InnoDB 行存储和行格式

This section discusses how InnoDB features such as table compression, off-page storage of long variable-length column values, and large index key prefixes are controlled by the row format of an InnoDB table. It also discusses considerations for choosing the right row format, and compatibility of row formats between MySQL releases.

15.10.1 Overview of InnoDB Row Storage

The storage for rows and associated columns affects performance for queries and DML operations. As more rows fit into a single disk page, queries and index lookups can work faster, less cache memory is required in the InnoDB buffer pool, and less I/O is required to write out updated values for the numeric and short string columns.

The data in each InnoDB table is divided into pages. The pages that make up each table are arranged in a tree data structure called a B-tree index. Table data and secondary indexes both use this type of structure. The B-tree index that represents an entire table is known as the clustered index, which is organized according to the primary key columns. The nodes of the index data structure contain the values of all the columns in that row (for the clustered index) or the index columns and the primary key columns (for secondary indexes).

Variable-length columns are an exception to this rule. Columns such as BLOB and VARCHAR that are too long to fit on a B-tree page are stored on separately allocated disk pages called overflow pages. We call such columns off-page columns. The values of these columns are stored in singly-linked lists of overflow pages, and each such column has its own list of one or more overflow pages. In some cases, all or a prefix of the long column value is stored in the B-tree, to avoid wasting storage and to eliminate the need to read a separate page.

The following sections describe how to configure the row format of InnoDB tables to control how variable-length columns values are stored. Row format configuration also determines the availability of the table compression feature and large index key prefix support.

15.10.2 Specifying the Row Format for a Table

The default row format is defined by innodb_default_row_format, which has a default value of DYNAMIC. The default row format is used when the ROW_FORMAT table option is not defined explicitly or when ROW_FORMAT=DEFAULT is specified.

The row format of a table can be defined explicitly using the ROW_FORMAT table option in a CREATE TABLE or ALTER TABLE statement. For example:

CREATE TABLE t1 (c1 INT) ROW_FORMAT=DYNAMIC;

An explicitly defined ROW_FORMAT setting overrides the implicit default. Specifying ROW_FORMAT=DEFAULT is equivalent to using the implicit default.

The innodb_default_row_format option can be set dynamically:

mysql> SET GLOBAL innodb_default_row_format=DYNAMIC;

Valid innodb_default_row_format options include DYNAMIC, COMPACT, and REDUNDANT. The COMPRESSED row format, which is not supported for use in the system tablespace, cannot be defined as the default. It can only be specified explicitly in a CREATE TABLE or ALTER TABLE statement. Attempting to set innodb_default_row_format to COMPRESSED returns an error:

mysql> SET GLOBAL innodb_default_row_format=COMPRESSED;
ERROR 1231 (42000): Variable 'innodb_default_row_format'
can't be set to the value of 'COMPRESSED'

Newly created tables use the row format defined by innodb_default_row_format when a ROW_FORMAT option is not specified explicitly or when ROW_FORMAT=DEFAULT is used. For example, the following CREATE TABLE statements use the row format defined by innodb_default_row_format.

CREATE TABLE t1 (c1 INT);
CREATE TABLE t2 (c1 INT) ROW_FORMAT=DEFAULT;

When a ROW_FORMAT option is not specified explicitly or when ROW_FORMAT=DEFAULT is used, any operation that rebuilds a table also silently changes the row format of the table to the format defined by innodb_default_row_format.

Table-rebuilding operations include ALTER TABLE operations that use ALGORITHM=COPY or ALTER TABLE operations that use ALGORITHM=INPLACE where table rebuilding is required. See Table 15.10, “Online Status for DDL Operations” for an overview of the online status of DDL operations. OPTIMIZE TABLE is also a table-rebuilding operation.

The following example demonstrates a table-rebuilding operation that silently changes the row format of a table created without an explicitly defined row format.

mysql> SELECT @@innodb_default_row_format;
+-----------------------------+
| @@innodb_default_row_format |
+-----------------------------+
| dynamic                     |
+-----------------------------+

mysql> CREATE TABLE t1 (c1 INT);

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES WHERE NAME LIKE 'test/t1' \G
*************************** 1. row ***************************
     TABLE_ID: 54
         NAME: test/t1
         FLAG: 33
       N_COLS: 4
        SPACE: 35
   ROW_FORMAT: Dynamic
ZIP_PAGE_SIZE: 0
   SPACE_TYPE: Single

mysql> SET GLOBAL innodb_default_row_format=COMPACT;

mysql> ALTER TABLE t1 ADD COLUMN (c2 INT);

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES WHERE NAME LIKE 'test/t1' \G
*************************** 1. row ***************************
     TABLE_ID: 55
         NAME: test/t1
         FLAG: 1
       N_COLS: 5
        SPACE: 36
   ROW_FORMAT: Compact
ZIP_PAGE_SIZE: 0
   SPACE_TYPE: Single

Consider the following potential issues before changing the row format of existing tables from REDUNDANT or COMPACT to DYNAMIC.

  • The REDUNDANT and COMPACT row formats support a maximum index key prefix length of 767 bytes, whereas the DYNAMIC and COMPRESSED row formats support an index key prefix length of 3072 bytes. In a replication environment, if innodb_default_row_format is set to DYNAMIC on the master and set to COMPACT on the slave, the following DDL statement, which does not explicitly define a row format, succeeds on the master but fails on the slave:

    CREATE TABLE t1 (c1 INT PRIMARY KEY, c2 VARCHAR(5000), KEY i1(c2(3070)));

    For related information, see 15.8.1.7 节, “Limits on InnoDB Tables”.

  • Importing a table that does not explicitly define a row format results in a schema mismatch error if the innodb_default_row_format setting on the source server differs from the setting on the destination server. For more information, refer to the limitations outlined in 15.7.6 节, “Copying File-Per-Table Tablespaces to Another Instance”.
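The prefix-length limits behind the replication example above can be sketched as follows; this is illustrative and assumes a single-byte character set for the indexed column:

```python
# Illustrative sketch: maximum index key prefix length (bytes) per row format.
PREFIX_LIMITS = {"REDUNDANT": 767, "COMPACT": 767, "DYNAMIC": 3072, "COMPRESSED": 3072}

def prefix_fits(row_format: str, prefix_bytes: int) -> bool:
    """Whether an index key prefix of the given byte length is permitted."""
    return prefix_bytes <= PREFIX_LIMITS[row_format]

# KEY i1(c2(3070)) with a single-byte charset: accepted on a DYNAMIC master,
# rejected on a COMPACT slave.
print(prefix_fits("DYNAMIC", 3070))  # True
print(prefix_fits("COMPACT", 3070))  # False
```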

To view the row format of a table, issue a SHOW TABLE STATUS statement or query INFORMATION_SCHEMA.TABLES.

SELECT * FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES WHERE NAME LIKE 'test/t1' \G

The row format of an InnoDB table determines its physical row structure. See 15.8.1.2 节, “The Physical Row Structure of an InnoDB Table” for more information.

15.10.3 DYNAMIC and COMPRESSED Row Formats

When a table is created with ROW_FORMAT=DYNAMIC or ROW_FORMAT=COMPRESSED, InnoDB can store long variable-length column values (for VARCHAR, VARBINARY, and BLOB and TEXT types) fully off-page, with the clustered index record containing only a 20-byte pointer to the overflow page. InnoDB also encodes fixed-length fields greater than or equal to 768 bytes in length as variable-length fields. For example, a CHAR(255) column can exceed 768 bytes if the maximum byte length of the character set is greater than 3, as it is with utf8mb4.
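The CHAR(255)/utf8mb4 arithmetic above amounts to a quick illustrative check:

```python
# Illustrative: the maximum encoded length of a CHAR column in bytes.
def max_byte_length(char_length: int, max_bytes_per_char: int) -> int:
    return char_length * max_bytes_per_char

# CHAR(255) with utf8mb4 (up to 4 bytes per character) can exceed 768 bytes,
# so InnoDB encodes it as a variable-length field.
print(max_byte_length(255, 4))  # 1020
```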

Whether any columns are stored off-page depends on the page size and the total size of the row. When the row is too long, InnoDB chooses the longest columns for off-page storage until the clustered index record fits on the B-tree page. TEXT and BLOB columns that are less than or equal to 40 bytes are always stored in-line.

The DYNAMIC row format maintains the efficiency of storing the entire row in the index node if it fits (as do the COMPACT and REDUNDANT formats), but the DYNAMIC row format avoids the problem of filling B-tree nodes with a large number of data bytes from long columns. The DYNAMIC format is based on the idea that if a portion of a long data value is stored off-page, it is usually most efficient to store all of the value off-page. With DYNAMIC format, shorter columns are likely to remain in the B-tree node, minimizing the number of overflow pages needed for any given row.

The COMPRESSED row format uses similar internal details for off-page storage as the DYNAMIC row format, with additional storage and performance considerations from the table and index data being compressed and using smaller page sizes. With the COMPRESSED row format, the KEY_BLOCK_SIZE option controls how much column data is stored in the clustered index, and how much is placed on overflow pages. For full details about the COMPRESSED row format, see 15.9 节, “InnoDB Table and Page Compression”.

Both DYNAMIC and COMPRESSED row formats support index key prefixes up to 3072 bytes.

Tables that use the COMPRESSED row format can be created in file-per-table tablespaces or general tablespaces. The system tablespace does not support the COMPRESSED row format. To store a COMPRESSED table in a file-per-table tablespace, innodb_file_per_table must be enabled. The innodb_file_per_table configuration option is not applicable to general tablespaces. General tablespaces support all row formats with the caveat that compressed and uncompressed tables cannot coexist in the same general tablespace due to different physical page sizes. For more information about general tablespaces, see 15.7.9 节, “InnoDB General Tablespaces”.
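As an illustrative sketch (the table and tablespace names t1, t2, and ts_comp are hypothetical), the following statements show both ways of creating a COMPRESSED table; for a general tablespace, the FILE_BLOCK_SIZE must correspond to the KEY_BLOCK_SIZE of the compressed tables stored in it:

```sql
-- Compressed table in a file-per-table tablespace
-- (requires innodb_file_per_table=ON, the default):
CREATE TABLE t1 (c1 INT PRIMARY KEY, c2 VARCHAR(100))
    ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;

-- Compressed table in a general tablespace; FILE_BLOCK_SIZE
-- must match the intended compressed page size (8K here):
CREATE TABLESPACE ts_comp ADD DATAFILE 'ts_comp.ibd'
    FILE_BLOCK_SIZE=8192 ENGINE=INNODB;
CREATE TABLE t2 (c1 INT PRIMARY KEY, c2 VARCHAR(100))
    TABLESPACE ts_comp ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
```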

DYNAMIC tables can be stored in file-per-table tablespaces, general tablespaces, and the system tablespace. To store DYNAMIC tables in the system tablespace, you can either disable innodb_file_per_table and use a regular CREATE TABLE or ALTER TABLE statement, or you can use the TABLESPACE [=] innodb_system table option with CREATE TABLE or ALTER TABLE without having to alter your innodb_file_per_table setting. The innodb_file_per_table configuration option is not applicable to general tablespaces, nor is it applicable when using the TABLESPACE [=] innodb_system table option to store DYNAMIC tables in the system tablespace.
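For example (the table name t1 is illustrative), the TABLESPACE [=] innodb_system option places a DYNAMIC table in the system tablespace regardless of the innodb_file_per_table setting:

```sql
CREATE TABLE t1 (c1 INT PRIMARY KEY, c2 TEXT)
    TABLESPACE = innodb_system ROW_FORMAT=DYNAMIC;
```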

InnoDB does not support compressed temporary tables. When innodb_strict_mode is enabled (the default), CREATE TEMPORARY TABLE returns an error if ROW_FORMAT=COMPRESSED or KEY_BLOCK_SIZE is specified. If innodb_strict_mode is disabled, warnings are issued and the temporary table is created using a non-compressed row format.

DYNAMIC and COMPRESSED row formats are variations of the COMPACT row format and therefore handle CHAR storage in the same way as the COMPACT row format. For more information, see 15.8.1.2 节, “The Physical Row Structure of an InnoDB Table”.

15.10.4 COMPACT and REDUNDANT Row Formats

InnoDB tables that use the COMPACT or REDUNDANT row format store up to the first 768 bytes of variable-length columns (VARCHAR, VARBINARY, and BLOB and TEXT types) in the index record within the B-tree node, with the remainder stored on the overflow pages. InnoDB also encodes fixed-length fields greater than or equal to 768 bytes in length as variable-length fields, which can be stored off-page. For example, a CHAR(255) column can exceed 768 bytes if the maximum byte length of the character set is greater than 3, as it is with utf8mb4.

For COMPACT or REDUNDANT row formats, if the value of a column is 768 bytes or less, no overflow page is needed, and some savings in I/O may result, since the value is in the B-tree node. This works well for relatively short BLOBs, but may cause B-tree nodes to fill with data rather than key values, reducing their efficiency. Tables with many BLOB columns could cause B-tree nodes to become too full of data, and contain too few rows, making the entire index less efficient than if the rows were shorter or if the column values were stored off-page.

The default row format is DYNAMIC, as defined by the innodb_default_row_format configuration option. See 15.10.3 节, “DYNAMIC and COMPRESSED Row Formats” for more information.

For information about the physical row structure of tables that use the REDUNDANT or COMPACT row format, see 15.8.1.2 节, “The Physical Row Structure of an InnoDB Table”.

15.11 InnoDB磁盘I/O 和文件空间管理

As a DBA, you must manage disk I/O to keep the I/O subsystem from becoming saturated, and manage disk space to avoid filling up storage devices. The ACID design model requires a certain amount of I/O that might seem redundant, but helps to ensure data reliability. Within these constraints, InnoDB tries to optimize the database work and the organization of disk files to minimize the amount of disk I/O. Sometimes, I/O is postponed until the database is not busy, or until everything needs to be brought to a consistent state, such as during a database restart after a fast shutdown.

This section discusses the main considerations for I/O and disk space with the default kind of MySQL tables (also known as InnoDB tables):

  • Controlling the amount of background I/O used to improve query performance.

  • Enabling or disabling features that provide extra durability at the expense of additional I/O.

  • Organizing tables into many small files, a few larger files, or a combination of both.

  • Balancing the size of redo log files against the I/O activity that occurs when the log files become full.

  • How to reorganize a table for optimal query performance.

15.11.1 InnoDB Disk I/O

InnoDB uses asynchronous disk I/O where possible, by creating a number of threads to handle I/O operations, while permitting other database operations to proceed while the I/O is still in progress. On Linux and Windows platforms, InnoDB uses the available OS and library functions to perform native asynchronous I/O. On other platforms, InnoDB still uses I/O threads, but the threads may actually wait for I/O requests to complete; this technique is known as simulated asynchronous I/O.

Read-Ahead

If InnoDB can determine there is a high probability that data might be needed soon, it performs read-ahead operations to bring that data into the buffer pool so that it is available in memory. Making a few large read requests for contiguous data can be more efficient than making several small, spread-out requests. There are two read-ahead heuristics in InnoDB:

  • In sequential read-ahead, if InnoDB notices that the access pattern to a segment in the tablespace is sequential, it posts in advance a batch of reads of database pages to the I/O system.

  • In random read-ahead, if InnoDB notices that some area in a tablespace seems to be in the process of being fully read into the buffer pool, it posts the remaining reads to the I/O system.

For information about configuring read-ahead heuristics, see 15.6.3.5 节, “Configuring InnoDB Buffer Pool Prefetching (Read-Ahead)”.

Doublewrite Buffer

InnoDB uses a novel file flush technique involving a structure called the doublewrite buffer, which is enabled by default in most cases (innodb_doublewrite=ON). It adds safety to recovery following a crash or power outage, and improves performance on most varieties of Unix by reducing the need for fsync() operations.

Before writing pages to a data file, InnoDB first writes them to a contiguous tablespace area called the doublewrite buffer. Only after the write and the flush to the doublewrite buffer has completed does InnoDB write the pages to their proper positions in the data file. If there is an operating system, storage subsystem, or mysqld process crash in the middle of a page write (causing a torn page condition), InnoDB can later find a good copy of the page from the doublewrite buffer during recovery.

If system tablespace files (ibdata files) are located on Fusion-io devices that support atomic writes, doublewrite buffering is automatically disabled and Fusion-io atomic writes are used for all data files. Because the doublewrite buffer setting is global, doublewrite buffering is also disabled for data files residing on non-Fusion-io hardware. This feature is only supported on Fusion-io hardware and is only enabled for Fusion-io NVMFS on Linux. To take full advantage of this feature, an innodb_flush_method setting of O_DIRECT is recommended.

15.11.2 File Space Management

The data files that you define in the configuration file using the innodb_data_file_path configuration option form the InnoDB system tablespace. The files are logically concatenated to form the system tablespace. There is no striping in use. You cannot define where within the system tablespace your tables are allocated. In a newly created system tablespace, InnoDB allocates space starting from the first data file.

To avoid the issues that come with storing all tables and indexes inside the system tablespace, you can enable the innodb_file_per_table configuration option (the default), which stores each newly created table in a separate tablespace file (with extension .ibd). For tables stored this way, there is less fragmentation within the disk file, and when the table is truncated, the space is returned to the operating system rather than still being reserved by InnoDB within the system tablespace. For more information, see 15.7.4 节, “InnoDB File-Per-Table Tablespaces”.

You can also store tables in general tablespaces. General tablespaces are shared tablespaces created using CREATE TABLESPACE syntax. They can be created outside of the MySQL data directory, are capable of holding multiple tables, and support tables of all row formats. For more information, see 15.7.9 节, “InnoDB General Tablespaces”.
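A minimal sketch (the datafile path and the names ts1 and t3 are hypothetical) of creating a general tablespace outside the MySQL data directory and assigning a table to it:

```sql
CREATE TABLESPACE ts1 ADD DATAFILE '/mnt/fast_storage/ts1.ibd'
    ENGINE=INNODB;
CREATE TABLE t3 (c1 INT PRIMARY KEY) TABLESPACE = ts1;
```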

Pages, Extents, Segments, and Tablespaces

Each tablespace consists of database pages. Every tablespace in a MySQL instance has the same page size. By default, all tablespaces have a page size of 16KB; you can reduce the page size to 8KB or 4KB by specifying the innodb_page_size option when you create the MySQL instance. You can also increase the page size to 32KB or 64KB. For more information, refer to the innodb_page_size documentation.
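Because the page size applies to the whole instance, it is set in the option file before the data directory is initialized. A sketch of the relevant option-file entry (this is not a change you can apply to an existing instance):

```ini
# my.cnf (read at instance initialization; innodb_page_size
# cannot be changed after the data files are created)
[mysqld]
innodb_page_size=8k
```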

The pages are grouped into extents of size 1MB for pages up to 16KB in size (64 consecutive 16KB pages, or 128 8KB pages, or 256 4KB pages). For a page size of 32KB, extent size is 2MB. For page size of 64KB, extent size is 4MB. The files inside a tablespace are called segments in InnoDB. (These segments are different from the rollback segment, which actually contains many tablespace segments.)

When a segment grows inside the tablespace, InnoDB allocates the first 32 pages to it one at a time. After that, InnoDB starts to allocate whole extents to the segment. InnoDB can add up to 4 extents at a time to a large segment to ensure good sequentiality of data.

Two segments are allocated for each index in InnoDB. One is for nonleaf nodes of the B-tree, the other is for the leaf nodes. Keeping the leaf nodes contiguous on disk enables better sequential I/O operations, because these leaf nodes contain the actual table data.

Some pages in the tablespace contain bitmaps of other pages, and therefore a few extents in an InnoDB tablespace cannot be allocated to segments as a whole, but only as individual pages.

When you ask for available free space in the tablespace by issuing a SHOW TABLE STATUS statement, InnoDB reports the extents that are definitely free in the tablespace. InnoDB always reserves some extents for cleanup and other internal purposes; these reserved extents are not included in the free space.
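For example, assuming a table named t1, the Data_free value reported by SHOW TABLE STATUS reflects those definitely-free extents:

```sql
SHOW TABLE STATUS LIKE 't1'\G
-- Data_free: free space, in bytes, in the tablespace that
-- contains the table (internally reserved extents are excluded).
```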

When you delete data from a table, InnoDB contracts the corresponding B-tree indexes. Whether the freed space becomes available for other users depends on whether the pattern of deletes frees individual pages or extents to the tablespace. Dropping a table or deleting all rows from it is guaranteed to release the space to other users, but remember that deleted rows are physically removed only by the purge operation, which happens automatically some time after they are no longer needed for transaction rollbacks or consistent reads. (See 15.3 节, “InnoDB Multi-Versioning”.)

How Pages Relate to Table Rows

The maximum row length is slightly less than half a database page for 4KB, 8KB, 16KB, and 32KB innodb_page_size settings. For example, the maximum row length is slightly less than 8KB for the default 16KB InnoDB page size. For 64KB pages, the maximum row length is slightly less than 16KB.

If a row does not exceed the maximum row length, all of it is stored locally within the page. If a row exceeds the maximum row length, variable-length columns are chosen for external off-page storage until the row fits within the maximum row length limit. External off-page storage for variable-length columns differs by row format:

  • COMPACT and REDUNDANT Row Formats

    When a variable-length column is chosen for external off-page storage, InnoDB stores the first 768 bytes locally in the row, and the rest externally into overflow pages. Each such column has its own list of overflow pages. The 768-byte prefix is accompanied by a 20-byte value that stores the true length of the column and points into the overflow list where the rest of the value is stored. See 15.10.4 节, “COMPACT and REDUNDANT Row Formats”.

  • DYNAMIC and COMPRESSED Row Formats

    When a variable-length column is chosen for external off-page storage, InnoDB stores a 20-byte pointer locally in the row, and the rest externally into overflow pages. See 15.10.3 节, “DYNAMIC and COMPRESSED Row Formats”.

LONGBLOB and LONGTEXT columns must be less than 4GB, and the total row length, including BLOB and TEXT columns, must be less than 4GB.

15.11.3 InnoDB Checkpoints

Making your log files very large may reduce disk I/O during checkpointing. It often makes sense to set the total size of the log files as large as the buffer pool or even larger.
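An option-file sketch of that sizing guideline (the values are illustrative, not recommendations; total redo capacity is innodb_log_file_size multiplied by innodb_log_files_in_group):

```ini
[mysqld]
# Two 1GB redo log files give 2GB of total redo capacity,
# comparable to a 2GB buffer pool.
innodb_log_file_size=1G
innodb_log_files_in_group=2
innodb_buffer_pool_size=2G
```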

How Checkpoint Processing Works

InnoDB implements a checkpoint mechanism known as fuzzy checkpointing. InnoDB flushes modified database pages from the buffer pool in small batches. There is no need to flush the buffer pool in one single batch, which would disrupt processing of user SQL statements during the checkpointing process.

During crash recovery, InnoDB looks for a checkpoint label written to the log files. It knows that all modifications to the database before the label are present in the disk image of the database. Then InnoDB scans the log files forward from the checkpoint, applying the logged modifications to the database.

15.11.4 Defragmenting a Table

Random insertions into or deletions from a secondary index can cause the index to become fragmented. Fragmentation means that the physical ordering of the index pages on the disk is not close to the index ordering of the records on the pages, or that there are many unused pages in the 64-page blocks that were allocated to the index.

One symptom of fragmentation is that a table takes more space than it should take. How much that is exactly, is difficult to determine. All InnoDB data and indexes are stored in B-trees, and their fill factor may vary from 50% to 100%. Another symptom of fragmentation is that a table scan such as this takes more time than it should take:

SELECT COUNT(*) FROM t WHERE non_indexed_column <> 12345;

The preceding query requires MySQL to perform a full table scan, the slowest type of query for a large table.

To speed up index scans, you can periodically perform a null ALTER TABLE operation, which causes MySQL to rebuild the table:

ALTER TABLE tbl_name ENGINE=INNODB

You can also use ALTER TABLE tbl_name FORCE to perform a null alter operation that rebuilds the table.

Both ALTER TABLE tbl_name ENGINE=INNODB and ALTER TABLE tbl_name FORCE use online DDL. For more information, see 15.12.1 节, “Online DDL Overview”.

Another way to perform a defragmentation operation is to use mysqldump to dump the table to a text file, drop the table, and reload it from the dump file.

If the insertions into an index are always ascending and records are deleted only from the end, the InnoDB filespace management algorithm guarantees that fragmentation in the index does not occur.

15.11.5 Reclaiming Disk Space with TRUNCATE TABLE

To reclaim operating system disk space when truncating an InnoDB table, the table must be stored in its own .ibd file. For a table to be stored in its own .ibd file, innodb_file_per_table must be enabled when the table is created. Additionally, there cannot be a foreign key constraint between the table being truncated and other tables, otherwise the TRUNCATE TABLE operation fails. A foreign key constraint between two columns in the same table, however, is permitted.

When a table is truncated, it is dropped and re-created in a new .ibd file, and the freed space is returned to the operating system. This is in contrast to truncating InnoDB tables that are stored within the InnoDB system tablespace (tables created when innodb_file_per_table=OFF) and tables stored in shared general tablespaces, where only InnoDB can use the freed space after the table is truncated.
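To illustrate (the table name t1 is hypothetical), you can verify the file-per-table setting and then truncate:

```sql
SELECT @@innodb_file_per_table;  -- must have been ON when t1 was created
TRUNCATE TABLE t1;  -- t1 is dropped and re-created in a new .ibd file;
                    -- the freed space is returned to the operating system
```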

The ability to truncate tables and return disk space to the operating system also means that physical backups can be smaller. Truncating tables that are stored in the system tablespace (tables created when innodb_file_per_table=OFF) or in a general tablespace leaves blocks of unused space in the tablespace.

15.12 InnoDB 和在线DDL

The InnoDB online DDL feature permits in-place table alterations or concurrent DML, or both. Benefits of this feature include:

  • Improved responsiveness and availability in busy production environments, where making a table unavailable for minutes or hours is not practical.

  • The ability to adjust the balance between performance and concurrency during a DDL operation using the LOCK clause.

    • LOCK=EXCLUSIVE blocks access to the table entirely.

    • LOCK=SHARED allows queries but not DML.

    • LOCK=NONE allows full query and DML access to the table.

    • LOCK=DEFAULT or omitting the LOCK clause permits as much concurrency as possible depending on the type of DDL operation.

  • Avoidance of disk space usage and I/O overhead associated with copying the table and reconstructing secondary indexes.
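For example (names are illustrative), a LOCK clause acts as an assertion alongside the DDL operation:

```sql
-- Fails with an error, rather than silently taking a stronger lock,
-- if the operation cannot run with full concurrent DML:
ALTER TABLE t1 ADD INDEX idx_c2 (c2), ALGORITHM=INPLACE, LOCK=NONE;
```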

15.12.1 Online DDL Overview

The online DDL feature enhances many DDL operations that formerly required a table copy or blocked DML operations on the table, or both. Table 15.10, “Online Status for DDL Operations” shows how the online DDL feature applies to each DDL statement.

With the exception of some ALTER TABLE partitioning clauses, online DDL operations for partitioned InnoDB tables follow the same rules that apply to regular InnoDB tables. For more information, see 15.12.7 节, “Online DDL for Partitioned Tables”.

Some factors affect the performance, space usage, and semantics of online DDL operations. For more information, see 15.12.8 节, “Online DDL Limitations”.

  • The In-Place? column shows which operations permit the ALGORITHM=INPLACE clause.

  • The Rebuilds Table? column shows which operations rebuild the table. For operations that use the INPLACE algorithm, the table is rebuilt in place. For operations that do not support the INPLACE algorithm, the table copy method is used to rebuild the table.

  • The Permits Concurrent DML? column shows which operations are performed fully online. You can specify LOCK=NONE to assert that concurrent DML is permitted during the DDL operation. MySQL automatically permits concurrent DML when possible.

    Concurrent queries are permitted during all online DDL operations. You can specify LOCK=SHARED to assert that concurrent queries are permitted during a DDL operation. MySQL automatically permits concurrent queries when possible.

  • The Notes column provides additional information and explains exceptions and dependencies related to the Yes/No values of other columns. An asterisk indicates an exception or dependency.

Table 15.10 Online Status for DDL Operations

Operation | In-Place? | Rebuilds Table? | Permits Concurrent DML? | Only Modifies Metadata? | Notes
CREATE INDEX, ADD INDEX | Yes* | No* | Yes | No | Restrictions apply for FULLTEXT indexes; see next row.
ADD FULLTEXT INDEX | Yes* | No* | No | No | Adding the first FULLTEXT index rebuilds the table if there is no user-defined FTS_DOC_ID column. Subsequent FULLTEXT indexes may be added on the same table without rebuilding the table.
ADD SPATIAL INDEX | Yes | No | No | No |
RENAME INDEX | Yes | No | Yes | Yes | Only modifies table metadata.
DROP INDEX | Yes | No | Yes | Yes | Only modifies table metadata.
OPTIMIZE TABLE | Yes* | Yes | Yes | No | In-place operation is not supported for tables with FULLTEXT indexes.
Set column default value | Yes | No | Yes | Yes | Only modifies table metadata.
Change auto-increment value | Yes | No | Yes | No* | Modifies a value stored in memory, not the data file.
Add foreign key constraint | Yes* | No | Yes | Yes | The INPLACE algorithm is supported when foreign_key_checks is disabled. Otherwise, only the COPY algorithm is supported.
Drop foreign key constraint | Yes | No | Yes | Yes | foreign_key_checks can be enabled or disabled.
Rename column | Yes* | No | Yes* | Yes | To permit concurrent DML, keep the same data type and only change the column name. ALGORITHM=INPLACE is not supported for renaming a generated column.
Add column | Yes* | Yes* | Yes* | No | Concurrent DML is not permitted when adding an auto-increment column. Data is reorganized substantially, making it an expensive operation. ALGORITHM=INPLACE is supported for adding a virtual generated column but not for adding a stored generated column. Adding a virtual generated column does not require a table rebuild.
Drop column | Yes | Yes* | Yes | No | Data is reorganized substantially, making it an expensive operation. ALGORITHM=INPLACE is supported for dropping a generated column. Dropping a virtual generated column does not require a table rebuild.
Reorder columns | Yes | Yes | Yes | No | Data is reorganized substantially, making it an expensive operation.
Change ROW_FORMAT property | Yes | Yes | Yes | No | Data is reorganized substantially, making it an expensive operation.
Change KEY_BLOCK_SIZE property | Yes | Yes | Yes | No | Data is reorganized substantially, making it an expensive operation.
Make column NULL | Yes | Yes* | Yes | No | Rebuilds the table in place. Data is reorganized substantially, making it an expensive operation.
Make column NOT NULL | Yes* | Yes* | Yes | No | Rebuilds the table in place. STRICT_ALL_TABLES or STRICT_TRANS_TABLES SQL_MODE is required for the operation to succeed. The operation fails if the column contains NULL values. The server prohibits changes to foreign key columns that have the potential to cause loss of referential integrity. See 13.1.7 节, “ALTER TABLE Syntax”. Data is reorganized substantially, making it an expensive operation.
Change column data type | No* | Yes | No | No | VARCHAR size may be increased using online ALTER TABLE. See Modifying Column Properties for more information.
Add primary key | Yes* | Yes* | Yes | No | Rebuilds the table in place. Data is reorganized substantially, making it an expensive operation. ALGORITHM=INPLACE is not permitted under certain conditions if columns have to be converted to NOT NULL.
Drop primary key and add another | Yes | Yes | Yes | No | Data is reorganized substantially, making it an expensive operation.
Drop primary key | No | Yes | No | No | Only ALGORITHM=COPY supports dropping a primary key without adding a new one in the same ALTER TABLE statement.
Convert character set | No | Yes* | No | No | Rebuilds the table if the new character encoding is different.
Specify character set | No | Yes* | No | No | Rebuilds the table if the new character encoding is different.
Rebuild with FORCE option | Yes* | Yes | Yes | No | Uses ALGORITHM=INPLACE. ALGORITHM=INPLACE is not supported for tables with FULLTEXT indexes.
Null rebuild using ALTER TABLE ... ENGINE=INNODB | Yes* | Yes | Yes | No | Uses ALGORITHM=INPLACE. ALGORITHM=INPLACE is not supported for tables with FULLTEXT indexes.
Set STATS_PERSISTENT, STATS_AUTO_RECALC, STATS_SAMPLE_PAGES persistent statistics options | Yes | No | Yes | Yes | Only modifies table metadata.
ALTER TABLE … ENCRYPTION | No | Yes | No | Yes |
Drop a STORED column | Yes | Yes* | Yes | No | Rebuilds the table in place.
Modify STORED column order | Yes | Yes* | Yes | No | Rebuilds the table in place.
Add a STORED column | Yes | Yes* | Yes | No | Rebuilds the table in place.
Drop a VIRTUAL column | Yes | No | Yes | Yes |
Modify VIRTUAL column order | Yes | No | Yes | Yes |
Add a VIRTUAL column | Yes | No | Yes | Yes |

The sections that follow provide basic syntax and usage notes for various online DDL operations.

Adding or Dropping Secondary Indexes

  • Adding a secondary index:

    CREATE INDEX name ON table (col_list);
    ALTER TABLE table ADD INDEX name (col_list);
  • Dropping a secondary index:

    DROP INDEX name ON table;
    ALTER TABLE table DROP INDEX name;

Although no syntax changes are required in the CREATE INDEX or DROP INDEX commands, some factors affect the performance, space usage, and semantics of this operation (see 15.12.8 节, “Online DDL Limitations”).

Creating and dropping secondary indexes on InnoDB tables skips the table-copying behavior.

The table remains available for read and write operations while the index is being created or dropped. The CREATE INDEX or DROP INDEX statement only finishes after all transactions that are accessing the table are completed, so that the initial state of the index reflects the most recent contents of the table. Previously, modifying the table while an index is being created or dropped typically resulted in a deadlock that cancelled the INSERT, UPDATE, or DELETE statement on the table.

Online DDL support for adding secondary indexes means that you can generally speed the overall process of creating and loading a table and associated indexes by creating the table without any secondary indexes, then adding the secondary indexes after the data is loaded.
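A sketch of that loading pattern (the table, column, and file names are illustrative):

```sql
CREATE TABLE t (id INT PRIMARY KEY, a INT, b VARCHAR(50));
-- bulk-load the data first, for example:
LOAD DATA INFILE '/tmp/t.csv' INTO TABLE t;
-- then build the secondary indexes in one pass:
ALTER TABLE t ADD INDEX idx_a (a), ADD INDEX idx_b (b);
```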

Modifying Column Properties

  • Modify the default value for a column:

    ALTER TABLE tbl ALTER COLUMN col SET DEFAULT literal;
    ALTER TABLE tbl ALTER COLUMN col DROP DEFAULT;

    The default values for columns are stored in the InnoDB data dictionary.

  • Changing the auto-increment value for a column:

    ALTER TABLE table AUTO_INCREMENT=next_value;

    Especially in a distributed system using replication or sharding, you sometimes reset the auto-increment counter for a table to a specific value. The next row inserted into the table uses the specified value for its auto-increment column. You might also use this technique in a data warehousing environment where you periodically empty all the tables and reload them, and you can restart the auto-increment sequence from 1.

  • Renaming a column:

    ALTER TABLE tbl CHANGE old_col_name new_col_name datatype;

    When you keep the same data type and [NOT] NULL attribute, only changing the column name, this operation can always be performed online.

    You can also rename a column that is part of a foreign key constraint. The foreign key definition is automatically updated to use the new column name. Renaming a column participating in a foreign key only works with the in-place mode of ALTER TABLE. If you use the ALGORITHM=COPY clause, or some other condition causes the command to use ALGORITHM=COPY behind the scenes, the ALTER TABLE statement fails.

  • Extending VARCHAR size using an in-place ALTER TABLE statement:

    ALTER TABLE t1 ALGORITHM=INPLACE, CHANGE COLUMN c1 c1 VARCHAR(255);
    

    The number of length bytes required by a VARCHAR column must remain the same. For VARCHAR values of 0 to 255, one length byte is required to encode the value. For VARCHAR values of 256 bytes or more, two length bytes are required. As a result, in-place ALTER TABLE only supports increasing VARCHAR size from 0 to 255 bytes or increasing VARCHAR size from a value equal to or greater than 256 bytes. In-place ALTER TABLE does not support increasing VARCHAR size from less than 256 bytes to a value equal to or greater than 256 bytes. In this case, the number of required length bytes would change from 1 to 2, which is only supported by a table copy (ALGORITHM=COPY). For example, attempting to change VARCHAR column size from 255 to 256 using in-place ALTER TABLE would return an error:

    ALTER TABLE t1 ALGORITHM=INPLACE, CHANGE COLUMN c1 c1 VARCHAR(256);
    ERROR 0A000: ALGORITHM=INPLACE is not supported. Reason: Cannot change
    column type INPLACE. Try ALGORITHM=COPY.
    

    Decreasing VARCHAR size using in-place ALTER TABLE is not supported. Decreasing VARCHAR size requires a table copy (ALGORITHM=COPY).

Adding or Dropping Foreign Keys

  • Adding or dropping a foreign key constraint:

    ALTER TABLE tbl1 ADD CONSTRAINT fk_name FOREIGN KEY index (col1) 
      REFERENCES tbl2(col2) referential_actions;
    
    ALTER TABLE tbl DROP FOREIGN KEY fk_name;

    Dropping a foreign key can be performed online with the foreign_key_checks option enabled or disabled. Creating a foreign key online requires foreign_key_checks to be disabled.

    If you do not know the names of the foreign key constraints on a particular table, issue the following statement and find the constraint name in the CONSTRAINT clause for each foreign key:

    SHOW CREATE TABLE table\G
    

    Or, query the INFORMATION_SCHEMA.TABLE_CONSTRAINTS table and use the CONSTRAINT_NAME and CONSTRAINT_TYPE columns to identify the foreign key names.
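    For example, assuming the schema is named test and the child table is tbl1:

    ```sql
    SELECT CONSTRAINT_NAME
      FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS
     WHERE TABLE_SCHEMA = 'test'
       AND TABLE_NAME = 'tbl1'
       AND CONSTRAINT_TYPE = 'FOREIGN KEY';
    ```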

    You can also drop a foreign key and its associated index in a single statement:

    ALTER TABLE table DROP FOREIGN KEY constraint, DROP INDEX index;
    

If foreign keys are already present in the table being altered (that is, it is a child table containing a FOREIGN KEY ... REFERENCES clause), additional restrictions apply to online DDL operations, even those not directly involving the foreign key columns:

  • An ALTER TABLE on the child table could wait for another transaction to commit, if a change to the parent table caused associated changes in the child table through an ON UPDATE or ON DELETE clause using the CASCADE or SET NULL parameters.

  • In the same way, if a table is the parent table in a foreign key relationship, even though it does not contain any FOREIGN KEY clauses, it could wait for the ALTER TABLE to complete if an INSERT, UPDATE, or DELETE statement caused an ON UPDATE or ON DELETE action in the child table.

Maintaining CREATE TABLE Statements

As your database schema evolves with new columns, data types, constraints, indexes, and so on, keep your CREATE TABLE statements up to date with the latest table definitions. Even with the performance improvements of online DDL, it is more efficient to create stable database structures at the beginning, rather than creating part of the schema and then issuing ALTER TABLE statements afterward.

The main exception to this guideline is for secondary indexes on tables with large numbers of rows. It is typically most efficient to create the table with all details specified except the secondary indexes, load the data, then create the secondary indexes. You can use the same technique with foreign keys (load the data first, then set up the foreign keys) if you know the initial data is clean and do not need consistency checks during the loading process.

Whatever sequence of CREATE TABLE, CREATE INDEX, ALTER TABLE, and similar statements went into putting a table together, you can capture the SQL needed to reconstruct the current form of the table by issuing the statement SHOW CREATE TABLE table\G (uppercase \G required for tidy formatting). This output shows clauses such as numeric precision, NOT NULL, and CHARACTER SET that are sometimes added behind the scenes, which you may want to leave out when cloning the table on a new system or setting up foreign key columns with identical type.

15.12.2 Online DDL Performance, Concurrency, and Space Requirements

Online DDL improves several aspects of MySQL operation, such as performance, concurrency, availability, and scalability:

  • Because queries and DML operations on the table can proceed while the DDL is in progress, applications that access the table are more responsive. Reduced locking and waiting for other resources throughout the MySQL server leads to greater scalability, even for operations not involving the table being altered.

  • For in-place operations, by avoiding the disk I/O and CPU cycles to rebuild the table, you minimize the overall load on the database and maintain good performance and high throughput during the DDL operation.

  • For in-place operations, because less data is read into the buffer pool than if all the data was copied, you avoid purging frequently accessed data from memory, which formerly could cause a temporary performance dip after a DDL operation.

If an online operation requires temporary sort files, InnoDB creates them in the temporary file directory by default, not the directory containing the original table. If this directory is not large enough to hold such files, you may need to set the tmpdir system variable to a different directory. Alternatively, you can define a separate temporary directory for InnoDB online ALTER TABLE operations using the innodb_tmpdir configuration option. For more information, see Space Requirements for Online DDL Operations, and Section B.5.3.5, “Where MySQL Stores Temporary Files”.

Locking Options for Online DDL

While an InnoDB table is being changed by a DDL operation, the table may or may not be locked, depending on the internal workings of that operation and the LOCK clause of the ALTER TABLE statement. By default, MySQL uses as little locking as possible during a DDL operation; you specify the clause either to make the locking more restrictive than it normally would be (thus limiting concurrent DML, or DML and queries), or to ensure that some expected degree of locking is allowed for an operation. If the LOCK clause specifies a level of locking that is not available for that specific kind of DDL operation, such as LOCK=SHARED or LOCK=NONE while creating or dropping a primary key, the clause works like an assertion, causing the statement to fail with an error. The following list shows the different possibilities for the LOCK clause, from the most permissive to the most restrictive:

  • For DDL operations with LOCK=NONE, both queries and concurrent DML are allowed. This clause makes the ALTER TABLE fail if the kind of DDL operation cannot be performed with the requested type of locking, so specify LOCK=NONE if keeping the table fully available is vital and it is OK to cancel the DDL if that is not possible. For example, you might use this clause in DDLs for tables involving customer signups or purchases, to avoid making those tables unavailable by mistakenly issuing an expensive ALTER TABLE statement.

  • For DDL operations with LOCK=SHARED, any writes to the table (that is, DML operations) are blocked, but the data in the table can be read. This clause makes the ALTER TABLE fail if the kind of DDL operation cannot be performed with the requested type of locking, so specify LOCK=SHARED if keeping the table available for queries is vital and it is OK to cancel the DDL if that is not possible. For example, you might use this clause in DDLs for tables in a data warehouse, where it is OK to delay data load operations until the DDL is finished, but queries cannot be delayed for long periods.

  • For DDL operations with LOCK=DEFAULT, or with the LOCK clause omitted, MySQL uses the lowest level of locking that is available for that kind of operation, allowing concurrent queries, DML, or both wherever possible. This is the setting to use when making pre-planned, pre-tested changes that you know do not cause any availability problems based on the workload for that table.

  • For DDL operations with LOCK=EXCLUSIVE, both queries and DML operations are blocked. This clause makes the ALTER TABLE fail if the kind of DDL operation cannot be performed with the requested type of locking, so specify LOCK=EXCLUSIVE if the primary concern is finishing the DDL in the shortest time possible, and it is OK to make applications wait when they try to access the table. You might also use LOCK=EXCLUSIVE if the server is supposed to be idle, to avoid unexpected accesses to the table.
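The LOCK clause is written at the end of the ALTER TABLE statement. A few hedged examples of the levels above, with illustrative table and column names:

```sql
-- Fail rather than block any concurrent reads or writes:
ALTER TABLE signups ADD COLUMN referrer VARCHAR(64), LOCK=NONE;

-- Allow queries but block DML while the DDL runs:
ALTER TABLE warehouse_facts ADD INDEX idx_region (region_id), LOCK=SHARED;

-- Block both queries and DML to finish as fast as possible:
ALTER TABLE archive_2019 DROP COLUMN notes, LOCK=EXCLUSIVE;
```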

In most cases, an online DDL operation on a table waits for currently executing transactions that are accessing the table to commit or roll back because it requires exclusive access to the table for a brief period while the DDL statement is being prepared. Likewise, the online DDL operation requires exclusive access to the table for a brief time before finishing. Thus, an online DDL statement also waits for transactions that are started while the DDL is in progress to commit or roll back before completing. Consequently, in the case of long running transactions that perform inserts, updates, deletes, or SELECT ... FOR UPDATE operations on the table, it is possible for an online DDL operation to time out waiting for exclusive access to the table.

A case in which an online DDL operation on a table does not wait for a currently executing transaction to complete can occur when the table is in a foreign key relationship and a transaction is run explicitly on the other table in the foreign key relationship. In this case, the transaction holds an exclusive metadata lock on the table it is updating, but only holds a shared InnoDB table lock (required for foreign key checking) on the other table. The shared InnoDB table lock permits the online DDL operation to proceed but blocks the operation at the commit phase, when an exclusive InnoDB table lock is required. This scenario can result in deadlocks as other transactions wait for the online DDL operation to commit. (See Bug #48652 and Bug #77390.)

Because there is some processing work involved with recording the changes made by concurrent DML operations, then applying those changes at the end, an online DDL operation could take longer overall than the old-style mechanism that blocks table access from other sessions. The reduction in raw performance is balanced against better responsiveness for applications that use the table. When evaluating the ideal techniques for changing table structure, consider end-user perception of performance, based on factors such as load times for web pages.

A newly created InnoDB secondary index contains only the committed data in the table at the time the CREATE INDEX or ALTER TABLE statement finishes executing. It does not contain any uncommitted values, old versions of values, or values marked for deletion but not yet removed from the old index.

Performance of In-Place versus Table-Copying DDL Operations

The raw performance of an online DDL operation is largely determined by whether the operation is performed in-place, or requires copying and rebuilding the entire table. See Table 15.10, “Online Status for DDL Operations” to see what kinds of operations can be performed in-place, and any requirements for avoiding table-copy operations.

The performance speedup from in-place DDL applies to operations on secondary indexes, not to the primary key index. The rows of an InnoDB table are stored in a clustered index organized based on the primary key, forming what some database systems call an index-organized table. Because the table structure is closely tied to the primary key, redefining the primary key still requires copying the data.

When an operation on the primary key uses ALGORITHM=INPLACE, even though the data is still copied, it is more efficient than using ALGORITHM=COPY because:

  • No undo logging or associated redo logging is required for ALGORITHM=INPLACE. These operations add overhead to DDL statements that use ALGORITHM=COPY.

  • The secondary index entries are pre-sorted, and so can be loaded in order.

  • The change buffer is not used, because there are no random-access inserts into the secondary indexes.

To judge the relative performance of online DDL operations, you can run such operations on a big InnoDB table using current and earlier versions of MySQL. You can also run all the performance tests under the latest MySQL version, simulating the previous DDL behavior for the before results, by setting the old_alter_table system variable. Issue the statement set old_alter_table=1 in the session, and measure DDL performance to record the before figures. Then set old_alter_table=0 to re-enable the newer, faster behavior, and run the DDL operations again to record the after figures.
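A before-and-after benchmarking session along these lines might look as follows; the table and index names are hypothetical, and elapsed times come from the client's timing output:

```sql
-- "Before" figures: simulate the pre-online-DDL behavior in this session.
SET old_alter_table = 1;
ALTER TABLE big_table ADD INDEX idx_col (col);   -- record the elapsed time
ALTER TABLE big_table DROP INDEX idx_col;

-- "After" figures: re-enable the newer, faster behavior and repeat.
SET old_alter_table = 0;
ALTER TABLE big_table ADD INDEX idx_col (col);   -- record the elapsed time
```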

For a basic idea of whether a DDL operation does its changes in-place or performs a table copy, look at the rows affected value displayed after the command finishes. For example, here are lines you might see after doing different types of DDL operations:

  • Changing the default value of a column (super-fast, does not affect the table data at all):

    Query OK, 0 rows affected (0.07 sec)
    
  • Adding an index (takes time, but 0 rows affected shows that the table is not copied):

    Query OK, 0 rows affected (21.42 sec)
    
  • Changing the data type of a column (takes substantial time and does require rebuilding all the rows of the table):

    Query OK, 1671168 rows affected (1 min 35.54 sec)
    
    Note

    Changing the data type of a column requires rebuilding all the rows of the table with the exception of changing VARCHAR size, which may be performed using online ALTER TABLE. See Modifying Column Properties for more information.

For example, before running a DDL operation on a big table, you might check whether the operation is fast or slow as follows:

  1. Clone the table structure.

  2. Populate the cloned table with a tiny amount of data.

  3. Run the DDL operation on the cloned table.

  4. Check whether the rows affected value is zero or not. A non-zero value means the operation requires rebuilding the entire table, which might require special planning. For example, you might do the DDL operation during a period of scheduled downtime, or on each replication slave server one at a time.
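The four steps above can be sketched like this; the table name and the candidate DDL are illustrative only:

```sql
-- 1. Clone the table structure (structure only, no data).
CREATE TABLE big_table_test LIKE big_table;

-- 2. Populate the clone with a tiny sample of rows.
INSERT INTO big_table_test SELECT * FROM big_table LIMIT 1000;

-- 3. Run the candidate DDL operation on the clone.
ALTER TABLE big_table_test MODIFY COLUMN amount DECIMAL(12,2);

-- 4. "Query OK, 0 rows affected" indicates an in-place change;
--    a non-zero count means the real table would be rebuilt.
DROP TABLE big_table_test;
```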

For a deeper understanding of the reduction in MySQL processing, examine the performance_schema and INFORMATION_SCHEMA tables related to InnoDB before and after DDL operations, to see the number of physical reads, writes, memory allocations, and so on.

Space Requirements for Online DDL Operations

Online DDL operations have the following space requirements:

  • Space for temporary log files

    There is one such log file for each index being created or table being altered. This log file stores data inserted, updated, or deleted in the table during the DDL operation. The temporary log file is extended when needed by the value of innodb_sort_buffer_size, up to the maximum specified by innodb_online_alter_log_max_size. If a temporary log file exceeds the upper size limit, the ALTER TABLE operation fails and all uncommitted concurrent DML operations are rolled back. Thus, a large value for this option allows more DML to happen during an online DDL operation, but also extends the period of time at the end of the DDL operation when the table is locked to apply the data from the log.

    If the operation takes so long, and concurrent DML modifies the table so much, that the size of the temporary online log exceeds the value of the innodb_online_alter_log_max_size configuration option, the online DDL operation fails with a DB_ONLINE_LOG_TOO_BIG error.
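Before running a long DDL operation on a heavily written table, you can check and raise this ceiling; innodb_online_alter_log_max_size is a dynamic global variable, and the 512MB value below is only an example:

```sql
-- Inspect the current ceiling (in bytes).
SELECT @@innodb_online_alter_log_max_size;

-- Raise it so that concurrent DML during the DDL does not
-- overflow the temporary online log.
SET GLOBAL innodb_online_alter_log_max_size = 512 * 1024 * 1024;
```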

  • Space for temporary sort files

    Online DDL operations that rebuild the table write temporary sort files to the MySQL temporary directory ($TMPDIR on Unix, %TEMP% on Windows, or the directory specified by the --tmpdir configuration variable) during index creation. Each temporary sort file is large enough to hold all columns defined for the new secondary index plus the columns that are part of the primary key of the clustered index, and each one is removed as soon as it is merged into the final table or index. Such operations may require temporary space equal to the amount of data in the table plus indexes. An online DDL operation that rebuilds the table can cause an error if the operation uses all of the available disk space on the file system where the data directory (datadir) resides.

    You can use the innodb_tmpdir configuration option to define a separate temporary directory for online DDL operations. The innodb_tmpdir option was introduced to help avoid temporary directory overflows that could occur as a result of large temporary sort files created during online ALTER TABLE operations that rebuild the table.
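Because innodb_tmpdir can be set at the session level, you can point a single large rebuild at a roomier file system; the directory path and table name here are hypothetical, and the directory must already exist and be writable by the server:

```sql
SET SESSION innodb_tmpdir = '/mnt/bigdisk/mysql-tmp';
ALTER TABLE big_table ADD INDEX idx_col (col), ALGORITHM=INPLACE;
```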

  • Space for an intermediate table file

    Some online DDL operations that rebuild the table create a temporary intermediate table file in the same directory as the original table as opposed to rebuilding the table in place. An intermediate table file may require space equal to the size of the original table. Operations that rebuild the table in place are noted in 15.12.1, “Online DDL Overview”.

15.12.3 Online DDL SQL Syntax

Typically, you do not need to do anything special to enable online DDL when using the ALTER TABLE statement for InnoDB tables. See Table 15.10, “Online Status for DDL Operations” for the kinds of DDL operations that can be performed in-place, allowing concurrent DML, or both. Some variations require particular combinations of configuration settings or ALTER TABLE clauses.

You can control the various aspects of a particular online DDL operation by using the LOCK and ALGORITHM clauses of the ALTER TABLE statement. These clauses come at the end of the statement, separated from the table and column specifications by commas. The LOCK clause is useful for fine-tuning the degree of concurrent access to the table. The ALGORITHM clause is primarily intended for performance comparisons and as a fallback to older behavior in case of problems.