• ADADADADAD

    应用示例荟萃 | performance_schema全方位介绍[ mysql数据库 ]

    mysql数据库 时间:2024-12-03 12:13:22

    作者:文/会员上传

    简介:

    经过前面6个篇幅的学习,相信大家对什么是performance_schema,已经初步形成了一个整体认识,但我想很多同行看完之前的文章之后可能还是一脸懵逼,今天就为大家带来performance_sch

    以下为本文的正文内容,内容仅供参考!本站为公益性网站,复制本文以及下载DOC文档全部免费。

    经过前面6个篇幅的学习,相信大家对什么是performance_schema,已经初步形成了一个整体认识,但我想很多同行看完之前的文章之后可能还是一脸懵逼,今天就为大家带来performance_schema系列的最后一个篇章(全系共7个篇章),在这一期里,我们将为大家列举数十个performance_schema应用示例。下面,请跟随我们一起开始performance_schema系统的学习之旅吧。

    1.利用等待事件排查MySQL性能问题

    通常,在生产服务器上线之前, 我们会对数据库服务器的硬件进行IO基准测试,对数据库进行增删改查的基准测试,建立基线参考数据,以便日后的服务器扩容或架构升级提供数据支撑。在基准测试规划时,我们通常需要选择一款基准测试软件(IO基准测试通常选择fio和iozone,MySQL数据库基准测试通常选择sysbench、tpcc-mysql、workbench等),在使用这些基准测试软件对服务器压测到一个极限值时,我们认为所得数据就是被测试服务器的最高性能。但这还不够,测试性能无法继续提升的原因还可能是因为你的服务器在BIOS设置、硬件搭配、操作系统参数、文件系统策略、数据库配置参数等方面不够优化。所以我们还需要借助一些性能排查手段来找出性能瓶颈所在,以使得我们对数据库服务器一旦上线之后可能的瓶颈点心中有数。以下我们以sysbench基准测试工具压测MySQL数据库为例,介绍如何使用performance_schema的等待事件来排查数据库性能瓶颈所在。

    首先,使用performance_schema配置表启用等待事件的采集与记录

    #启用所有的等待事件的instrumentsadmin@localhost:performance_schema11:47:46>useperformance_schemaDatabasechanged#修改setup_instruments表的enabled和timed字段为yes,表示启用对应的instrumentsadmin@localhost:performance_schema11:48:05>updatesetup_instrumentssetenabled='yes',timed='yes'wherenamelike'wait/%';QueryOK,269rowsaffected(0.00sec)Rowsmatched:323Changed:269Warnings:0#查看修改结果,enabled和timed字段为YES即表示当前instruments已经启用(但此时采集器并不会立即采集事件数据,需要保存这些等待事件的表--consumers,启用之后才会开始采集)admin@localhost:performance_schema11:49:40>select*fromsetup_instrumentswherenamelike'wait/%';+--------------------------------------------------------------------+---------+-------+|NAME|ENABLED|TIMED|+--------------------------------------------------------------------+---------+-------+|wait/synch/mutex/sql/TC_LOG_MMAP::LOCK_tc|YES|YES||wait/synch/mutex/sql/LOCK_des_key_file|YES|YES|............|wait/io/socket/sql/server_tcpip_socket|YES|YES||wait/io/socket/sql/server_unix_socket|YES|YES||wait/io/socket/sql/client_connection|YES|YES||wait/lock/metadata/sql/mdl|YES|YES|+--------------------------------------------------------------------+---------+-------+323rowsinset(0.01sec)#启用等待事件的consumersadmin@localhost:performance_schema11:48:21>updatesetup_consumerssetenabled='yes'wherenamelike'%wait%';QueryOK,3rowsaffected(0.00sec)Rowsmatched:3Changed:3Warnings:0admin@localhost:performance_schema11:49:20>select*fromsetup_consumerswherenamelike'%wait%';+---------------------------+---------+|NAME|ENABLED|+---------------------------+---------+|events_waits_current|YES||events_waits_history|YES||events_waits_history_long|YES|+---------------------------+---------+3rowsinset(0.00sec)

    然后,使用sysbench对数据库执行加压,并逐渐增加并发线程数,直到tps、qps不再随着线程数的增加而增加为止。

    sysbench--test=oltp--db-driver=mysql--mysql-table-engine=innodb--mysql-host=10.10.10.10--mysql-port=3306--mysql-db=sbtest--mysql-user='qbench'--mysql-password='qbench'--test=/usr/share/doc/sysbench/tests/db/oltp.lua--oltp-table-size=5000000--oltp-tables-count=8--num-threads=16--max-time=1800--max-requests=0--report-interval=1run............[111s]threads:16,tps:52.99,reads/s:668.93,writes/s:171.98,responsetime:629.52ms(95%)[112s]threads:16,tps:42.00,reads/s:650.93,writes/s:202.98,responsetime:688.46ms(95%)............

    从sysbench的输出结果中,我们可以看到在16个并发线程oltp压力下,tps只能跑到100不到,且延迟在600ms+,说明存在严重的性能瓶颈(或者在MySQL内部发生了严重的互斥等待,或者硬件设备严重的性能不足),现在,我们先使用操作系统命令查看硬件负载情况。

    #top命令查看到CPU资源绝大部分都消耗在了%wa上,说明IO设备性能出现严重不足[root@localhost~]#toptop-18:59:03up7:02,3users,loadaverage:4.28,5.82,4.22Tasks:186total,1running,185sleeping,0stopped,0zombieCpu0:4.1%us,8.5%sy,0.0%ni,11.9%id,75.4%wa,0.0%hi,0.0%si,0.0%stCpu1:4.0%us,13.1%sy,0.0%ni,17.5%id,65.0%wa,0.0%hi,0.3%si,0.0%stCpu2:9.4%us,32.1%sy,0.0%ni,2.3%id,55.5%wa,0.0%hi,0.7%si,0.0%stCpu3:3.0%us,5.3%sy,0.0%ni,31.0%id,60.0%wa,0.0%hi,0.7%si,0.0%stMem:8053664ktotal,1684236kused,6369428kfree,87868kbuffersSwap:2031612ktotal,0kused,2031612kfree,150680kcached#iostat命令查看磁盘负载,通过%util列可以看到,磁盘处于100%满负载状态avg-cpu:%user%nice%system%iowait%steal%idle1.770.002.2895.700.000.25Device:rrqm/swrqm/sr/sw/srsec/swsec/savgrq-szavgqu-szawaitsvctm%utildm-20.000.00277.00160.008864.002774.0026.6347.84112.982.29100.10avg-cpu:%user%nice%system%iowait%steal%idle5.050.0011.6264.140.0019.19Device:rrqm/swrqm/sr/sw/srsec/swsec/savgrq-szavgqu-szawaitsvctm%utildm-20.000.00267.00244.008544.004643.0025.8128.2040.291.96100.00

    通过查询系统负载,一眼就可以看出来是由于磁盘性能严重不足导致的,但是,在数据库内部的事件信息是如何体现的呢(注意:如果你没有足够的performance_schema使用经验,此时是绝好的学习积累的机会,不要错过,也许哪一天操作系统负载并不能看出来端倪的时候,这些事件信息能帮上大忙)

    #为了方便查询等待事件统计,我们可以先创建一个视图,用于实时统计当前等待事件(非历史数据)admin@localhost:performance_schema12:14:14>createviewsys.test_waitsasselectsum(TIMER_WAIT)asTIMER_WAIT,sum(NUMBER_OF_BYTES)asNUMBER_OF_BYTES,EVENT_NAME,OPERATIONfromevents_waits_currentwhereEVENT_NAME!='idle'groupbyEVENT_NAME,OPERATION;QueryOK,0rowsaffected(0.04sec)#使用前面创建的视图进行查询,对这个视图查询结果进行降序排序查询。从下面的查询结果中,我们可以看到时间开销排名前5的有4个都是与IO相关的等待,剩下1个是binlog相关的互斥等待admin@localhost:performance_schema12:30:38>selectsys.format_time(TIMER_WAIT),sys.format_bytes(NUMBER_OF_BYTES),EVENT_NAME,OPERATIONfromsys.test_waitswheresys.format_time(TIMER_WAIT)notregexp'ns|us'orderbyTIMER_WAITdesc;+-----------------------------+-----------------------------------+------------------------------------------------+------------+|sys.format_time(TIMER_WAIT)|sys.format_bytes(NUMBER_OF_BYTES)|EVENT_NAME|OPERATION|+-----------------------------+-----------------------------------+------------------------------------------------+------------+|16.60s|224.00KiB|wait/io/file/innodb/innodb_data_file|read||16.05s|553bytes|wait/io/table/sql/handler|fetch||1.96s|NULL|wait/io/file/sql/binlog|sync||1.96s|NULL|wait/synch/cond/sql/MYSQL_BIN_LOG::update_cond|timed_wait||1.85s|1.34KiB|wait/io/file/sql/binlog|write||56.66ms|NULL|wait/io/file/innodb/innodb_log_file|sync|+-----------------------------+-----------------------------------+------------------------------------------------+------------+6rowsinset(0.01sec)#当然,你也可以直接查询events_waits_current表(返回数据行数可能比较多,且查询结果并没有做分组聚合,是逐行的事件记录数据)admin@localhost:performance_schema11:59:25>selectTHREAD_ID,EVENT_NAME,sys.format_time(TIMER_WAIT),INDEX_NAME,NESTING_EVENT_TYPE,OPERATION,NUMBER_OF_BYTESfromevents_waits_currentwhereEVENT_NAME!='idle'orderbyTIMER_WAITdesc;+-----------+------------------------------------------------+-----------------------------+------------+--------------------+------------+-----------------+|THREAD_ID|EVENT_NAME|sys.format_time(TIMER_WAIT)|INDEX_NAME|NESTING_EVENT_TYPE|OPERATION|NUMBER_OF_BYTES|+-----------+------------------------------------------------+-----------------------------+------------+--------------------+------------+-----------------+|115|wait/io/table/sql/handler|169.48ms|PRIMARY|STATEMENT|fetch|39||115|wait/io/file/innodb/innodb_data_file|169.48ms|NULL|WAIT|read|16384||101|wait/io/table/sql/handler|93.76ms|PRIMARY|STATEMENT|fetch|39||101|wait/io/file/innodb/innodb_data_file|93.76ms|NULL|WAIT|read|16384||111|wait/io/file/innodb/innodb_data_file|73.08ms|NULL|STATEMENT|read|16384||103|wait/io/file/innodb/innodb_data_file|63.13ms|NULL|STATEMENT|read|16384||106|wait/io/file/innodb/innodb_data_file|53.24ms|NULL|STATEMENT|read|16384||113|wait/io/table/sql/handler|51.90ms|PRIMARY|STATEMENT|fetch|39||113|wait/io/file/innodb/innodb_data_file|51.90ms|NULL|WAIT|read|16384||49|wait/synch/cond/sql/MYSQL_BIN_LOG::update_cond|27.48ms|NULL|STATEMENT|timed_wait|NULL|............57rowsinset(0.00sec)

    从上述等待事件的查询结果中,我们可以非常清晰地看到,事务大多数的延迟时间花在了等待IO上(主要是undo log、redo log,独立表空间文件,binlog的fetch和read系统调用),说明IO设备可能出现了严重的性能瓶颈,这里与操作系统命令查看到的磁盘性能严重不足相对应。

    结论:通过以上测试数据表明,MySQL的性能严重低下的原因是因为磁盘性能严重不足成为了瓶颈(一般情况下,4个core的cpu在内存和磁盘不构成瓶颈的情况下可以达到800+ tps才可能会构成瓶颈)

    针对IO性能不足,建议优化策略:

      更换IO性能更好的设备

      新增2个独立的相同设备,把MySQL 中的redo log、binlog、其他data file分别放在3个独立的IO设备上,以便数据库中的随机IO和顺序IO不会因为相互争抢资源而导致IO等待

      PS:

      当然,你也许会说,我们在这个案例里故意使用一台配置很差的服务器。是的没错。但我们可以去思考一个问题:performance_schema到底能够对我们使用MySQL提供多大帮助呢?对于目前来讲,互联网上并不能找到太多靠谱的performance_schema使用经验,需要我们不断地去挖掘。我们建议有条件的同行可以准备两台测试服务器(一台低配置,一台高配置服务器),通过对比测试数据你就能得出performance_schema的使用经验了,正所谓没有对比就没有伤害。

      2.锁问题排查

      2.1 找出谁持有全局读锁

      全局读锁通常是由flush table with read lock;这类语句添加,这类语句通常是在各种备份工具为了拿到一致性备份时使用,另外,在具有主从复制架构的环境中做主备切换时也常常使用,除了这两种情况之外,还有一种情况也是最难排查的一种情况,那就是线上系统权限约束不规范的时候,各种人员使用的数据库帐号都具有RELOAD权限时,都可以对数据库加全局读锁。

      在MySQL 5.7之前的版本,要排查谁持有全局读锁通常在数据库层面是很难直接查询到有用数据(innodb_locks表也只能记录innodb层的锁信息,而全局读锁是server层的锁,所以也无法查询到),从MySQL 5.7开始提供表performance_schema.metadata_locks表记录一些Server层的锁信息(包括全局读锁和MDL锁等),下面我们通过一个示例来演示然后使用performance_schema来找出谁持有全局读锁。

      首先,开启第一个会话,执行全局读锁。

      #执行加锁语句root@localhost:sbtest12:28:30>flushtablewithreadlock;QueryOK,0rowsaffected(0.00sec)#查询以下加锁线程的processid,以便后续排查过程好对应root@localhost:sbtest12:31:48>selectconnection_id();+-----------------+|connection_id()|+-----------------+|4|+-----------------+1rowinset(0.00sec)

      现在,我们开启第二个会话执行任意可能对数据造成修改的语句,我们就以update操作为例吧。

      root@localhost:sbtest12:42:36>usesbtestDatabasechangedroot@localhost:sbtest12:42:38>select*fromsbtest1limit1\G;***************************1.row***************************id:21k:2483476c:09279210219-37745839908-56185699327-79477158641-86711242956-61449540392-42622804506-61031512845-36718422840-11028803849pad:96813293060-05308009118-09223341195-19224109585-455981618481rowinset(0.00sec)ERROR:Noqueryspecifiedroot@localhost:sbtest12:42:39>selectconnection_id();+-----------------+|connection_id()|+-----------------+|5|+-----------------+1rowinset(0.00sec)root@localhost:sbtest12:42:44>updatesbtest1setpad='xxx'whereid=21;操作被阻塞

      现在,我们开启第三个会话,开始使用一些手段进行排查。

      root@localhost:(none)12:42:25>selectconnection_id();+-----------------+|connection_id()|+-----------------+|16|+-----------------+1rowinset(0.00sec)#查询processlist信息,这里只能看到processid为5的线程State为Waitingforglobalreadlock,表示正在等待全局读锁root@localhost:(none)12:43:11>showprocesslist;+----+-------+---------------------+--------+-------------+-------+---------------------------------------------------------------+------------------------------------------+|Id|User|Host|db|Command|Time|State|Info|+----+-------+---------------------+--------+-------------+-------+---------------------------------------------------------------+------------------------------------------+|3|qfsys|192.168.2.168:41042|NULL|BinlogDump|11457|Masterhassentallbinlogtoslave;waitingformoreupdates|NULL||4|root|localhost|sbtest|Sleep|234||NULL||5|root|localhost|sbtest|Query|26|Waitingforglobalreadlock|updatesbtest1setpad='xxx'whereid=21||16|root|localhost|NULL|Query|0|starting|showprocesslist|+----+-------+---------------------+--------+-------------+-------+---------------------------------------------------------------+------------------------------------------+4rowsinset(0.00sec)#继续查询information_schema.innodb_locks、innodb_lock_waits、innodb_trx表,发现三个表均为空root@localhost:(none)12:59:30>select*frominformation_schema.innodb_locks;Emptyset,1warning(0.00sec)root@localhost:(none)12:59:40>select*frominformation_schema.innodb_lock_waits;Emptyset,1warning(0.00sec)root@localhost:(none)12:59:43>select*frominformation_schema.innodb_trx\GEmptyset(0.00sec)#再使用showengineinnodbstatus;查看一把(这里只需要看TRANSACTION段落即可),仍然无任何有用的锁信息root@localhost:(none)12:59:48>showengineinnodbstatus;......=====================================2018-06-2513:01:430x7fe55ded8700INNODBMONITOROUTPUT=====================================......------------TRANSACTIONS------------Trxidcounter2527502Purgedonefortrx'sn:o<2527500undon:o<0state:runningbutidleHistorylistlength3LISTOFTRANSACTIONSFOREACHSESSION:---TRANSACTION422099353083504,notstarted0lockstruct(s),heapsize1136,0rowlock(s)---TRANSACTION422099353082592,notstarted0lockstruct(s),heapsize1136,0rowlock(s)---TRANSACTION422099353081680,notstarted0lockstruct(s),heapsize1136,0rowlock(s)--------FILEI/O......

      通过上面的常规手段查询下来,无任何有用信息,这个时候,有gdb调试经验的老鸟估计就要开始使用gdb,strace,pstack什么的命令查看MySQL 调用栈、线程信息什么的了,但这对于没有C语言基础的人来说,基本上是看天书,好在从MySQL 5.7版本开始,提供performance_schema.metadata_locks表,该表记录了各种Server层的锁信息(包括全局读锁和MDL锁信息),下面我们开启第三个会话查询该表试试看。

      #我们还可以通过performance_schema.metadata_locks表来排查谁持有全局读锁,全局读锁通常在该表记录着同一个会话的OBJECT_TYPE为global和commit、LOCK_TYPE都为SHARED的两把显式锁,如下root@localhost:(none)01:01:43>select*fromperformance_schema.metadata_lockswhereOWNER_THREAD_ID!=sys.ps_thread_id(connection_id())\G;***************************1.row***************************OBJECT_TYPE:GLOBALOBJECT_SCHEMA:NULLOBJECT_NAME:NULLOBJECT_INSTANCE_BEGIN:140621322913984LOCK_TYPE:SHARED#共享锁LOCK_DURATION:EXPLICIT#显式LOCK_STATUS:GRANTED#已授予SOURCE:lock.cc:1110OWNER_THREAD_ID:94#持有锁的内部线程ID为94OWNER_EVENT_ID:16***************************2.row***************************OBJECT_TYPE:COMMITOBJECT_SCHEMA:NULLOBJECT_NAME:NULLOBJECT_INSTANCE_BEGIN:140621322926064LOCK_TYPE:SHARED#共享锁LOCK_DURATION:EXPLICIT#显式LOCK_STATUS:GRANTED#已授予SOURCE:lock.cc:1194OWNER_THREAD_ID:94#持有锁的内部线程ID为94OWNER_EVENT_ID:16***************************3.row***************************OBJECT_TYPE:GLOBALOBJECT_SCHEMA:NULLOBJECT_NAME:NULLOBJECT_INSTANCE_BEGIN:140621391527216LOCK_TYPE:INTENTION_EXCLUSIVE#意向排它锁LOCK_DURATION:STATEMENT#语句LOCK_STATUS:PENDING#状态为pending,表示正在等待被授予SOURCE:sql_base.cc:3190OWNER_THREAD_ID:95#被阻塞的内部线程ID为95OWNER_EVENT_ID:383rowsinset(0.00sec)#查看processid为4,5各自对应的内部线程ID是多少root@localhost:(none)01:33:36>selectsys.ps_thread_id(4);+---------------------+|sys.ps_thread_id(4)|+---------------------+|94|#processid=4的线程对应的内部线程ID正好为94,说明就是processid=4的线程持有了全局读锁+---------------------+1rowinset(0.00sec)root@localhost:(none)01:34:10>selectsys.ps_thread_id(5);+---------------------+|sys.ps_thread_id(5)|+---------------------+|95|#procesid=5的线程对应的内部线程正好是95,说明在等待全局读锁的就是processid=5的线程+---------------------+1rowinset(0.00sec)

      如果是生产环境,综合上述信息,通过show processlist信息中对应的process id=4的行记录中找到user、host、db信息,大致判断一下是属于什么业务用途,找相关人员询问清楚,该杀掉就杀掉,顺便讨论下今后如何避免这个问题。

      2.2 找出谁持有MDL锁

      我们可能经常会发现执行语句时被阻塞等待MDL锁,例如:使用show processlist;语句查看线程信息时可能会发现State列为"Waiting for table metadata lock",碰到这种情况我们应该如何去排查是谁持有了MDL锁没有释放呢,下面我们尝试着进行MDL锁等待场景模拟(mdl锁记录对应的instruments为wait/lock/metadata/sql/mdl,默认未启用,对应的consumers为performance_schema.metadata_locks,在setup_consumers只受全局配置项global_instrumentation控制,默认启用)。

      首先,打开两个会话,分别执行如下语句。

      #会话1,显式开启一个事务,并执行一个update语句更新sbtest1表不提交root@localhost:sbtest12:26:25>usesbtestDatabasechangedroot@localhost:sbtest12:26:30>begin;QueryOK,0rowsaffected(0.00sec)root@localhost:sbtest12:26:32>updatesbtest1setpad='yyy'whereid=1;QueryOK,1rowaffected(0.00sec)Rowsmatched:1Changed:1Warnings:0#会话2,对sbtest1表执行DDL语句添加一个普通索引root@localhost:sbtest12:42:50>usesbtestDatabasechangedroot@localhost:sbtest12:42:56>altertablesbtest1addindexi_c(c);#被阻塞

      此时,我们另外开启一个会话3,使用show processlist语句查询线程信息,可以发现update语句正在等待MDL锁(Waiting for table metadata lock)。

      root@localhost:(none)12:37:49>showprocesslist;+----+------+-----------+--------+---------+------+---------------------------------+--------------------------------------+|Id|User|Host|db|Command|Time|State|Info|+----+------+-----------+--------+---------+------+---------------------------------+--------------------------------------+|92|root|localhost|sbtest|Query|121|Waitingfortablemetadatalock|altertablesbtest1addindexi_c(c)||93|root|localhost|NULL|Query|0|starting|showprocesslist||94|root|localhost|sbtest|Sleep|1078||NULL|+----+------+-----------+--------+---------+------+---------------------------------+--------------------------------------+3rowsinset(0.00sec)

      在MySQL 5.7版本之前,我们不能从数据库层面很直观地去查询谁持有MDL锁信息(当然你可以说你会gdb之类的工具来查看,但这类工具的使用需要具有一定c编程语言基础),现在,我们可以通过查询performance_schema.metadata_locks表得知MDL锁信息,发现有5行MDL锁记录,其中, 第一行为sbtest.sbtest1表的SHARED_WRITE锁,处于GRANTED状态,为136线程持有(对应process id为94),其他后续4行中,有sbtest.sbtest1表的SHARED_UPGRADABLE、EXCLUSIVE锁,其中SHARED_UPGRADABLE处于GRANTED状态,EXCLUSIVE处于PENDING状态,为134线程持有(对应process id为92)。说明134线程在等待MDL锁。

      root@localhost:(none)01:23:05>select*fromperformance_schema.metadata_lockswhereOWNER_THREAD_ID!=sys.ps_thread_id(connection_id())\G;***************************1.row***************************OBJECT_TYPE:TABLEOBJECT_SCHEMA:sbtestOBJECT_NAME:sbtest1OBJECT_INSTANCE_BEGIN:139886013386816LOCK_TYPE:SHARED_WRITELOCK_DURATION:TRANSACTIONLOCK_STATUS:GRANTEDSOURCE:sql_parse.cc:5996OWNER_THREAD_ID:136OWNER_EVENT_ID:721***************************2.row***************************OBJECT_TYPE:GLOBALOBJECT_SCHEMA:NULLOBJECT_NAME:NULLOBJECT_INSTANCE_BEGIN:139886348911600LOCK_TYPE:INTENTION_EXCLUSIVELOCK_DURATION:STATEMENTLOCK_STATUS:GRANTEDSOURCE:sql_base.cc:5497OWNER_THREAD_ID:134OWNER_EVENT_ID:4667***************************3.row***************************OBJECT_TYPE:SCHEMAOBJECT_SCHEMA:sbtestOBJECT_NAME:NULLOBJECT_INSTANCE_BEGIN:139886346748096LOCK_TYPE:INTENTION_EXCLUSIVELOCK_DURATION:TRANSACTIONLOCK_STATUS:GRANTEDSOURCE:sql_base.cc:5482OWNER_THREAD_ID:134OWNER_EVENT_ID:4667***************************4.row***************************OBJECT_TYPE:TABLEOBJECT_SCHEMA:sbtestOBJECT_NAME:sbtest1OBJECT_INSTANCE_BEGIN:139886346749984LOCK_TYPE:SHARED_UPGRADABLELOCK_DURATION:TRANSACTIONLOCK_STATUS:GRANTEDSOURCE:sql_parse.cc:5996OWNER_THREAD_ID:134OWNER_EVENT_ID:4669***************************5.row***************************OBJECT_TYPE:TABLEOBJECT_SCHEMA:sbtestOBJECT_NAME:sbtest1OBJECT_INSTANCE_BEGIN:139886348913168LOCK_TYPE:EXCLUSIVELOCK_DURATION:TRANSACTIONLOCK_STATUS:PENDINGSOURCE:mdl.cc:3891OWNER_THREAD_ID:134OWNER_EVENT_ID:47485rowsinset(0.00sec)

      通过上述数据,我们知道了是哪个线程持有了MDL锁,通过show processlist语句的查询结果可以看到process id为94的线程已经长时间处于sleep状态,但是我们在这里并不能看到这个线程执行了什么语句,我们可能需要查询一下information_schema.innodb_trx表,确认一下该线程是否存在着一个没有提交的事务。如下,通过查询该表发现process id为94(trx_mysql_thread_id=94)的线程确实有一个未提交的事务,但并没有太多的有用信息,除了一个事务开始时间和process id(trx_started: 2018-01-14 01:19:25, trx_mysql_thread_id: 94)

      root@localhost:(none)01:32:17>select*frominformation_schema.innodb_trx\G;***************************1.row***************************trx_id:2452892trx_state:RUNNINGtrx_started:2018-01-1401:19:25trx_requested_lock_id:NULLtrx_wait_started:NULLtrx_weight:3trx_mysql_thread_id:94......1rowinset(0.00sec)

      此时,从我们掌握的所有数据信息来看,虽然知道了是136线程的事务没有提交导致的134线程发生MDL锁等待,但是我们并不知道136线程正在做什么事情。我们当然可以kill掉136线程让134线程继续往下执行,但是我们不知道136线程在执行什么语句,就无法找到相关的开发人员进行优化,下次我们还可能再次碰到类似的问题,所以,我们还可以借助performance_schema.events_statements_current表来查询某个线程正在执行或者说最后一次执行完成的语句事件信息(这里信息并不一定可靠,因为该表中对于每个线程只能记录当前正在执行和最近一次执行完成的语句事件信息,一旦这个线程执行新的语句,信息就会被覆盖),如下:

      root@localhost:(none)01:47:53>select*fromperformance_schema.events_statements_currentwherethread_id=136\G;***************************1.row***************************THREAD_ID:136EVENT_ID:715END_EVENT_ID:887EVENT_NAME:statement/sql/updateSOURCE:socket_connection.cc:101......SQL_TEXT:updatesbtest1setpad='yyy'whereid=1DIGEST:69f516aa8eaa67fd6e7bfd3352de5d58DIGEST_TEXT:UPDATE`sbtest1`SET`pad`=?WHERE`id`=?CURRENT_SCHEMA:sbtest......MESSAGE_TEXT:Rowsmatched:1Changed:1Warnings:0......1rowinset(0.00sec)

      从performance_schema.events_statements_current 表的查询信息中,通过SQL_TEXT字段我们可以清晰地看到该线程正在执行的SQL语句是什么。如果是生产环境,现在,你可以去找相关的开发人员交涉,下次碰到类似的语句必须及时提交,避免下次再发生类似的问题。

      2.3 找出谁持有表级锁

      表级锁对应的instruments(wait/lock/table/sql/handler)默认启用,对应的consumers表为performance_schema.table_handles在setup_consumers只受全局配置项global_instrumentation控制,默认启用。所以,默认情况下只需要设置系统配置参数performance_schema=ON即可,下面我们通过一个示例演示如何找出谁持有表级锁。

      首先,开启两个会话,第一个会话对一个表(innodb引擎)执行显式加表级锁,第二个会话对该表执行DML语句操作。

      #会话1加表级锁root@localhost:sbtest02:15:17>usesbtestDatabasechangedroot@localhost:sbtest02:40:27>selectconnection_id();+-----------------+|connection_id()|+-----------------+|18|+-----------------+1rowinset(0.00sec)root@localhost:sbtest02:40:29>locktablesbtest1read;QueryOK,0rowsaffected(0.00sec)#会话2对该表执行update更新root@localhost:sbtest10:26:37>usesbtestDatabasechangedroot@localhost:sbtest02:15:33>selectconnection_id();+-----------------+|connection_id()|+-----------------+|19|+-----------------+1rowinset(0.00sec)root@localhost:sbtest02:40:34>updatesbtest1setpad='xxx'whereid=1;#被阻塞

      然后,开启第三个会话,使用show processlist语句查询线程信息,可以发现update语句正在等待MDL锁(Waiting for table metadata lock)

      root@localhost:(none)02:40:14>showprocesslist;+----+-------+---------------------+--------+-------------+-------+---------------------------------------------------------------+-----------------------------------------+|Id|User|Host|db|Command|Time|State|Info|+----+-------+---------------------+--------+-------------+-------+---------------------------------------------------------------+-----------------------------------------+|3|qfsys|192.168.2.168:41042|NULL|BinlogDump|18565|Masterhassentallbinlogtoslave;waitingformoreupdates|NULL||18|root|localhost|sbtest|Sleep|67||NULL||19|root|localhost|sbtest|Query|51|Waitingfortablemetadatalock|updatesbtest1setpad='xxx'whereid=1||20|root|localhost|NULL|Query|0|starting|showprocesslist|+----+-------+---------------------+--------+-------------+-------+---------------------------------------------------------------+-----------------------------------------+4rowsinset(0.00sec)

      如上所示,既然是等待MDL锁,那么我们在会话3查询performance_schema.metadata_locks表,记录的顺序代表持有锁的时间顺序,如下:

      root@localhost:(none)02:41:41>select*fromperformance_schema.metadata_lockswhereOWNER_THREAD_ID!=sys.ps_thread_id(connection_id())\G;***************************1.row***************************OBJECT_TYPE:TABLEOBJECT_SCHEMA:sbtestOBJECT_NAME:sbtest1OBJECT_INSTANCE_BEGIN:140622530920576LOCK_TYPE:SHARED_READ_ONLYLOCK_DURATION:TRANSACTIONLOCK_STATUS:GRANTEDSOURCE:sql_parse.cc:5996OWNER_THREAD_ID:113#内部ID为113的线程被授予了LOCK_TYPE:SHARED_READ_ONLY,持有该所的线程不允许其他线程修改sbtest1表的数据OWNER_EVENT_ID:11***************************2.row***************************OBJECT_TYPE:GLOBALOBJECT_SCHEMA:NULLOBJECT_NAME:NULLOBJECT_INSTANCE_BEGIN:140620517607728LOCK_TYPE:INTENTION_EXCLUSIVELOCK_DURATION:STATEMENTLOCK_STATUS:GRANTEDSOURCE:sql_base.cc:3190OWNER_THREAD_ID:114#内部ID为114的线程被授予了LOCK_TYPE:INTENTION_EXCLUSIVE,但这只是个意向锁OWNER_EVENT_ID:12***************************3.row***************************OBJECT_TYPE:TABLEOBJECT_SCHEMA:sbtestOBJECT_NAME:sbtest1OBJECT_INSTANCE_BEGIN:140620517607824LOCK_TYPE:SHARED_WRITELOCK_DURATION:TRANSACTIONLOCK_STATUS:PENDINGSOURCE:sql_parse.cc:5996OWNER_THREAD_ID:114#内部ID为114的线程正在等待LOCK_TYPE:SHARED_WRITE被授予OWNER_EVENT_ID:123rowsinset(0.00sec)

      排查陷入僵局,我们知道MDL锁非常常见,对表的绝大部分的操作都会先对表加MDL锁(根据performance_schema.metadata_locks表中记录的锁信息也不顶用了),通常看到这些信息时,我们可能会立马联想到,需要去查询一下information_schema下的三张关于Innodb引擎的锁和事务信息表(INNODB_LOCK_WAITS、INNODB_LOCKS、INNODB_TRX),我们尝试着查看一下这三张表(会话3执行),可以发现都没有记录。

      root@localhost:(none)02:41:53>select*frominformation_schema.INNODB_TRX;Emptyset(0.00sec)root@localhost:(none)02:42:58>select*frominformation_schema.INNODB_LOCKS;Emptyset,1warning(0.00sec)root@localhost:(none)02:43:02>select*frominformation_schema.INNODB_LOCK_WAITS;Emptyset,1warning(0.00sec)

      当然,可能有的人会说,就4个线程,第二个会话的"Command "为sleep,应该可能是它,把它kill试试看。是的,在该案例中确实可以做这个尝试,但如果是生产环境中有数十上百个正常的长连接处于sleep状态呢?这个时候我们就不能挨个去尝试了,这个时候我们可以尝试着去查询一些表级别的锁信息(通过会话3查询performance_schema.table_handles表),如下:

      root@localhost:(none)02:43:06>select*fromperformance_schema.table_handleswhereOWNER_THREAD_ID!=0\G;***************************1.row***************************OBJECT_TYPE:TABLEOBJECT_SCHEMA:sbtestOBJECT_NAME:sbtest1OBJECT_INSTANCE_BEGIN:140622530923216OWNER_THREAD_ID:113OWNER_EVENT_ID:11INTERNAL_LOCK:NULLEXTERNAL_LOCK:READEXTERNAL#发现内部ID为113的线程持有了sbtest1表的READEXTERNAL表级锁,这个也是为什么内部ID为114的线程无法获取到MDL写锁的原因1rowinset(0.00sec)

      通过上述查询到的相关数据,113线程对sbtest1表显式加了表级读锁,而且长时间处于sleep状态,但我们并不知道该线程正在执行什么SQL语句,我们可以通过performance_schema.events_statements_current表查询,如下:

      root@localhost:(none)02:43:22>select*fromperformance_schema.events_statements_currentwherethread_id=113\G;***************************1.row***************************THREAD_ID:113EVENT_ID:10END_EVENT_ID:10EVENT_NAME:statement/sql/lock_tablesSOURCE:socket_connection.cc:101TIMER_START:18503556405463000TIMER_END:18503556716572000TIMER_WAIT:311109000LOCK_TIME:293000000SQL_TEXT:locktablesbtest1read#这里可以看到,内部ID为113的线程对表sbtest1执行了加读锁语句DIGEST:9f987e807ca36e706e33275283b5572bDIGEST_TEXT:LOCKTABLE`sbtest1`READCURRENT_SCHEMA:sbtest......1rowinset(0.00sec)

      从performance_schema.events_statements_current 表的查询信息中,通过SQL_TEXT字段我们可以清晰地看到该线程正在执行的SQL语句是什么。如果是生产环境,现在,你可以去找相关的开发人员确认,如果没有什么特殊操作,就可以尝试着kill掉这个线程(会话3执行,processlist_id为18),同时针对这个问题进行优化,避免下次再发生类似的问题。

      #如何知道内部ID113对应的processid是多少呢?我们可以通过performance_schema.threads表查询root@localhost:(none)02:48:19>selectprocesslist_idfromperformance_schema.threadswherethread_id=113;+----------------+|processlist_id|+----------------+|18|+----------------+1rowinset(0.00sec)#执行killroot@localhost:(none)02:48:24>kill18;QueryOK,0rowsaffected(0.00sec)root@localhost:(none)02:48:40>showprocesslist;+----+-------+---------------------+--------+-------------+-------+---------------------------------------------------------------+------------------+|Id|User|Host|db|Command|Time|State|Info|+----+-------+---------------------+--------+-------------+-------+---------------------------------------------------------------+------------------+|3|qfsys|192.168.2.168:41042|NULL|BinlogDump|18994|Masterhassentallbinlogtoslave;waitingformoreupdates|NULL||19|root|localhost|sbtest|Sleep|480||NULL||20|root|localhost|NULL|Query|0|starting|showprocesslist|+----+-------+---------------------+--------+-------------+-------+---------------------------------------------------------------+------------------+3rowsinset(0.00sec)#返回执行update语句的会话2,语句已经执行成功root@localhost:sbtest02:40:34>updatesbtest1setpad='xxx'whereid=1;QueryOK,0rowsaffected(7min50.23sec)Rowsmatched:0Changed:0Warnings:0
      2.4 找出谁持有行级锁

      该案例中涉及的performance_schema.data_lock表在MySQL 8.0中新增,在8.0之前的版本中不支持,在这里仅作为针对MySQL 5.7的performance_schema的一个延伸学习

        如果一个事务长时间未提交,我们虽然可以从information_schema.innodb_trx、performance_schema.events_transactions_current等表中查询到相应的事务信息,但却无从知道这个事务持有了哪些锁。虽然information_schema.innodb_locks表是用于记录事务锁信息的,但需要在两个不同的事务发生锁等待时该表才会记录两个事务的锁信息。从8.0开始,在performance_schema中提供了一个data_locks表用于记录任意事务的锁信息(同时废弃了information_schema.innodb_locks表),不需要有锁等待关系存在(注意,该表中只记录innodb存储引擎层的锁),如下

        首先,我们在8.0中打开一个会话(会话1),显式开启一个事务。

        root@localhost:xiaoboluo01:26:09>usexiaoboluoDatabasechangedroot@localhost:xiaoboluo01:26:19>select*fromt_luoxiaobolimit1;+----+------+---------------------+|id|test|datet_time|+----+------+---------------------+|2|1|2017-09-0601:11:59|+----+------+---------------------+1rowinset(0.00sec)root@localhost:xiaoboluo01:26:21>begin;QueryOK,0rowsaffected(0.00sec)root@localhost:xiaoboluo01:26:23>updatet_luoxiaobosetdatet_time=now()whereid=2;QueryOK,1rowaffected(0.00sec)Rowsmatched:1Changed:1Warnings:0

        打开另外一个会话(会话2)查询data_locks表

        root@localhost:performance_schema01:27:21>select*fromdata_locks\G;***************************1.row***************************ENGINE:INNODBENGINE_LOCK_ID:55562:62ENGINE_TRANSACTION_ID:55562THREAD_ID:54#持有线程内部IDEVENT_ID:85OBJECT_SCHEMA:xiaoboluo#库名OBJECT_NAME:t_luoxiaobo#表名PARTITION_NAME:NULLSUBPARTITION_NAME:NULLINDEX_NAME:NULL#索引名称OBJECT_INSTANCE_BEGIN:140439793477144LOCK_TYPE:TABLE#表级锁LOCK_MODE:IX#IX锁LOCK_STATUS:GRANTED#被授予状态LOCK_DATA:NULL***************************2.row***************************ENGINE:INNODBENGINE_LOCK_ID:55562:2:4:2ENGINE_TRANSACTION_ID:55562THREAD_ID:54#持有锁线程内部IDEVENT_ID:85OBJECT_SCHEMA:xiaoboluo#库名OBJECT_NAME:t_luoxiaobo#表名PARTITION_NAME:NULLSUBPARTITION_NAME:NULLINDEX_NAME:PRIMARY#索引为主键OBJECT_INSTANCE_BEGIN:140439793474104LOCK_TYPE:RECORD#记录锁LOCK_MODE:X#排它锁LOCK_STATUS:GRANTED#被授予状态LOCK_DATA:2#被锁定的数据记录,这里的记录对应的是INDEX_NAME:PRIMARY的value2rowsinset(0.00sec)

        从查询结果中我们可以看到,有两行锁记录,第一行是对表t_luoxiaobo的IX锁,状态为GRANTED,第二行为使用主键索引的X锁记录锁,状态为GRANTED 。

        现在,我们模拟两个DML发生锁等待的场景。我们新开一个会话(会话3),在会话1中的事务未提交的情况下,会话3对表t_luoxiaobo执行同样的操作。

        root@localhost:(none)09:34:49>usexiaoboluoDatabasechangedroot@localhost:xiaoboluo09:34:54>begin;QueryOK,0rowsaffected(0.00sec)root@localhost:xiaoboluo09:34:55>updatet_luoxiaobosetdatet_time=now()whereid=2;#被阻塞

        回到会话2中查询data_locks表,可以发现有4行锁记录了

        root@localhost:(none)09:35:18>select*fromperformance_schema.data_locks\G;***************************1.row***************************......THREAD_ID:55......LOCK_TYPE:TABLELOCK_MODE:IXLOCK_STATUS:GRANTEDLOCK_DATA:NULL***************************2.row***************************ENGINE:INNODBENGINE_LOCK_ID:55563:2:4:2ENGINE_TRANSACTION_ID:55563THREAD_ID:55#内部线程IDEVENT_ID:8OBJECT_SCHEMA:xiaoboluoOBJECT_NAME:t_luoxiaoboPARTITION_NAME:NULLSUBPARTITION_NAME:NULLINDEX_NAME:PRIMARY#发生锁的索引名称OBJECT_INSTANCE_BEGIN:140439793480168LOCK_TYPE:RECORD#记录锁LOCK_MODE:X#排它锁LOCK_STATUS:WAITING#正在等待锁被授予LOCK_DATA:2#锁定的索引value,这里与内部ID为54线程持有的主键值为2的X锁完全一样,说明这里就是被内部ID为54线程阻塞了***************************3.row***************************......THREAD_ID:54.......LOCK_TYPE:TABLELOCK_MODE:IXLOCK_STATUS:GRANTEDLOCK_DATA:NULL***************************4.row***************************......THREAD_ID:54EVENT_ID:85OBJECT_SCHEMA:xiaoboluoOBJECT_NAME:t_luoxiaoboPARTITION_NAME:NULLSUBPARTITION_NAME:NULLINDEX_NAME:PRIMARYOBJECT_INSTANCE_BEGIN:140439793474104LOCK_TYPE:RECORDLOCK_MODE:XLOCK_STATUS:GRANTEDLOCK_DATA:24rowsinset(0.00sec)

        从上面的查询数据可以看到,performance_schema.data_locks表中新增了线程ID为55的两个锁记录,IX锁状态为GRANTED,X锁状态为WAITING,说明正在等待被授予。但这里并不能很直观地查看到锁等待关系,我们可以使用sys.innodb_lock_waits视图查看。

        root@localhost:(none)09:44:52>select*fromsys.innodb_lock_waits\G;***************************1.row***************************wait_started:2018-01-1421:51:59wait_age:00:00:11wait_age_secs:11locked_table:`xiaoboluo`.`t_luoxiaobo`locked_table_schema:xiaoboluolocked_table_name:t_luoxiaobolocked_table_partition:NULLlocked_table_subpartition:NULLlocked_index:PRIMARYlocked_type:RECORDwaiting_trx_id:55566waiting_trx_started:2018-01-1421:51:59waiting_trx_age:00:00:11waiting_trx_rows_locked:1waiting_trx_rows_modified:0waiting_pid:8waiting_query:updatet_luoxiaobosetdatet_time=now()whereid=2waiting_lock_id:55566:2:4:2waiting_lock_mode:Xblocking_trx_id:55562blocking_pid:7blocking_query:NULLblocking_lock_id:55562:2:4:2blocking_lock_mode:Xblocking_trx_started:2018-01-1421:34:44blocking_trx_age:00:17:26blocking_trx_rows_locked:1blocking_trx_rows_modified:1sql_kill_blocking_query:KILLQUERY7sql_kill_blocking_connection:KILL71rowinset(0.02sec)

        PS:在MySQL 5.7版本中,也可以使用sys.innodb_lock_waits视图查询,但是在8.0中,该视图联结查询的表不同(把之前版本中使用的information_schema.innodb_locks和information_schema.innodb_lock_waits表替换为了performance_schema.data_locks和performance_schema.data_lock_waits表),另外,在MySQL 5.6及其之前的版本中默认情况下并没有sys库,我们可以使用如下语句代替:

        SELECTr.trx_wait_startedASwait_started,TIMEDIFF(NOW(),r.trx_wait_started)ASwait_age,TIMESTAMPDIFF(SECOND,r.trx_wait_started,NOW())ASwait_age_secs,rl.lock_tableASlocked_table,rl.lock_indexASlocked_index,rl.lock_typeASlocked_type,r.trx_idASwaiting_trx_id,r.trx_startedaswaiting_trx_started,TIMEDIFF(NOW(),r.trx_started)ASwaiting_trx_age,r.trx_rows_lockedASwaiting_trx_rows_locked,r.trx_rows_modifiedASwaiting_trx_rows_modified,r.trx_mysql_thread_idASwaiting_pid,sys.format_statement(r.trx_query)ASwaiting_query,rl.lock_idASwaiting_lock_id,rl.lock_modeASwaiting_lock_mode,b.trx_idASblocking_trx_id,b.trx_mysql_thread_idASblocking_pid,sys.format_statement(b.trx_query)ASblocking_query,bl.lock_idASblocking_lock_id,bl.lock_modeASblocking_lock_mode,b.trx_startedASblocking_trx_started,TIMEDIFF(NOW(),b.trx_started)ASblocking_trx_age,b.trx_rows_lockedASblocking_trx_rows_locked,b.trx_rows_modifiedASblocking_trx_rows_modified,CONCAT('KILLQUERY',b.trx_mysql_thread_id)ASsql_kill_blocking_query,CONCAT('KILL',b.trx_mysql_thread_id)ASsql_kill_blocking_connectionFROMinformation_schema.innodb_lock_waitswINNERJOINinformation_schema.innodb_trxbONb.trx_id=w.blocking_trx_idINNERJOINinformation_schema.innodb_trxrONr.trx_id=w.requesting_trx_idINNERJOINinformation_schema.innodb_locksblONbl.lock_id=w.blocking_lock_idINNERJOINinformation_schema.innodb_locksrlONrl.lock_id=w.requested_lock_idORDERBYr.trx_wait_started;
        3. 查看最近的SQL执行信息

        3.1 查看最近的top sql

        使用performance_schema中的语句当前事件记录表和语句事件历史记录表可以查询数据库中最近执行的一些SQL语句,以及语句相关的信息,这里我们以events_statements_history表为例,查询结果按照语句完成时间倒序排序,如下:

        root@localhost:performance_schema04:33:33>selectTHREAD_ID,EVENT_NAME,SOURCE,sys.format_time(TIMER_WAIT),sys.format_time(LOCK_TIME),SQL_TEXT,CURRENT_SCHEMA,MESSAGE_TEXT,ROWS_AFFECTED,ROWS_SENT,ROWS_EXAMINEDfromevents_statements_historywhereCURRENT_SCHEMA!='performance_schema'orderbyTIMER_WAITdesclimit10\G;***************************1.row***************************THREAD_ID:114EVENT_NAME:statement/sql/updateSOURCE:socket_connection.cc:101sys.format_time(TIMER_WAIT):24.93msys.format_time(LOCK_TIME):24.93mSQL_TEXT:updatesbtest1setpad='xxx'whereid=1CURRENT_SCHEMA:sbtestMESSAGE_TEXT:Rowsmatched:0Changed:0Warnings:0ROWS_AFFECTED:0ROWS_SENT:0ROWS_EXAMINED:0***************************2.row***************************THREAD_ID:114EVENT_NAME:statement/sql/updateSOURCE:socket_connection.cc:101sys.format_time(TIMER_WAIT):7.84msys.format_time(LOCK_TIME):7.84mSQL_TEXT:updatesbtest1setpad='xxx'whereid=1CURRENT_SCHEMA:sbtestMESSAGE_TEXT:Rowsmatched:0Changed:0Warnings:0ROWS_AFFECTED:0ROWS_SENT:0ROWS_EXAMINED:0......10rowsinset(0.00sec)

        按照我们通常优化慢SQL的原则,优先优化执行次数最多的,然后是执行时间最长的,以上的查询结果并不是我们通常说的top sql,我们可以使用events_statements_summary_by_digest表来查询经过统计之后的top sql。

        root@localhost:performance_schema05:04:41>selectSCHEMA_NAME,DIGEST_TEXT,COUNT_STAR,sys.format_time(SUM_TIMER_WAIT)assum_time,sys.format_time(MIN_TIMER_WAIT)asmin_time,sys.format_time(AVG_TIMER_WAIT)asavg_time,sys.format_time(MAX_TIMER_WAIT)asmax_time,sys.format_time(SUM_LOCK_TIME)assum_lock_time,SUM_ROWS_AFFECTED,SUM_ROWS_SENT,SUM_ROWS_EXAMINEDfromevents_statements_summary_by_digestwhereSCHEMA_NAMEisnotnullorderbyCOUNT_STARdesclimit10\G***************************1.row***************************SCHEMA_NAME:sbtestDIGEST_TEXT:UPDATE`sbtest1`SET`pad`=?WHERE`id`=?COUNT_STAR:10sum_time:2.19hmin_time:216.90usavg_time:13.15mmax_time:1.50hsum_lock_time:2.04hSUM_ROWS_AFFECTED:3SUM_ROWS_SENT:0SUM_ROWS_EXAMINED:4***************************2.row***************************SCHEMA_NAME:sbtestDIGEST_TEXT:SHOWWARNINGSCOUNT_STAR:9sum_time:397.62usmin_time:16.50usavg_time:44.18usmax_time:122.58ussum_lock_time:0psSUM_ROWS_AFFECTED:0SUM_ROWS_SENT:0SUM_ROWS_EXAMINED:0......***************************5.row***************************SCHEMA_NAME:sbtestDIGEST_TEXT:SELECT*FROM`sbtest1`LIMIT?COUNT_STAR:5sum_time:138.93msmin_time:145.77usavg_time:27.79msmax_time:112.29mssum_lock_time:95.53msSUM_ROWS_AFFECTED:0SUM_ROWS_SENT:104SUM_ROWS_EXAMINED:104......10rowsinset(0.00sec)

        PS: events_statements_summary_by_digest 表中记录的SQL语句文本并不完整,默认情况下只截取了1024个字节,且也是使用这1024个字节的SQL文本进行hash计算,把hashcode相同的累计计算在一起,performance_schema提供的数据只能算作慢日志分析的一个补充,如果需要完整的SQL语句文本还得依赖慢查询日志分析

        3.2 查看最近执行失败的SQL

        曾经有同事问,代码对数据库的某些操作(比如:python的ORM模块操作数据库)报了语法错误,但是代码并没有记录SQL语句文本的功能,问在MySQL数据库层能否查看到具体的SQL文本,看看是否有哪里写错了。这个时候,大多数人首先想到的就是去查看错误日志。很遗憾,对于SQL语句的语法错误,错误日志并不会记录。如果你没有完全了解performance_schema,那么你很可能就会给同事回复说:MySQL层面也并没有记录语法错误的信息

        实际上,performance_schema的语句事件记录表中针对每一个语句的执行状态都记录了较为详细的信息,例如:events_statements_表和events_statements_summary_by_digest表(events_statements_表记录的语句所有的执行错误信息,但events_statements_summary_by_digest表只记录了语句在执行过程中发生的错误的语句记录统计,具体的错误类型不记录,例如:语法错误类的不记录),下面我们分别演示如何使用这两个表查询语句发生错误的语句信息

        首先,我们模拟一个语法错误的SQL,使用events_statements_history_long或者events_statements_history表查询发生语法错误的SQL语句,开启一个会话(会话1)

        root@localhost:performance_schema05:18:09>select*from;ERROR1064(42000):YouhaveanerrorinyourSQLsyntax;checkthemanualthatcorrespondstoyourMySQLserverversionfortherightsyntaxtousenear''atline1

        然后,我们查询events_statements_history_long表中错误号为1064的记录,开启另一个会话(会话2)

        root@localhost:sbtest05:32:55>useperformance_schemaDatabasechangedroot@localhost:performance_schema05:33:03>selectTHREAD_ID,EVENT_NAME,SOURCE,sys.format_time(TIMER_WAIT)asexec_time,sys.format_time(LOCK_TIME)aslock_time,SQL_TEXT,CURRENT_SCHEMA,MESSAGE_TEXT,ROWS_AFFECTED,ROWS_SENT,ROWS_EXAMINED,MYSQL_ERRNOfromevents_statements_historywhereMYSQL_ERRNO=1064\G;***************************1.row***************************THREAD_ID:119EVENT_NAME:statement/sql/errorSOURCE:socket_connection.cc:101exec_time:71.72uslock_time:0psSQL_TEXT:select*fromCURRENT_SCHEMA:sbtestMESSAGE_TEXT:YouhaveanerrorinyourSQLsyntax;checkthemanualthatcorrespondstoyourMySQLserverversionfortherightsyntaxtouseROWS_AFFECTED:0ROWS_SENT:0ROWS_EXAMINED:0MYSQL_ERRNO:10641rowinset(0.01sec)

        可能你不知道错误号是多少,可以查询发生错误次数不为0的语句记录,在里边找到MESSAGE_TEXT字段提示信息为语法错误的就是它了。

        root@localhost:performance_schema05:34:00>selectTHREAD_ID,EVENT_NAME,SOURCE,sys.format_time(TIMER_WAIT)asexec_time,sys.format_time(LOCK_TIME)aslock_time,SQL_TEXT,CURRENT_SCHEMA,MESSAGE_TEXT,ROWS_AFFECTED,ROWS_SENT,ROWS_EXAMINED,MYSQL_ERRNO,errorsfromevents_statements_historywhereerrors>0\G;***************************1.row***************************THREAD_ID:119EVENT_NAME:statement/sql/errorSOURCE:socket_connection.cc:101exec_time:71.72uslock_time:0psSQL_TEXT:select*fromCURRENT_SCHEMA:sbtestMESSAGE_TEXT:YouhaveanerrorinyourSQLsyntax;checkthemanualthatcorrespondstoyourMySQLserverversionfortherightsyntaxtouseROWS_AFFECTED:0ROWS_SENT:0ROWS_EXAMINED:0MYSQL_ERRNO:1064errors:11rowinset(0.00sec)

        使用events_statements_summary_by_digest表查询发生语句执行错误的SQL语句记录,首先,我们在会话1制造一两个语句执行一定会发生错误的语句。

        root@localhost:sbtest05:32:34>select*;ERROR1096(HY000):Notablesusedroot@localhost:sbtest05:40:57>select*fromsbtest4whereidbetween100and2000andxx=1;ERROR1054(42S22):Unknowncolumn'xx'in'whereclause'

        然后,我们在events_statements_summary_by_digest表中查询发生错误次数大于0的记录,在会话2执行。

        root@localhost:performance_schema05:34:03>selectSCHEMA_NAME,DIGEST_TEXT,COUNT_STAR,sys.format_time(AVG_TIMER_WAIT)asavg_time,sys.format_time(MAX_TIMER_WAIT)asmax_time,sys.format_time(SUM_LOCK_TIME)assum_lock_time,SUM_ERRORS,FIRST_SEEN,LAST_SEENfromevents_statements_summary_by_digestwhereSUM_ERRORS!=0\G;***************************1.row***************************......***************************10.row***************************SCHEMA_NAME:sbtestDIGEST_TEXT:SELECT*#这里就是第一个执行错误的语句COUNT_STAR:1avg_time:55.14usmax_time:55.14ussum_lock_time:0psSUM_ERRORS:1FIRST_SEEN:2018-06-2517:40:57LAST_SEEN:2018-06-2517:40:57***************************11.row***************************SCHEMA_NAME:sbtestDIGEST_TEXT:SELECT*FROM`sbtest4`WHERE`id`BETWEEN?AND?AND`xx`=?#这里就是第二个执行错误的语句COUNT_STAR:1avg_time:101.68usmax_time:101.68ussum_lock_time:0psSUM_ERRORS:1FIRST_SEEN:2018-06-2517:41:03LAST_SEEN:2018-06-2517:41:0311rowsinset(0.00sec)

        PS:我们前面说过,events_statements_summary_by_digest表中不记录具体的错误信息,只做错误语句统计,所以,如果需要查询到具体的错误信息(如:具体的错误代码,具体的错误提示信息以及具体的错误SQL文本等),还需要查询events_statements_history或者events_statements_history_long表。

        root@localhost:performance_schema05:45:03>selectTHREAD_ID,EVENT_NAME,SOURCE,sys.format_time(TIMER_WAIT)asexec_time,sys.format_time(LOCK_TIME)aslock_time,SQL_TEXT,CURRENT_SCHEMA,MESSAGE_TEXT,ROWS_AFFECTED,ROWS_SENT,ROWS_EXAMINED,MYSQL_ERRNOfromevents_statements_historywhereMYSQL_ERRNO!=0\G;***************************1.row***************************......***************************2.row***************************THREAD_ID:119EVENT_NAME:statement/sql/selectSOURCE:socket_connection.cc:101exec_time:55.14uslock_time:0psSQL_TEXT:select*CURRENT_SCHEMA:sbtestMESSAGE_TEXT:NotablesusedROWS_AFFECTED:0ROWS_SENT:0ROWS_EXAMINED:0MYSQL_ERRNO:1096***************************3.row***************************THREAD_ID:119EVENT_NAME:statement/sql/selectSOURCE:socket_connection.cc:101exec_time:101.68uslock_time:0psSQL_TEXT:select*fromsbtest4whereidbetween100and2000andxx=1CURRENT_SCHEMA:sbtestMESSAGE_TEXT:Unknowncolumn'xx'in'whereclause'ROWS_AFFECTED:0ROWS_SENT:0ROWS_EXAMINED:0MYSQL_ERRNO:10543rowsinset(0.00sec)
        4. 查看SQL执行进度信息

        MariaDB分支支持一个不依赖于performance_schema性能数据的进度展示功能,我们通过show processlist语句返回结果的最后一列就是进度信息。

        root@localhostSunJan1414:08:29201814:08:29[(none)]>showprocesslist;+----+------+-----------+-----------+---------+------+----------------+-------------------------------------------------+----------+|Id|User|Host|db|Command|Time|State|Info|Progress|+----+------+-----------+-----------+---------+------+----------------+-------------------------------------------------+----------+|4|root|localhost|employees|Query|6|alteringtable|altertablesalariesaddindexi_salary(salary)|93.939||5|root|localhost|NULL|Query|0|init|showprocesslist|0.000|+----+------+-----------+-----------+---------+------+----------------+-------------------------------------------------+----------+2rowsinset(0.00sec)

        在MySQL中也提供了类似的功能,通过阶段事件中具有可预估工作量的阶段事件进行记录与计算,就可以得到一个语句的执行的阶段信息和进度信息,下面我们分别举例介绍如何查看

        4.1 查看SQL执行阶段

        首先, 我们需要进行配置启用,阶段事件默认并未启用,开启一个会话(会话1)

        root@localhost:performance_schema05:59:26>useperformance_schemaDatabasechangedroot@localhost:performance_schema05:59:45>updatesetup_instrumentssetenabled='yes',timed='yes'wherenamelike'stage/%';QueryOK,120rowsaffected(0.00sec)Rowsmatched:129Changed:120Warnings:0root@localhost:performance_schema05:59:47>updatesetup_consumerssetenabled='yes'wherenamelike'%stage%';QueryOK,3rowsaffected(0.00sec)Rowsmatched:3Changed:3Warnings:0

        开启第二个会话(会话2),查询thread_id

        root@localhost:sbtest06:02:22>selectsys.ps_thread_id(connection_id());+-----------------------------------+|sys.ps_thread_id(connection_id())|+-----------------------------------+|119|+-----------------------------------+1rowinset(0.00sec)

        先对之前旧的信息进行清理,避免干扰(会话1)

        #先关闭其他线程的事件记录功能,使用前面步骤查询到的thread_idroot@localhost:performance_schema06:05:38>updateperformance_schema.threadssetINSTRUMENTED='NO'whereTHREAD_ID!=119;QueryOK,101rowsaffected(0.00sec)Rowsmatched:101Changed:101Warnings:0#清空阶段事件的3张表root@localhost:performance_schema05:59:52>truncateevents_stages_current;truncateevents_stages_history;truncateevents_stages_history_long;QueryOK,0rowsaffected(0.00sec)QueryOK,0rowsaffected(0.00sec)QueryOK,0rowsaffected(0.02sec)

        现在,回到会话2执行DML语句

        root@localhost:sbtest06:06:37>selectcount(*)fromsbtest.sbtest4whereidbetween100and200;+----------+|count(*)|+----------+|50|+----------+1rowinset(0.00sec)

        在会话1中查询events_stages_history_long表

        root@localhost:performance_schema06:10:37>selectTHREAD_ID,EVENT_NAME,SOURCE,sys.format_time(TIMER_WAIT)asexec_time,WORK_COMPLETED,WORK_ESTIMATEDfromevents_stages_history_long;+-----------+--------------------------------+--------------------------+-----------+----------------+----------------+|THREAD_ID|EVENT_NAME|SOURCE|exec_time|WORK_COMPLETED|WORK_ESTIMATED|+-----------+--------------------------------+--------------------------+-----------+----------------+----------------+|119|stage/sql/starting|socket_connection.cc:107|54.19us|NULL|NULL||119|stage/sql/checkingpermissions|sql_authorization.cc:810|3.62us|NULL|NULL||119|stage/sql/Openingtables|sql_base.cc:5650|10.54us|NULL|NULL||119|stage/sql/init|sql_select.cc:121|16.73us|NULL|NULL||119|stage/sql/Systemlock|lock.cc:323|4.77us|NULL|NULL||119|stage/sql/optimizing|sql_optimizer.cc:151|4.78us|NULL|NULL||119|stage/sql/statistics|sql_optimizer.cc:367|50.54us|NULL|NULL||119|stage/sql/preparing|sql_optimizer.cc:475|7.79us|NULL|NULL||119|stage/sql/executing|sql_executor.cc:119|381.00ns|NULL|NULL||119|stage/sql/Sendingdata|sql_executor.cc:195|36.75us|NULL|NULL||119|stage/sql/end|sql_select.cc:199|931.00ns|NULL|NULL||119|stage/sql/queryend|sql_parse.cc:4968|5.31us|NULL|NULL||119|stage/sql/closingtables|sql_parse.cc:5020|2.26us|NULL|NULL||119|stage/sql/freeingitems|sql_parse.cc:5596|8.71us|NULL|NULL||119|stage/sql/cleaningup|sql_parse.cc:1902|449.00ns|NULL|NULL|+-----------+--------------------------------+--------------------------+-----------+----------------+----------------+15rowsinset(0.01sec)

        通过以上的查询数据可以清晰地看到一个select语句的执行全过程,以及每一个过程的时间开销等信息,那DDL语句的执行阶段又是怎样的呢?

        先对之前旧的信息进行清理,避免干扰(会话1)

        root@localhost:performance_schema06:10:48>truncateevents_stages_current;truncateevents_stages_history;truncateevents_stages_history_long;QueryOK,0rowsaffected(0.00sec)QueryOK,0rowsaffected(0.00sec)QueryOK,0rowsaffected(0.02sec)

        然后,执行DDL语句(会话2)

        root@localhost:sbtest03:37:32>altertablesbtest1addindexi_c(c);

        此时,在会话1中查询阶段事件信息(此时DDL语句并未执行完成,从最后一行记录信息中可以看到,WORK_COMPLETED 和WORK_ESTIMATED 列值不为NULL,表示该阶段事件是一个可以度量的事件)

        root@localhost:performance_schema06:30:04>selectTHREAD_ID,EVENT_NAME,SOURCE,sys.format_time(TIMER_WAIT)asexec_time,WORK_COMPLETED,WORK_ESTIMATEDfromevents_stages_history_long;+-----------+------------------------------------------------------+--------------------------+-----------+----------------+----------------+|THREAD_ID|EVENT_NAME|SOURCE|exec_time|WORK_COMPLETED|WORK_ESTIMATED|+-----------+------------------------------------------------------+--------------------------+-----------+----------------+----------------+|119|stage/sql/starting|socket_connection.cc:107|44.17us|NULL|NULL||119|stage/sql/checkingpermissions|sql_authorization.cc:810|1.46us|NULL|NULL||119|stage/sql/checkingpermissions|sql_authorization.cc:810|2.29us|NULL|NULL||119|stage/sql/init|sql_table.cc:9031|2.16us|NULL|NULL||119|stage/sql/Openingtables|sql_base.cc:5650|107.57us|NULL|NULL||119|stage/sql/setup|sql_table.cc:9271|19.19us|NULL|NULL||119|stage/sql/creatingtable|sql_table.cc:5222|1.06ms|NULL|NULL||119|stage/sql/Aftercreate|sql_table.cc:5355|76.22us|NULL|NULL||119|stage/sql/Systemlock|lock.cc:323|4.38us|NULL|NULL||119|stage/sql/preparingforaltertable|sql_table.cc:7454|28.63ms|NULL|NULL||119|stage/sql/alteringtable|sql_table.cc:7508|3.91us|NULL|NULL||119|stage/innodb/altertable(readPKandinternalsort)|ut0stage.h:241|27.09s|230040|470155|+-----------+------------------------------------------------------+--------------------------+-----------+----------------+----------------+12rowsinset(0.01sec)

        待到DDL语句执行完成之后,我们再次查看阶段事件信息(会话1)

        root@localhost:performance_schema06:31:07>selectTHREAD_ID,EVENT_NAME,SOURCE,sys.format_time(TIMER_WAIT)asexec_time,WORK_COMPLETED,WORK_ESTIMATEDfromevents_stages_history_long;+-----------+------------------------------------------------------+--------------------------+-----------+----------------+----------------+|THREAD_ID|EVENT_NAME|SOURCE|exec_time|WORK_COMPLETED|WORK_ESTIMATED|+-----------+------------------------------------------------------+--------------------------+-----------+----------------+----------------+......|119|stage/innodb/altertable(readPKandinternalsort)|ut0stage.h:241|27.09s|230040|470155||119|stage/innodb/altertable(mergesort)|ut0stage.h:501|1.15m|345060|512319||119|stage/innodb/altertable(insert)|ut0stage.h:501|11.83s|460146|523733||119|stage/innodb/altertable(flush)|ut0stage.h:501|18.35s|523658|523733||119|stage/innodb/altertable(logapplyindex)|ut0stage.h:501|54.63ms|524042|524042||119|stage/innodb/altertable(flush)|ut0stage.h:501|21.18us|524042|524042||119|stage/sql/committingaltertabletostorageengine|sql_table.cc:7535|5.12us|NULL|NULL||119|stage/innodb/altertable(end)|ut0stage.h:501|233.52ms|524042|524042|......+-----------+------------------------------------------------------+--------------------------+-----------+----------------+----------------+24rowsinset(0.01sec)

        通过以上的查询数据可以清晰地看到一个alter语句添加索引的执行全过程,以及每一个过程的时间开销等信息,执行时间最长的是stage/innodb/alter table (merge sort),其次是stage/innodb/alter table (read PK and internal sort),说明在本示例中创建索引主要的时间开销在于内部的数据排序和排序合并操作

        PS:阶段事件长历史记录表中的数据产生较快,默认的10000行配额可能很快就被打满了,可在配置文件中把配额调整为一个较大值,以便完整查看DDL语句执行阶段(例如:performance_schema_events_stages_history_long_size=1000000,同时要注意关掉其他不相干的任务)

        4.2 查看SQL执行进度

        在官方MySQL 版本中,performance_schema下并没有很直观地查询整个语句执行进度的方法,但是可以借助后续章节中介绍的sys.session视图进行查看。

        root@localhost:performance_schema04:16:38>select*fromsys.sessionwhereconn_id!=connection_id()\G;***************************1.row***************************thd_id:45conn_id:4......state:altertable(mergesort)time:30current_statement:altertablesbtest1addindexi_c(c)statement_latency:29.42sprogress:46.40#进度百分比在这里lock_latency:2.19msrows_examined:0rows_sent:0rows_affected:0tmp_tables:0tmp_disk_tables:0full_scan:NO......program_name:mysql1rowinset(0.33sec)
        5. 查看最近的事务执行信息

        虽然,我们可以通过慢查询日志查询到一个语句的执行总时长,但,如果数据库中存在着一些大事务执行过程中回滚了,或者说执行过程中异常终止,这个时候慢查询日志就爱莫能助了,这个时候我们可以借助performance_schema的events_transactions_*表进行查看事务相关的记录,这些表中详细记录了是否有事务被回滚、活跃(长事件未提交的事务也属于活跃事件)或已提交等,下面我们分别模拟几种事务情况,并查看事务事件记录表。

        首先,我们需要进行配置启用,事务事件默认并未启用(会话1)。

        root@localhost:performance_schema04:16:59>updatesetup_instrumentssetenabled='yes',timed='yes'wherenamelike'transaction';QueryOK,1rowaffected(0.00sec)Rowsmatched:1Changed:1Warnings:0root@localhost:performance_schema04:23:12>updatesetup_consumerssetenabled='yes'wherenamelike'%transaction%';QueryOK,3rowsaffected(0.00sec)Rowsmatched:3Changed:3Warnings:0

        执行清理,避免其他事务干扰(会话1)

        root@localhost:performance_schema04:30:25>truncateevents_transactions_current;truncateevents_transactions_history;truncateevents_transactions_history_long;QueryOK,0rowsaffected(0.00sec)QueryOK,0rowsaffected(0.00sec)QueryOK,0rowsaffected(0.00sec)

        然后,开启一个新会话(会话2)用于执行事务,并模拟事务回滚

        root@localhost:sbtest04:18:34>usesbtestDatabasechangedroot@localhost:sbtest04:24:27>begin;QueryOK,0rowsaffected(0.00sec)root@localhost:sbtest04:25:02>updatesbtest1setpad='yyy'whereid=1;QueryOK,1rowaffected(0.01sec)Rowsmatched:1Changed:1Warnings:0

        会话1查询活跃事务,活跃事务表示当前正在执行的事务事件,需要从events_transactions_current表查询。

        root@localhost:performance_schema04:33:44>selectTHREAD_ID,EVENT_NAME,STATE,TRX_ID,GTID,SOURCE,TIMER_WAIT,ACCESS_MODE,ISOLATION_LEVEL,AUTOCOMMIT,NESTING_EVENT_ID,NESTING_EVENT_TYPEfromevents_transactions_current\G;***************************1.row***************************THREAD_ID:47EVENT_NAME:transactionSTATE:ACTIVETRX_ID:NULLGTID:AUTOMATICSOURCE:transaction.cc:209TIMER_WAIT:21582764879000ACCESS_MODE:READWRITEISOLATION_LEVEL:READCOMMITTEDAUTOCOMMIT:NONESTING_EVENT_ID:30NESTING_EVENT_TYPE:STATEMENT1rowinset(0.00sec)

        会话2,回滚事务,被回滚完成的事务不再活跃

        root@localhost:sbtest04:25:08>rollback;QueryOK,0rowsaffected(0.01sec)

        会话1,查询事务事件历史记录表events_transactions_history_long

        root@localhost:performance_schema04:27:34>selectTHREAD_ID,EVENT_NAME,STATE,TRX_ID,GTID,SOURCE,TIMER_WAIT,ACCESS_MODE,ISOLATION_LEVEL,AUTOCOMMIT,NESTING_EVENT_ID,NESTING_EVENT_TYPEfromevents_transactions_history_long\G;***************************1.row***************************THREAD_ID:45EVENT_NAME:transactionSTATE:ROLLEDBACKTRX_ID:NULLGTID:AUTOMATICSOURCE:transaction.cc:209TIMER_WAIT:39922043951000ACCESS_MODE:READWRITEISOLATION_LEVEL:READCOMMITTEDAUTOCOMMIT:NONESTING_EVENT_ID:194NESTING_EVENT_TYPE:STATEMENT1rowinset(0.00sec)

        可以看到在事务事件表中记录了一行事务事件信息,线程ID为45的线程执行了一个事务,事务状态为ROLLED BACK,现在,我们来模拟事务正常提交。

        #会话2root@localhost:sbtest04:40:27>begin;QueryOK,0rowsaffected(0.00sec)root@localhost:sbtest04:40:29>updatesbtest1setpad='yyy'whereid=1;QueryOK,1rowaffected(0.00sec)Rowsmatched:1Changed:1Warnings:0root@localhost:sbtest04:40:31>commit;QueryOK,0rowsaffected(0.01sec)#会话1root@localhost:performance_schema04:38:32>selectTHREAD_ID,EVENT_NAME,STATE,TRX_ID,GTID,SOURCE,TIMER_WAIT,ACCESS_MODE,ISOLATION_LEVEL,AUTOCOMMIT,NESTING_EVENT_ID,NESTING_EVENT_TYPEfromevents_transactions_current\G;***************************1.row***************************THREAD_ID:44EVENT_NAME:transactionSTATE:COMMITTEDTRX_ID:421759004106352GTID:AUTOMATICSOURCE:handler.cc:1421TIMER_WAIT:87595486000ACCESS_MODE:READWRITEISOLATION_LEVEL:READCOMMITTEDAUTOCOMMIT:YESNESTING_EVENT_ID:24003703NESTING_EVENT_TYPE:STATEMENT***************************2.row***************************THREAD_ID:47EVENT_NAME:transactionSTATE:COMMITTEDTRX_ID:NULLGTID:ec123678-5e26-11e7-9d38-000c295e08a0:181879SOURCE:transaction.cc:209TIMER_WAIT:7247256746000ACCESS_MODE:READWRITEISOLATION_LEVEL:READCOMMITTEDAUTOCOMMIT:NONESTING_EVENT_ID:55NESTING_EVENT_TYPE:STATEMENT2rowsinset(0.00sec)

        从上面的查询数据可以看到,第二行事务事件记录中的事务事件为COMMITTED状态,表示事务已经提交成功

        PS:如果一个事务长时间未提交(长事件处于ACTIVE状态),这种情况虽然我们从events_transactions_current表中可以查询到未提交事务事件信息,但是并不能很直观地看到事务是什么时间点开始的,我们可以借助于information_schema.innodb_trx表来进行辅助判断。

        root@localhost:performance_schema04:57:50>select*frominformation_schema.innodb_trx\G;***************************1.row***************************trx_id:2454336trx_state:RUNNINGtrx_started:2018-01-1416:43:29trx_requested_lock_id:NULLtrx_wait_started:NULLtrx_weight:3trx_mysql_thread_id:6......1rowinset(0.00sec)
        6. 查看多线程复制报错详情

        官方MySQL 从5.6版本开始支持基于库级别的并行复制,在MySQL 5.7版本中支持基于事务的并行复制,在我们启用了并行复制之后,如果一旦发生复制报错,通常通过show slave status语句无法查看到具体的报错详情(show slave status语句只能查看到SQL线程的报错信息,而在多线程复制下,SQL线程的报错信息是根据worker线程的报错信息的一个汇总信息),类似如下:

        admin@localhost:(none)12:45:19>showslavestatus\G;............Last_Errno:1062Last_Error:Coordinatorstoppedbecausetherewereerror(s)intheworker(s).Themostrecentfailurebeing:Worker1failedexecutingtransaction'23fb5832-e4bc-11e7-8ea4-525400a4b2e1:2553990'atmasterlogmysql-bin.000034,end_log_pos98797.Seeerrorlogand/orperformance_schema.replication_applier_status_by_workertableformoredetailsaboutthisfailureorothers,ifany.............Last_SQL_Errno:1062Last_SQL_Error:Coordinatorstoppedbecausetherewereerror(s)intheworker(s).Themostrecentfailurebeing:Worker1failedexecutingtransaction'23fb5832-e4bc-11e7-8ea4-525400a4b2e1:2553990'atmasterlogmysql-bin.000034,end_log_pos98797.Seeerrorlogand/orperformance_schema.replication_applier_status_by_workertableformoredetailsaboutthisfailureorothers,ifany.............1rowinset(0.00sec)

        根据报错提示查看performance_schema.replication_applier_status_by_worker表,该表中详细记录了每一个worker线程的详细信息,从这里我们就可以找到发生报错的worker线程具体的报错原因。

        admin@localhost:(none)12:51:53>select*fromperformance_schema.replication_applier_status_by_workerwhereLAST_ERROR_MESSAGE!=''\G;***************************1.row***************************CHANNEL_NAME:WORKER_ID:2THREAD_ID:NULLSERVICE_STATE:OFFLAST_SEEN_TRANSACTION:23fb5832-e4bc-11e7-8ea4-525400a4b2e1:2553991LAST_ERROR_NUMBER:1062LAST_ERROR_MESSAGE:Worker2failedexecutingtransaction'23fb5832-e4bc-11e7-8ea4-525400a4b2e1:2553991'atmasterlogmysql-bin.000034,end_log_pos99514;CouldnotexecuteWrite_rowseventontablesbtest.sbtest4;Duplicateentry'833353'forkey'PRIMARY',Error_code:1062;handlererrorHA_ERR_FOUND_DUPP_KEY;theevent'smasterlogFIRST,end_log_pos99514LAST_ERROR_TIMESTAMP:2018-01-0214:08:581rowinset(0.00sec)

        从查询performance_schema.replication_applier_status_by_worker表可以发现,具体的复制报错信息是因为主键冲突了

        PS:由于历史原因,performance_schema中的复制信息记录表只记录跟GTID相关的信息,而在mysql系统字典库下的slave_master_info、slave_relay_log_info、slave_worker_info表记录的是跟binlog position相关的信息。另外,如果选择相关的复制信息记录到文件,那么磁盘上还存在着master.info、relay_log.info等文件记录binlog position相关的信息。

        | 作者简介

        罗小波·沃趣科技高级数据库技术专家

        IT从业多年,历任运维工程师,高级运维工程师,运维经理,数据库工程师,曾参与版本发布系统,轻量级监控系统,运维管理平台,数据库管理平台的设计与编写,熟悉MySQL的体系结构时,InnoDB存储引擎,喜好专研开源技术,追求完美。

    应用示例荟萃 | performance_schema全方位介绍.docx

    将本文的Word文档下载到电脑

    推荐度:

    下载