Database hanging on switch logfile

Recently a critical problem hit the production databases on an Exadata X8. The application teams reported that the systems were down, all databases remained unavailable for four days, and it took several meetings with Oracle engineers until the problem could be worked around.

Symptoms: when a database was started, one of the RAC nodes stayed in MOUNT status while the second node came up in READ WRITE. Any attempt to connect to the database froze; all sessions hung and the database stopped responding, so the only way to bring it down was SHUTDOWN ABORT. This behavior initially appeared on a 12.2.0.1 database and, within a few hours, affected all databases on both 12c and 19c.
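
A quick way to confirm this split state is to check the instance status on both RAC nodes at once. A minimal check, assuming the same prod2 database used in the examples below; in the scenario described here, one instance would report MOUNTED while the other reports OPEN:

SYS@prod2 AS SYSDBA> select inst_id, instance_name, status from gv$instance order by inst_id;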

Reviewing the alert log, several LGWR wait messages were found:

LGWR (ospid: 57744) waits for event 'LGWR any worker group' for 81 secs.
2023-07-01T02:01:42.167640-03:00
LG00 (ospid: 57752) waits for event 'LGWR worker group ordering' for 83 secs.
2023-07-01T02:01:46.351397-03:00
LG04 (ospid: 57783) has not called a wait for 87 secs.
2023-07-01T02:01:50.163474-03:00
LGWR (ospid: 57744) waits for event 'LGWR any worker group' for 91 secs.

LGWR (ospid: 274451) has not called a wait for 115 secs.
2023-06-30T12:35:20.455730-03:00
LGWR (ospid: 274451) has not called a wait for 125 secs.
2023-06-30T12:35:30.551891-03:00
LGWR (ospid: 274451) has not called a wait for 135 secs.
2023-06-30T12:35:40.487719-03:00
LGWR (ospid: 274451) has not called a wait for 145 secs.
2023-06-30T12:35:50.567833-03:00
LGWR (ospid: 274451) has not called a wait for 155 secs.
2023-06-30T12:36:00.487031-03:00
LGWR (ospid: 274451) has not called a wait for 165 secs.
2023-06-30T12:36:10.567764-03:00
LGWR (ospid: 274451) has not called a wait for 175 secs.
2023-06-30T12:36:20.503833-03:00
LGWR (ospid: 274451) has not called a wait for 185 secs.
2023-06-30T12:36:30.583830-03:00
LGWR (ospid: 274451) has not called a wait for 195 secs.
2023-06-30T12:36:40.511774-03:00
LGWR (ospid: 274451) has not called a wait for 205 secs.
2023-06-30T12:36:50.600833-03:00
LGWR (ospid: 274451) has not called a wait for 215 secs.
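
The same condition can also be watched from a surviving SYSDBA session by listing what LGWR and its worker processes (LG00-LG04 in the excerpt above) are currently waiting on. A hedged sketch against gv$session; the LIKE filters on the program column are an assumption about how the background processes are named on this platform:

select inst_id, program, event, state, seconds_in_wait
from gv$session
where program like '%(LGWR)%' or program like '%(LG0%'
order by inst_id, program;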

When a switch logfile was attempted, the database froze and no operation could be completed. The alert log showed wait messages for the LGWR event 'kfk: async disk IO':

SYS@prod2 AS SYSDBA> alter system switch logfile;

LGWR (ospid: 48245) waits for event 'kfk: async disk IO' for 72 secs.
2023-06-30T06:31:53.646648-03:00
LGWR (ospid: 48245) waits for event 'kfk: async disk IO' for 82 secs.
2023-06-30T06:32:03.726709-03:00
LGWR (ospid: 48245) waits for event 'kfk: async disk IO' for 92 secs.
2023-06-30T06:32:13.774640-03:00
LGWR (ospid: 48245) waits for event 'kfk: async disk IO' for 102 secs.
2023-06-30T06:32:23.711611-03:00
LGWR (ospid: 48245) waits for event 'kfk: async disk IO' for 112 secs.
2023-06-30T06:32:33.790802-03:00
LGWR (ospid: 48245) waits for event 'kfk: async disk IO' for 122 secs.
2023-06-30T06:32:43.718692-03:00
LGWR (ospid: 48245) waits for event 'kfk: async disk IO' for 132 secs.
2023-06-30T06:32:53.774656-03:00
LGWR (ospid: 48245) waits for event 'kfk: async disk IO' for 142 secs.
2023-06-30T06:33:03.702753-03:00
LGWR (ospid: 48245) waits for event 'kfk: async disk IO' for 152 secs.
2023-06-30T06:33:13.782623-03:00
LGWR (ospid: 48245) waits for event 'kfk: async disk IO' for 162 secs.
2023-06-30T06:33:23.710653-03:00
LGWR (ospid: 48245) waits for event 'kfk: async disk IO' for 172 secs.
2023-06-30T06:33:33.789718-03:00
LGWR (ospid: 48245) waits for event 'kfk: async disk IO' for 182 secs.
2023-06-30T06:33:43.718594-03:00
LGWR (ospid: 48245) waits for event 'kfk: async disk IO' for 192 secs.
2023-06-30T06:33:53.806656-03:00
LGWR (ospid: 48245) waits for event 'kfk: async disk IO' for 202 secs.
2023-06-30T06:34:03.854815-03:00
LGWR (ospid: 48245) waits for event 'kfk: async disk IO' for 212 secs.
2023-06-30T06:34:13.790693-03:00
LGWR (ospid: 48245) waits for event 'kfk: async disk IO' for 222 secs.
2023-06-30T06:34:23.726686-03:00
LGWR (ospid: 48245) waits for event 'kfk: async disk IO' for 232 secs.
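
The 'kfk: async disk IO' event points to asynchronous I/O issued by LGWR through the ASM (kfk) layer to the cell storage that is not completing. The distribution of redo write latency on each instance can also be inspected through the wait event histogram. A minimal sketch, assuming the standard 'log file parallel write' event name; a large share of waits in the high-millisecond buckets would be consistent with the storage latency confirmed below:

select inst_id, wait_time_milli, wait_count
from gv$event_histogram
where event = 'log file parallel write'
order by inst_id, wait_time_milli;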

Oracle Support asked us to evaluate the latency to all cell nodes, and high latency was indeed confirmed, as shown in the logs below:

[root@dm02vm01 ~]# rds-ping -c 1 -i 5 -Q 0 -I 10.1.11.2 10.1.11.4
1: 11961 usec #********** High latency. ********
[root@dm02vm01 ~]# rds-ping -c 1 -i 5 -Q 0 -I 10.1.11.2 10.1.11.5
1: 10768 usec #********** High latency. ********
[root@dm02vm01 ~]# rds-ping -c 1 -i 5 -Q 0 -I 10.1.11.2 10.1.11.6
1: 54 usec
[root@dm02vm01 ~]# rds-ping -c 1 -i 5 -Q 0 -I 10.1.11.2 10.1.11.7
1: 179 usec
[root@dm02vm01 ~]# rds-ping -c 1 -i 5 -Q 0 -I 10.1.11.2 10.1.11.8
1: 70 usec
[root@dm02vm01 ~]# rds-ping -c 1 -i 5 -Q 0 -I 10.1.11.2 10.1.11.9
1: 67 usec
[root@dm02vm01 ~]# rds-ping -c 1 -i 5 -Q 0 -I 10.1.11.2 10.1.11.10
1: 69 usec

[INFO:2023-06-29-16:39:31] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 5 -I 10.1.11.26 10.1.11.12: 1: 5512 usec #High latency
[INFO:2023-06-29-16:45:36] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 5 -I 10.1.11.26 10.1.11.16: 1: 2035 usec #High latency
[INFO:2023-06-29-17:03:47] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 3 -I 10.1.11.25 10.1.11.17: 1: 1055 usec
[INFO:2023-06-29-16:25:22] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 4 -I 10.1.11.25 10.1.11.10: 1: 981 usec
[INFO:2023-06-29-16:45:36] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 2 -I 10.1.11.26 10.1.11.6: 1: 794 usec
[INFO:2023-06-29-16:45:36] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 4 -I 10.1.11.26 10.1.11.17: 1: 782 usec
[INFO:2023-06-29-16:45:36] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 0 -I 10.1.11.26 10.1.11.17: 1: 579 usec
[INFO:2023-06-29-17:11:53] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 2 -I 10.1.11.26 10.1.11.12: 1: 537 usec
[INFO:2023-06-29-16:31:26] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 0 -I 10.1.11.26 10.1.11.11: 1: 414 usec
[INFO:2023-06-29-16:17:18] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 5 -I 10.1.11.26 10.1.11.12: 1: 379 usec
[INFO:2023-06-29-16:43:34] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 5 -I 10.1.11.25 10.1.11.18: 1: 344 usec
[INFO:2023-06-29-16:19:18] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 2 -I 10.1.11.25 10.1.11.11: 1: 319 usec
[INFO:2023-06-29-16:25:23] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 3 -I 10.1.11.26 10.1.11.12: 1: 246 usec
[INFO:2023-06-29-16:13:16] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 0 -I 10.1.11.26 10.1.11.6: 1: 236 usec
[INFO:2023-06-29-16:45:35] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 4 -I 10.1.11.25 10.1.11.16: 1: 232 usec
[INFO:2023-06-29-16:21:20] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 3 -I 10.1.11.26 10.1.11.12: 1: 227 usec
[INFO:2023-06-29-16:25:23] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 2 -I 10.1.11.26 10.1.11.13: 1: 221 usec
[INFO:2023-06-29-16:17:17] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 4 -I 10.1.11.25 10.1.11.10: 1: 221 usec
[INFO:2023-06-29-17:05:49] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 0 -I 10.1.11.26 10.1.11.2: 1: 207 usec
[INFO:2023-06-29-16:31:26] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 0 -I 10.1.11.25 10.1.11.11: 1: 206 usec
[INFO:2023-06-29-17:09:51] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 5 -I 10.1.11.26 10.1.11.19: 1: 205 usec
[INFO:2023-06-29-17:03:48] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 0 -I 10.1.11.26 10.1.11.27: 1: 202 usec
[INFO:2023-06-29-16:27:23] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 4 -I 10.1.11.25 10.1.11.14: 1: 200 usec
[INFO:2023-06-29-17:07:50] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 3 -I 10.1.11.26 10.1.11.14: 1: 192 usec
[INFO:2023-06-29-16:59:45] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 2 -I 10.1.11.26 10.1.11.13: 1: 189 usec
[INFO:2023-06-29-16:51:39] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 4 -I 10.1.11.25 10.1.11.7: 1: 187 usec
[INFO:2023-06-29-16:27:23] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 5 -I 10.1.11.25 10.1.11.10: 1: 187 usec
[INFO:2023-06-29-16:47:37] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 0 -I 10.1.11.25 10.1.11.2: 1: 184 usec
[INFO:2023-06-29-16:49:38] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 0 -I 10.1.11.25 10.1.11.2: 1: 180 usec
[INFO:2023-06-29-17:09:52] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 0 -I 10.1.11.26 10.1.11.2: 1: 176 usec
[INFO:2023-06-29-16:23:22] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 3 -I 10.1.11.26 10.1.11.7: 1: 176 usec
[INFO:2023-06-29-17:09:51] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 4 -I 10.1.11.25 10.1.11.2: 1: 175 usec
[INFO:2023-06-29-17:03:48] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 0 -I 10.1.11.26 10.1.11.2: 1: 174 usec
[INFO:2023-06-29-16:57:43] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 0 -I 10.1.11.25 10.1.11.2: 1: 174 usec
[INFO:2023-06-29-17:11:52] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 0 -I 10.1.11.25 10.1.11.2: 1: 172 usec
[INFO:2023-06-29-16:49:39] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 3 -I 10.1.11.26 10.1.11.6: 1: 171 usec
[INFO:2023-06-29-16:59:45] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 0 -I 10.1.11.25 10.1.11.2: 1: 170 usec
[INFO:2023-06-29-17:01:46] numactl --cpunodebind=0 --membind=0 rds-ping -c 1 -i 5 -Q 5 -I 10.1.11.25 10.1.11.19: 1: 169 usec
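
The individual rds-ping calls above can be wrapped in a small loop to measure the RDS latency from one database node to every cell node address in a single pass. A sketch in bash, reusing the same flags shown above; the address list is illustrative and must be replaced with the actual private network IPs of the environment:

#!/bin/bash
# Local RDS address of this database node (assumption: 10.1.11.2, as in the output above)
LOCAL=10.1.11.2
# Cell node RDS addresses (illustrative list)
for CELL in 10.1.11.4 10.1.11.5 10.1.11.6 10.1.11.7 10.1.11.8 10.1.11.9 10.1.11.10; do
    echo -n "${LOCAL} -> ${CELL}: "
    rds-ping -c 1 -i 5 -Q 0 -I ${LOCAL} ${CELL}
done

Healthy round trips stay in the tens or low hundreds of microseconds; the values flagged as high latency above reached several milliseconds.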

After another detailed analysis with several Oracle support engineers, the action plan below was carried out:

1 – Restart all cell nodes, one at a time (a sketch of the typical restart sequence is shown right after this plan).

2 – After each cell node restart, run alter system switch logfile, then restart the next cell node and run the switch logfile again, and so on.
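
A hedged sketch of the rolling cell restart mentioned in step 1, following the standard Exadata procedure (the official steps are in the link at the end of the post and should be validated for the exact image version):

# On the cell node, confirm that taking it offline will not break ASM redundancy
cellcli -e list griddisk attributes name, asmmodestatus, asmdeactivationoutcome
# Continue only if every grid disk shows asmdeactivationoutcome = Yes

# Inactivate the grid disks and reboot the cell
cellcli -e alter griddisk all inactive
shutdown -r now

# After the cell comes back, reactivate the grid disks and wait for the resync
cellcli -e alter griddisk all active
cellcli -e list griddisk attributes name, asmmodestatus
# Wait until all grid disks report asmmodestatus = ONLINE before restarting the next cell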

After restarting the cell nodes affected by the high latency, the problem was resolved. Below, the switch logfile executed successfully:

SYS@prod2 AS SYSDBA> alter system switch logfile;

System altered.

SYS@prod2 AS SYSDBA> /

System altered.

SYS@prod2 AS SYSDBA> alter system switch all logfile;

System altered.

Link to the procedure for restarting the cell nodes:

https://docs.oracle.com/cd/E40166_01/html/E40167/z4001567132421442.html

Three months after the incident, Oracle published the note Hang on Logfile Switch (Doc ID 2924173.1).

Hang on Logfile Switch (Doc ID 2924173.1)
Last updated on SEPTEMBER 05, 2023
Bug 34775460 - HANG DUE TO LOG FILE SWITCH 