Written by Rob Risetto
Availability Group Heath Monitoring
Like most monitoring systems these days the SQLRDS monitor has a number of availability group (AG) health checks and alerts built in. However, what if you don’t have a monitoring system in place but you have deployed an availability group or BAG (if running Standard Edition).
Checking the health of the AG on a continuous basis is very important, otherwise, you may find your environment is not in a state to failover at that crucial moment.
You can still get alerts about your AG health and status, even if you are not using a full feature monitoring system like SQLRDS Monitor. This simple solution is based on the Database Mail component to send out alerts that are generated by :-
- An SQL Agent job that runs an AG health check script every 15 minutes. The script will alert if the AG is not healthy or if the Log Send Queue is greater than a threshold (20000 KB).
- SQL Server Alerts that checks for the change of AG roles i.e. a failover has occurred.
AG Health Script
Create a SQL Agent job on each server of the Availability Group configuration and run the job every 15 minutes. Configure the AG Health Check script below as a TSQL Job Step.
— Configure values
Declare @ToMail varchar(max) = ‘<Your Email Address>‘
Declare @DBMailprofile varchar(128) = ‘<Your Database Mail Profile Name>‘
— Standard Declares
Declare @ErrorMsg varchar(8000) = ”
Declare @QueryStr nvarchar(max) = ”
Declare @ag_replica_server varchar(128) = ”
Declare @database_name varchar(128) = ”
set nocount on
— Get Health Data
SELECT ag.name AS ag_name, ar.replica_server_name AS ag_replica_server,
ag_replica_role = CASE
WHEN ar_state.role_desc IS NULL THEN N’DISCONNECTED’
ELSE ar_state.role_desc
END, db_name(dr_state.database_id) as database_name,
dr_state.synchronization_state_desc, dr_state.is_suspended, dr_state.synchronization_health_desc,
dr_state.suspend_reason_desc, isnull(dr_state.secondary_lag_seconds,0) secondary_lag_seconds,
isnull(dr_state.log_send_queue_size,0) log_send_queue_size, isnull(dr_state.redo_queue_size,0) redo_queue_size
into #AGStatus
FROM (( sys.availability_groups AS ag JOIN sys.availability_replicas AS ar ON ag.group_id = ar.group_id )
JOIN sys.dm_hadr_availability_replica_states AS ar_state ON ar.replica_id = ar_state.replica_id)
JOIN sys.dm_hadr_database_replica_states dr_state on
ag.group_id = dr_state.group_id and dr_state.replica_id = ar_state.replica_id
where ar_state.is_local = 1 — check local server only
— used to put output in DBMail alert
Set @QueryStr= ‘SELECT ag.name AS ag_name, ar.replica_server_name AS ag_replica_server,
ag_replica_role = CASE WHEN ar_state.role_desc IS NULL THEN N”DISCONNECTED” ELSE ar_state.role_desc END,
db_name(dr_state.database_id) as database_name, dr_state.synchronization_state_desc, dr_state.is_suspended,
dr_state.synchronization_health_desc, dr_state.suspend_reason_desc, isnull(dr_state.secondary_lag_seconds,0) secondary_lag_seconds,
isnull(dr_state.log_send_queue_size,0) log_send_queue_size, isnull(dr_state.redo_queue_size,0) redo_queue_size
FROM (( sys.availability_groups AS ag JOIN sys.availability_replicas AS ar ON ag.group_id = ar.group_id )
JOIN sys.dm_hadr_availability_replica_states AS ar_state ON ar.replica_id = ar_state.replica_id)
JOIN sys.dm_hadr_database_replica_states dr_state on ag.group_id = dr_state.group_id and dr_state.replica_id = ar_state.replica_id where ar_state.is_local = 1′
If exists (select * from #AGStatus where synchronization_health_desc <> ‘HEALTHY’ or synchronization_state_desc <> ‘SYNCHRONIZED’ or is_suspended = 1)
begin
select top 1 @database_name = [database_name], @ag_replica_server = ag_replica_server
from #AGStatus
Set @ErrorMsg = ‘AG Issue Found – Databases on ‘ + @ag_replica_server + ‘ are NOT Healthy’
EXEC msdb.dbo.sp_send_dbmail @profile_name = @DBMailprofile, @recipients = @ToMail, @subject = @ErrorMsg, @Query = @QueryStr, @attach_query_result_as_file = 0, @query_result_width = 1024
end
— Check log send queue size – if bigger than 20MB then alert
If exists (select * from #AGStatus where log_send_queue_size >= 20000 )
begin
select top 1 @database_name = [database_name], @ag_replica_server = ag_replica_server
from #AGStatus
Set @ErrorMsg = ‘AG Issue Found – Databases on ‘ + @ag_replica_server + ‘ have a LOG SEND QUEUE >= 20000’
EXEC msdb.dbo.sp_send_dbmail @profile_name = @DBMailprofile, @recipients = @ToMail, @subject = @ErrorMsg, @Query = @QueryStr , @attach_query_result_as_file = 0, @query_result_width = 1024
end
— End AG Health Check
SQL Server Alert Definitions
The following script will define 3 x SQL Server Alerts to determine if an AG role change has occurred or if data movement was suspended or resumed. The script also adds the Alerts to the specified SQL Operator for email notifications.
Note Database Mail, the Alerts and SQL Operator need to be configured on each server of the AG configuration.
— 1480 – AG Role Change (failover)
EXEC msdb.dbo.sp_add_alert
@name = N’AG Role Change’,
@message_id = 1480,
@severity = 0,
@enabled = 1,
@delay_between_responses = 60,
@include_event_description_in = 1;
GO
EXEC msdb.dbo.sp_add_notification
@alert_name = N’AG Role Change’,
@operator_name = N’<YourSQLOperator>‘,
@notification_method = 1;
GO
— 35264 – AG Data Movement – Suspended
EXEC msdb.dbo.sp_add_alert
@name = N’AG Data Movement – Suspended’,
@message_id = 35264,
@severity = 0,
@enabled = 1,
@delay_between_responses = 60,
@include_event_description_in = 1;
GO
EXEC msdb.dbo.sp_add_notification
@alert_name = N’AG Data Movement – Suspended’,
@operator_name = N’<YourSQLOperator>‘,
@notification_method = 1;
GO
— 35265 – AG Data Movement – Resumed
EXEC msdb.dbo.sp_add_alert
@name = N’AG Data Movement – Resumed’,
@message_id = 35265,
@severity = 0,
@enabled = 1,
@delay_between_responses = 60,
@include_event_description_in = 1;
GO
EXEC msdb.dbo.sp_add_notification
@alert_name = N’AG Data Movement – Resumed’,
@operator_name = N’<YourSQLOperator>‘,
@notification_method = 1;
GO