SQL Server Production Issues
Below are the common problems from production environment.
- SQL Server Log full alert
- SQL Server Blocking Alert
- Low Disk Space/Disk Full
SQL Server Log full alert
SQL Server log full
Log is full on TempDB or a user database. This is most often caused by a process has run longer than expected, has hung or performed a rollback because there was insufficient space in the database to proceed.
For User Databases
1. Verify drive space is not causing the problem.
2. Check SQL Agent jobs to see is one is causing the problem.
3. Locate running process verify if:
a. Are still running
b. Have completed already.
c. Are rolling back or have rolled back.
4. Check DB settings for caps on the files.
a. If a max size is set, expand the max size by a small amount to allow the process to finish running. Make sure that other files on the drive are not impacted.
b. If a max is not set and the drive is running out of space notify the customer of the findings and have them consider the following:
** Expand to another location. (Make sure to cap the file as well before expanding to
another location to prevent the file from taking all of the disk space left.)
** Grow the drive if it is on Storage Utility
5. If the following hold true attempt to shrink the log:
a. The process has completed or rolled back
b. There is free space on the Disk
c. There is free space to be gained in the log
1. Identify if TempDB is setup with multiple files.
2. Verify DB settings.
a. If there is a maxcap on the tempdb file and there is adequate, disk space increase it up a small amount to allow the process to complete.
IMPORTANT – If this TempDB is set up with several TempDB files you need to add a little space to each file equally to maintain the correct algorithm.
b. If there’s no cap set on the database.
** Verify drive space If there is a lot of space available than the process may have
already rolled back.
** Identify process using tempdb and what is driving them.
** Locate alternate location to expand files
(Use extreme caution when expanding to another drive. It will affect the other databases on the server. Try to expand only long enough to get the process completed and only to a drive other than E-H.)
As a last resort and only when SQL is negatively impacted by the condition consider requesting permission from the customer to restart SQL.
SQL Server Blocking
On occasion SQL server will throw alerts for blockage. A lot of times this blockage will pass on its own. In those cases let the customer know the blockage occurred and that it has passed. If it continues however this could indicate a larger problem.
Blocking often occurs because:
1. A SPID holds locks on a set of resources for an extended period of time before releasing them. This type of blocking resolves itself over time, but can cause performance degradation.(This type will often resolve itself however can cause a moving target of sorts as the blocking moves from one resource to another making this harder to troubleshoot overall)
2. A SPID holds locks on a set of resources and never releases them. This type of blocking does not resolve itself and prevents access to the affected resources indefinitely.(This type is fairly consistent and easy to troubleshoot normally)
Running SQLStat will verify blockage is still occurring after you receive the alert:
CONNECTING TO SQL SERVER:
TOTAL PROCESSES: 26
BLOCKED PROCESSES: 2
Once you have verified that the blockage is still occurring you need to identify the following:
1. The SPID causing the blocking.
2. The command being executed by the SPID.
3. If the SPID is still active.
4. What process is controlling the SPID.
You can either view things using Query Analyzer or Enterprise Manager. Since some tempDB blocking can prevent you from viewing the process’ through Enterprise Manager it is good to be aware of which query to run to get the same information.
From Query Analyzer you can run any of the following to get SPID information:
** SP_WHO2 – BlkBy column
** SP_LOCK – SPID Column
** SELECT * FROM MASTER..SYSPROCESSES
** SELECT * FROM MASTER..SYSLOCKINFO
** SELECT * FROM SYS.DM_TRAN_LOCKS – SQL 2005 only-
Use WHERE resource_database_id = clause for specific database.
A SPID value of -2 indicates an orphaned transaction.
Take these blockers and run the following to get the detail of what they are running:
Identify I/O consumption on long running tasks to verify still active:
SELECT SPID, PHYSICAL_IO FROM MASTER..SYSPROCESSES
Connect via Enterprise manager and view the following
1. Expand Group
2. Choose Server
3. Expand Management
For SQL 2000
** Expand Locks/Process ID
** Click Process Info
** Double Click the blocking Spid to capture the last command executed.
For SQL 2005
•Double click on Activity Monitor
** Scroll to the right and locate the “Blocked By” columns. These are your head
** Double click on Process ID to make note of what activity is causing the blocking.
** Note Process ID number for the head of the blocking chain.
Once you have identified the running command and verified it is still active look at some of the following:
If a SQL Agent job is spawning the command to run.
1. If the job is running over, notify customer of situation and to check
2. If the job is not running over, notify customer and monitor till expected job end
3. If this is a daily occurrence you may suggest they choose a different time to run their job when the server is under fewer loads and it will not cause blocking.
Not a SQL job but ad-hoc query – Notify customer and person running query of the problem and request they
1. Break any large, complex queries into simpler queries
2. Run the query during off hours or on a separate computer
Remember that certain SPIDs should not be killed or cannot be killed.
**Your own process – Cannot be killed
Those labeled with the following should not be killed:
** AWAITING COMMAND
** CHECKPOINT SLEEP
** LAZY WRITER
** LOCK MONITOR
** SIGNAL HANDLER
If blocking is persistent and an attempt to kill the SPID has left it hung you may need to either:
1. Restart SQL
2. Reboot Server
Run the following from a Query Analyzer to get an idea of how long a SPID may take to rollback:
KILL spid WITH STATUSONLY
In order to clear the issue. Make sure and discuss this with the customer first.
Disk Space Alert
Low Disk Space/Disk Full
Disk threshold alerts
There are two disk threshold alerts we are currently trapping:
1. Below 400mb (Sends email to MSE every 15 minutes)
2. Below 5% (Sends email to MSE every 15 minutes)
Drive Notations as Per General Production Standards
C Drive – All OS Files present.
D Drive – All the System Database files (log & data files) present in D:\MSSQL\Data folder.
E Drive – All the Database Backup files (Full & Differential) present in E:\MSSQL\BAK folder.
F Drive – All the Transaction Log Backup files present in F:\MSSQL\TRAN folder.
T Drive – All the data files of Tempdb database present in T:\MSSQL\DATA
H Drive – All the database files (only Data) present in H:\MSSQL\DATA.
O Drive – All the Log files present in O:\MSSQL\DATA. .
Identify the server type
Since we monitor several different types of servers, each with its own SKU, it is important to identify the type of server throwing the error to make sure you understand what each drive’s purpose is.
** File servers
** SQL Servers
** IIS Servers
** Servers on SANs or SU’s
** Look for the obvious files first:
** Look for failed jobs within SQL or Task Scheduler that may be causing the drive to fill.
** Check the Print Spooler if this server does a lot of printing. (A large print job that hung can cause the C$ drive to fill quickly. Commonly this folder is located
** Check for SMS Cache that has grown large.
Actions you can consider
For C Drive alerts:
** Contact DCOps for resolution.
For other drives:
**Contact the customer and ask them to remove any unnecessary files/folders.
** Compressing old files and/or the folder.
** Special Tape archive and delete.
a. Create a share where the data is located.
b. Create a Change Request for the Backup team to backup the newly created share to
tape and send offsite.
c. Verify successful tape save with Backups.
d. Delete data off server.
** Shrinking the files if they are from a SQL database.
a. Identify if there is available space on the database to shrink it down a bit.
** Delete completely. (Mostly refers to C$)
a. Old dump files from Windows and DrWatson.
b. Old Install bits from updates and upgrades.
** Move Print Spool to another drive.
** If this is on a Storage Utility, you may consider adding more drive space.
!! Always confirm!!
Always confirm with the customer before moving, removing or altering ANY of their data from a server.
Will be back with More probs and Sols…………………..!!!!!.