Repairing Suspect SQL Database
I'm 
not a DBA but I have a real life experience repairing a suspect databases twice. I guess I can say that I was 
unlucky at that time as I was presented in a tough situation where I was still 
in the process of learning how things work and we didn't have a DBA resource 
(remember I am a developer), but still consider it lucky as even some 
experienced DBAs might not have even experienced this situation. I experienced a 
similar situation before, and restoring it was easy before as we have, but that 
was easier as I have the proper restore points.
Now 
in this situation, the backups had failed so the last clean backup we had was 2 
weeks old, so I could only rely on that one plus the current transaction log 
which is not backed up for now. So the first thing that came to my mind was to 
backup the transaction log so I could easily restore the full lot. I tried to 
backup the transaction log, but I had to remove the suspect status first by running this TSQL:
sp_resetstatus 
Now 
it was not a suspect anymore, I was hoping to run the backup of 
transaction log by using this TSQL:
BACKUP LOG 
TO DISK='X:\YourFolder\YourFileNameLOG.bak', 
But 
to my surprise, it did not work as the database went back to a suspect status. I repeated the process again 
hoping that it would work but got no luck out of it, with further research 
its suspect status will always go back when the 
corruption is severe, so the next thing that came into my mind was that the disk 
might fail more so I have copied everything to a remote disc (MDF, LDF and the 
Backup), so I have to kill all connections to it and stop the database so that I can copy the files properly. 
After waiting for a long time around 15 minutes (remember the file is huge), it 
presented me an error which I cannot further, I cannot remember the exact error 
anymore but it's similar to copying files from a badly scratched CD or DVD, the 
error is in the lines of an I/O error (Oh sh*t!). So I went ahead and checked 
the event logs and to my surprise, here is what I got!
Everything 
is in red nearly all 10K events, don't ask me why this happened without someone 
noticing (remember I am the developer).
The 
message is almost the same from event to event, the HEX part is almost 
repetitive, and looks like it's using the same 20 values all over again.
Error: 823, Severity: 24, State: 3
I/O error 2(error not found) detected during write
at offset 0x000001d527a000
in file 'E:\Program Files\Data\MyDatabase.mdf'.
So 
moving ahead, the first thing that came to my mind was to check the disk and 
defragment as it might help as the file system would be repaired if there were errors that are repairable and I was hoping that the data would be 
moved from the offset mentioned logs to a better location on the drive. It took 
nearly 6 hours doing it as the data was large and drive was large, but after 
that I now tried copying the files I mentioned above and it worked, now I have a 
clean copy in a good disk of the MDF, LDF and the Backup which gave me peace of 
mind, as I know I can restore from those (but using those is not easy as the LDF 
is not a backup copy).
From 
there knowing that the disk was repaired by the checkdisk and defragmentation, I 
can now backup the transaction logs and the database which did not give me much problems 
anymore. Also having that in mind, I can now perform a full backup of the database which I can restore on a new server. And 
this is the TSQL I used:
BACKUP DATABASE 
TO DISK='X:\YourFolder\YourFileNameDB.bak', 
That 
worked as well, so the checkdisk and defragmentation really helped.
Now 
I have a real backup, next is to restore it on a new machine which again did not 
give me a lot of issues until…
Once 
the applications start connecting to the database, I checked the logs and to my surprise 
the same errors (Error: 823) appeared, now I tried contacting some of the DBAs I 
know and they could not give me a convincing answer, so I just trusted my gut 
feeling and did it on my own with the help of Google. Now I found out that the 
error might be a remnant of the old database which was resorted in that state so the 
best thing was to repair it before applications start using it so 
I repaired it using this TSQL Command:
USE [master]
GO
ALTER DATABASE  SET SINGLE_USER WITH ROLLBACK IMMEDIATE
GO
ALTER DATABASE  SET SINGLE_USER
GO
DBCC CHECKDB , REPAIR_REBUILD) WITH ALL_ERRORMSGS, NO_INFOMSGS
GO
ALTER DATABASE  SET MULTI_USER
GO    
So 
what it does is, you set the database to be used only by you then you repair and rebuild the indexes, suppressing all 
information messages except for errors, then set the database back for multi user use.
There 
are a lot of repair options and this is the most time consuming 
one so you need a lot of patience for at least 6 hours for a huge database, there are some other options and here 
they are defined.
Here 
is the complete list of options:
- REPAIR_REBUILD – Performs all repairs done by REPAIR_FASTand includes time-consumingrepairs while rebuilding indexes. This will not have data loss and this was my first option which worked for me.
- REPAIR_FAST – Performs minor, nontime-consuming repair actions without rebuilding the indexes and will not have a data loss as well.
- REPAIR_ALLOW_DATA_LOSS - Performs 
all repairs done by REPAIR_REBUILDbut you allow data loss as the irrepairable corrupted text will be deleted, this was my last resort.
So 
after running that script, here is my result:
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, 
 alloc unit ID 0 (type Unknown), page ID (1:2425877) 
 contains an incorrect page ID in its page header. 
 The PageId in the page header = (0:0).
 The error has been repaired.
CHECKDB found 0 allocation errors and 1 consistency errors not associated 
 with any single object.
CHECKDB fixed 0 allocation errors and 1 consistency errors not associated 
 with any single object.
Repair: The Clustered index successfully rebuilt for the object 
 "dbo.MySampleTable" in database "MyDatabase".
Repair: The page (1:2425877) has been deallocated from object ID 462321894, 
 index ID 1, partition ID 311773704355840, alloc unit ID 30298727645184 
 (type In-row data).
Msg 8945, Level 16, State 1, Line 1
Table error: Object ID 462321894, index ID 1 will be rebuilt.
 The error has been repaired.
Msg 8928, Level 16, State 1, Line 1
Object ID 462321894, index ID 1, partition ID 311773704355840, 
 alloc unit ID 30298727645184 (type In-row data): 
 Page (1:2425877) could not be processed. See other errors for details.
 The error has been repaired.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 462321894, index ID 1, partition ID 311773704355840, 
 alloc unit ID 311773704355840 (type In-row data). 
 Page (1:2425877) was not seen in the scan although its parent (1:2425841) 
 and previous (1:2425876) refer to it. Check any previous errors.
 The error has been repaired.
CHECKDB found 0 allocation errors and 2 consistency errors in table 'MySampleTable' 
 (object ID 462321894).
CHECKDB fixed 0 allocation errors and 2 consistency errors in table 'MySampleTable' 
 (object ID 462321894).
CHECKDB found 0 allocation errors and 3 consistency errors in database 'MyDatabase'.
CHECKDB fixed 0 allocation errors and 3 consistency errors in database 'MyDatabase'.
Looks 
good from that point as it repaired any errors it found, so the last thing to 
do is backup your spanking shiny new repaired database in case something happens and create a 
good maintenance plans and backup.
So 
to summarize, here are the steps to follow in repairing the suspect database if the culprit was an I/O error:
- Kill all connections and stop SQL Server
- Perform a Checkdisk with repair option
- Perform a Defragmentation
- Now copy the files necessary (MDF, LDF and Backups)
- Perform a backup
- Restore the backup
- Repair the Database (DBCC CHECKDB)
- Backup the repaired DB
- Have fun!



 
 
No comments:
Post a Comment