So Jason Massie has called me a many other out on this recent post. Ha!
Of course looks like Chris Shaw was the real instigator.
So, he wants us each to share a stupid mistake we’ve made regarding our roles as DBA’s. For myself, I certainly use the term DBA very lightly. Anyhow, that post got me thinking, what was the biggest stupid mistake I’d made over so many years working with DB’s that others could learn from?
Well Jason already hit an obvious one about always having a reliable backup. That’s bitten me a few times, mostly because the network backup solution had “good” backups, only to find out that I needed to restore (rollback) and the backup was actually bad. Yikes. Disk based backup has been more reliable for me, but even today, before I make a DB schema or large data change you’ll find me making a manually backup of the db to some other location.
So, what else could I confess to doing? Well here’s another one, related to backups, but really related to DBA diligence. Not too long ago, I found myself in a bit of a bind. Our SQL server instances have been humming along swimmingly for months. I’ve been heads down on some other not DBA related projects. Peers are working with our CRM solution doing mailing address validation through a third party. Typically export/validation/import type stuff.
I get called over after a few days of overheard head scratching. We talk about the data and potential data corruption that may have taken place a few days ago. Run some queries to identify the extent of the damage. Ouch! Problems are definitely there.
No problem, we’ll get the backup of the DB from 2 days ago, and do some data fixing between the old and new datasets. You know the story. Open SQL Server Management Studio and check out the last backup date/time and transaction logs….WHAT! 3 months ago? Yup, there was an application running on the network that was holding open connections to the database even when it wasn’t doing anything.
“Bad programmer”, I thought knowing good and well it was something I had slapped together to answer a need. Silly me. Well, the DB backups were part of the shared maintenance plan and it was setup to do database checks and fix issues prior to backups. Wow, really? Yup.
The maintenance task could not take the db into single user mode to correct an issue 3 months ago. That had been causing the job to fail and the backups not to run. UGH!
What’s the big morale of the story? Diligence! I had gotten away from the basic of tasks, daily care and feeding of our environment. Sure, if it was a 2005 or 2008 environment I would have had those darned error emails, but we all know what a pain mail was on 2000 so nobody had ever set it up.
You will now find a three ring binder on my desk with a daily checklist in it. Believe me, all DB’s, jobs, etc… are looked at almost every morning now. Don’t get bored of the mundane, it will bite you.
So, there is my humbling story just for you Jason!
Blessings.
Popularity: 10% [?]
At Pythian (www.pythian.com) we’ve taken the “daily checklist” one step further and automated it — we have checks to make sure that everything is as it should be, especially for answering questions like “did the backups run?”
Our daily checklist is actually a checklist on a web page. There’s a checkbox for each machine on each client with a status next to the checkbox.
If the status is “OK”, it means all the checks have passed. If the status is “WARNING” or “ERROR”, it means one or more of the checks has problems.
Every weekday we “do the dailies”, and we don’t check the checkbox until we have investigated the problem, or at least made a ticket to investigate the problem.
This particularly helps for the “OK” checks — we don’t have to look at anything, and when 2/3 or more of your machines have a status of “OK”, it saves tons of time.
And when the remaining 1/3 or fewer come out as not OK, it’s easy to see what’s not OK. Maybe the backup failed, maybe the available free tablespace is starting to get low (we page when it’s really low, but our daily checks catch it when it is getting there, so we have plenty of time to fix it if we need to — adding tablespaces in MySQL requires a restart).
Yes, that’s a good point. We’ve actually started using the SQL DBADashboard on some of our servers as a starting point.