Name: Stop Firefighting: The DBA Burnout Crisis
Uploaded: 2026-03-04T17:16:08.647Z
Duration: 1 h 8 s
Description: Stop Firefighting: The DBA Burnout Crisis

Transcript for "Stop Firefighting: The DBA Burnout Crisis": Hi, everyone. Thank you for joining me for the high performance DBA, Beat the Burnout Crisis. My name is Kevin Klein. I'm a longtime fixture in the database industry. I've written some some very popular books like SQL in Nutshell, and I'm a bona fide expert. I've been awarded the Microsoft MVP accolade 18 times over the years in the data and analytics platform, similar on the side of AWS. And the main thing I want to express is that I love to hear from those of you who listen to my presentations. So reach out to me on social media at any of those links shown there. All of them are K E Kline, as well as send me emails at kevin. Kleinsolarwinds dot com. So today we are going to take you through this phenomenon that's getting more pronounced for those of us in the data and analytics world, which is burnout. We are getting so burned out, and it's very, very stressful and difficult for us. And the thing about it is that there's also a lot of mismatches between what we as practitioners of data administration know and deal with and what our executive leadership know and deal with. They don't always come together. There's some, there's a gap in there. So we'll talk about that by the numbers and, and we'll dive into some of the research that we have performed recently with some surveys and such, and show you what is going on. So what we're talking about today is the DBA burnout crisis. I'm going to spend some time and take you through the numbers of some research that we have recently conducted, over 1,000 participants in our study, as well as the second half of our presentation will be about techniques you can use to make your life less stressful and make that burnout situation much less likely to happen. So with that, let's dive in. Let's take a look at the numbers here. What we're seeing in this large survey, large and small companies, Fortune 500 down to small mom and pop shops, is that there's a divide between cloud based database administration and on premises database administration. For DBAs, it's a lot more stressful if you're living and working in cloud based databases because for decades, capital and budget were not part of the DVA's daily routine. Cost optimization was not something you thought about every day. But in the cloud, every cycle, every IOPS costs. And so that's extremely stressful when you never had to deal with budgets and costing and cost optimization. And now your company moved to the cloud and suddenly that's all they talk about is what that monthly bill ends up being. So cost overruns really add a lot of stress to our lives as DBAs for cloud environments. We find that the cost overruns occur primarily in areas like operations, backup and recovery. A big difference between on premises and cloud, for example, is that every backup that you take and retain, it costs as much as the first backup. Whereas on premises, we could have multiple copies of backup files, and it doesn't add to our cost because we've already paid upfront for all of that storage, but not in the cloud. So it's costly and it adds, it accumulates in ways that we're not used to as DBAs. Now, intuitively, most of us as DBAs also know that bad SQL and bad database design, for example, choosing bad data types, choosing inappropriate indexes, or not having the right indexes, any of those kinds of things means that your database applications have to work harder. And because of that, more CPU, more IOPS costs a lot of money. Right? So whereas no added cost on premises for bad performing SQL, because you've already paid all your costs up front by buying the server, buying licenses, the database and so forth. It's an expensive proposition. And it's also something that DBAs and IT executives don't see eye to eye on. There's these cost overruns are often the kind of thing that is, you know, a significant issue. And the managers, the executives are curious. Why as you, you as a DBA aren't managing this well? And it turns out it's, we've never managed it well. We never had to manage it at all for that matter. So it's, it's significant. We're not able to do the things that add value to the company when things are broken or costing too much or we're fighting fires all the time. We're basically just coping with technical debt. That in itself is a big deal, a major kind of deduction on our ability to provide value to the business. The other thing that is super, super significant, and many DBAs live in this world, is that all we have time to do is fight fires. In our survey, we found that about twenty seven hour the average DBA spent twenty seven hours per week out of forty hour work week just being reactive. We're dealing with trouble tickets, we have alerts going off, performance isn't good, or somebody calls up and says, you know, I need you to fix this problem. Performance is terrible for us running our reports. Maybe we have AGs that are failing over unexpectedly, and that's impacting performance. So there's a number of things that we are dealing with that is reactive rather than proactive. And it's a difficult time. We also see that with all of these tickets being fired through automated mechanisms, and we see all of these requests from our users coming in through those automated ticketing systems, that they're not smart in that we haven't built them to prioritize different kinds of requests. And so every request that comes in is equally important, which means that we spend probably 75% of the DBAs out there are really, really struggling with alert fatigue and noisy alerts, alerts that don't improve the situation. They're not actionable. They're not informative. And so this is a huge issue. 49, almost 50% of the DBAs out there say that they get so many alerts and trouble tickets and requests to fight fires that their alert fatigue is severe. And this is again a very big issue. We are exhausted. And if you ask DBAs, what's the number one thing that you're not thrilled about in your job? Most of them will tell you pager duty because we are getting awakened all the time in the middle of the night. And that is a miserable situation to be in. You have the ball game with the kids, and now you have to get up and leave and connect to the network and try to figure things out. And this falls pretty, pretty heavily into an area about the tyranny of the urgents, of the urgent. There was even a book written about this, and it's something that is pretty well researched, where you're so busy reacting to these fires that pop up that it's, it's a tyrant. It orders you to do other things that are important, but not urgent. And so what we find is that there's in our studies have shown that we have two kind of categories. We have leaders who are able to monitor their entire environment and have a really cohesive, unified view of everything that's happening in their environment. They spent some time weeding through all those alerts and, getting rid of the noisy alerts that don't really provide actionable information or something you can respond to. But on the other hand, we also have laggards. And laggards are, as the name implies, they are not doing things quite as well as the leaders. They're always responding to the urgent, but they are not dealing with what's important. Okay. So on the previous slide, I mentioned technical debt. Well, so much of the time, if we can retire technical debt and those decisions that were made under urgency that, oh, we'll just do this real quick shortcut and kind of build the kludge because we got to get this done quickly. Well, sooner or later, you need to pay that debt back, right, and build a real solution, not some kind of patched together kludge. And so we find that laggards allow kludges in the first place, and then they grow and grow and grow over time. So it's a situation where by always being under the tyranny of the urgent, we never get the important stuff done. And so as DBAs, our role is keeping things afloat, but it is not actually enabling us to sail the ship better or to, you know, to carry that analogy forward. We're not adding value. We're just keeping it running. And that's very stressful. You know, there's a lot of data in the report as well that says that if you are able to get these factors under control, you actually will perform much, much better for the business, yes, but you're also under a lot less stress. Your job feels better to work in. You feel more satisfied in what you do. And so, it's one of those things that it changes everything about the job. Another thing that we found that, is really important to DBAs today, and it's one that upper management is missing out on. They do not connect with this idea, which is I need training. We as DBAs and technology professionals live our lives with a version number. And so if I'm still really, really focused on SQL Server 2019, well, that's six, seven years old, right? So I'm missing out on a lot of new features and capabilities. But because I'm so busy, I can't get up to speed on the newest stuff. I can't learn about generative AI and inform management about better options that we might have than what we're currently now. So we can't add value as long as we are dealing with all of this stuff that's really urgent, but not truly important. And so, there's a whole decision matrix around us, around how we can get better at moving into that leader quadrant. And to move into that leader quadrant, management needs to give us as DBAs time to train and time to work on the things that are important, but not quite as urgent. For example, maybe doing a code review, of all the code that we have out there and doing some analysis to find out which SQL statements are performing the worst, and let's fix those even though nobody called to complain. There was no fire going, going and burning down a server, for performance reasons at that time. Right? So that's where we want to move to where we're able to work on things that are important, but aren't as urgent. And of course we always want to deal with those things that are urgent, but we want to prepare ourselves in such a way that we make it easier. That's what the second half of this session is about. Some of those steps we can take that will make our lives so much less stressful. And I know from firsthand experience that I have implemented many of these things in the big shops that I worked at and where I led the data management teams, and they do make a really big difference. I can testify to that. So in the next section, I'm going to be teaching you some techniques and I will also be switching over to show you some of the demos of the capabilities, either in our tools, or maybe I'll show you some free tools that will make this a lot easier as well. When I talk about our products in the database space, I'm going to be talking about three products in particular. We have DPA, Database Performance Analysis. And this is the product that we typically lead with. If you were to call us up and say, I need database monitoring. What it does is it monitors over 30 different database platforms, and it is oriented around troubleshooting analytics and performance root cause analysis for databases from the standpoint of weight statistics first. Okay. The idea is weight statistics put us into the exact area of the ballpark where we need to fix problems such as say, lock contention. And then from there, we can drill down to the individual SQL statements and other issues. Maybe we're missing an index, for example. Our other major tool that I'll be showing you some demos from is called SQL Sentry. SQL Sentry is our very, very deep product for the Microsoft data platform. So it's very deep on SQL Server, Azure SQL. It covers analysis services and Azure Synapse, the the whole ground to cloud sort of stack for Microsoft. And it primarily leads with a telemetry point of view. So we'll show you all the performance telemetry first, as well as weight statistics. And then from there, you can drill down to see what is the worst performing SQL or other operations like that. And our third product is our SaaS product, which is part of our SolarWinds observability platform. It's called Database Observability. And so this is also a multi platform monitoring product. It also covers more of the NoSQL database platforms too, if you're interested in that. It's a SaaS first orientation. And so it is also able to provide you a wide view across all your databases, as well as give you all of that normal sort of kind of monitoring capability that you would expect with things like query performance. And if there are any particular problems with weights or locks or blocks or deadlocks. So let's talk about the burnout crisis. And one thing I do want to point you to is that we do have all of our demos online. So if you want to see SQL Sentry online, DPA or DBO online, you can go to the interactive demos page and see all of those. In addition, the SQL Sentry demo I'm going to show is our thick client. If you were to go to interactive demos right there, that would show you the thin client, which is a little bit different in the way it looks. But all the features of that, in fact, there's a few extra features on the on the web client that I really like. So where did I learn all these techniques and tips and tricks? Well, there's something called the Capabilities Maturity Model. And I was working for the Department of Defense when they developed it for the very first time in cooperation with Carnegie Mellon University and their software engineering institute. And what they did is they basically characterized all the ways that the different vendors that provided software services to them, where they failed and where they succeeded, and what characteristics of maturity they displayed to make them much more likely to complete projects on time and in budget. And so what we have is five levels of maturity. And at the level one, it's chaos. Anybody who wants to could call you up and say, Hey, fix my problem. They might call up two or three people on your team. They might go to their boss and then their boss comes to you. And there's no documented way that this is supposed to be done. There's no processes. Everything kind of happens off the cuff. And that is absolutely the world of tyranny of the urgent. Everything is urgent all the time. Once we get to level two, this is where we start getting better and we start doing things differently. For example, in order to get control of your environment, you absolutely have to have monitoring. It doesn't have to be hours, but it's, you got to have some kind of monitoring. There's open source tools, for example, that you could use. But if you don't have monitoring, you can never truly know what is normal. And if you don't know what is normal, whenever you react to a fire, you're doing it from a point of ignorance that you don't know if these, if it's actually a situation where things are normal or if they're truly abnormal. And so that's one of the major things you have to do. Another thing you have to do to get to that level two where you're getting past reactive is you have to have an inventory. You have to know all the servers that you have to support. You have to know at least how to do basic alerting. Maybe I'm running out of disk space. Maybe I am having CPU at 100. So you've to get those basic sorts of alerting capabilities in place. From there, you start to build on your alerting capabilities. So now we can automate because we know we can run a whole bunch of jobs between 3AM and 6AM because the system is so quiescent. It is so quiet that we can schedule all our jobs to run at that time. But we wouldn't have learned that without monitoring. That's how maturity models work. You get one practice or process down, and then you add onto it to make you more mature and more capable. Maybe you've got a trending analytics, so you can tell what time during the year would be best to deploy a major new technology or information system or something like that. You start to do change management, configuration management. And then once we get to level four, this is a really well organized, DBA team. And at this level, we can say, oh yeah, I can sign that SLA. Or you can look at an SLA and say, Well, there's no way I can bring this server back in thirty minutes if you're only going to give me a budget of X, Y, Z. I need to have redundancy. I need to have high availability to meet your SLA. But you can say that because you have gone through the earlier processes of maturity in which you've monitored and you know things about your internal environment. You've built processes and analyzed your, trends over time. You have seen because you are doing database restore testing, you've seen that it takes you at least an hour to get that back up. So at level four, we have learned all of those things and built upon those so that we can understand how our business works. And then finally, once we get to level five, which I think is where we all want to be as DBAs, this is where you get a seat at the table with the business decision makers. And you're considered a very valuable partner to the business because every time you sit down at the table with them, you're adding value to the conversation. So the business people might be saying, Oh, well, we need to build a new WordPress website to handle this and such kind of interactions with our customers. And you might say, Wait, wait, wait a minute. That's old school technology. Why don't we use this new technology? And they will look at you and they'll say, well, we trust this person who manages all of our database technology, and this is a good recommendation. We at least have to listen. So you're now in a situation where they listen to you and want to hear what your opinions are. So it's a really good place to be in. Today, however, the techniques I'm going to focus on are how to get us into level two and level three. And so once you get these things in place, you're really going to be performing better. So the first thing that we have to do is ensure consistency and standardization. The approach is kind of cookie cutter. But at the end of the day, we as DBAs need to be able to provide predictability and stability. It's better for us as DBAs to be consistently B plus or A minus in terms of the services and capabilities we provide to our end users, our business users. It's better for us to handle things in that way than it is for us to be occasionally spectacular and occasionally not good because people only remember the bad parts about how how you executed on things. And so one of the first things that we have to do is make sure that all of our servers have a specific workload capability are identical. We don't want any unique servers if possible, unless we have one application and database that is far different than everything else. But if that's the case, then we have that one special buckled down SQL Server. It's got its own characteristics and everything else is standardized. We want to know that all the patches are the same. All the version releases are the same. The BIOS is the same. The HBA adapter drivers are the same. The servers are all the same brands, same memory, same CPUs, and so forth. That eliminates just a lot of time during an RCA session or root cause analysis session. Also, if you haven't ever done it, you're going to want to build an inventory, right? This is, again, one of the things that help elevate my performance as a DBA, my team's performance, by building out that kind of information and having it available for track back. And there's a couple of ways you can do that built in. When you're using SQL Server, it's harder on other database platforms unless you have an external tool to do that. In SQL Server, we have both the centralized management service and we have policy based management. In older versions of SQL Server, we also had something called desired state configuration. That was deprecated at one point, but there is an ongoing active project on GitHub for SQL Server Desired State Configuration Management as well. And these are three different techniques you can apply to make sure all of your servers are consistent and standardized. The thing that I say to my teams is if we have to RDP into a server to check on some of its behaviors, some of the details that dictate those behaviors, like what's its version number, that's wrong. We've already made a mistake that we don't have that listed somewhere. And, you know, I want to reward my teams by making them able to work on high value stuff. So wasting 30 to look up, you know, what's the version on these three servers? That's, that's kind of wasted effort. We should already have that prepared. We should have that automated. So it's kept up and maintained without us ever having to intervene. Now, one of the things too, that we we also have to be aware of, I mentioned predictability and stability are features. Another thing we deal with a lot is DBAs actually like being heroes. There's plenty of DBAs out there who have enjoyed that late night call because in the morning they'll know that their boss is going to say, Man, you did a great job. So glad you came into the office and fixed everything else. However, in my opinion, we have to kill that culture, the hero's culture, where you swing in like Tarzan on a vine and you make things better. Instead, I like the silence culture, right? The idea that a quiet week is a good week. And so I would do little things for my team where I'd say, you know, I'd have an award for a quiet week. And when we'd accumulate these quiet week points, then would use them for a group reward of some kind. Sometimes I'd pay out of my pocket and we'd go to lunch or something like that. So if the patriot doesn't go off, that's success. Right? And so we want to we want to reward that behavior. We didn't have to log in over the weekend. And finally, one of the things, too, that we need to make sure that we do is that we conduct postmortems after we do have an incident, if you will. And the postmortem should include our people and processes, not just the particular technology alarm that fired. You know, after a late night incident, we don't want to just ask why the log drive filled up. We want to ask the team members whose sleep was impacted and say, you know, what would you need? What changes would we need to implement so that you didn't get woken up in the middle of the night? And that's where we start to get away from that hero culture and start to move into a culture of the more silent we are in our ability to maintain. That quiet week sort of mentality is a better indication of how successful we are at our job. Okay? So when it comes to ensuring stability and predictability, we need an inventory. We need to know everything that is happening across our enterprise that we are responsible for. One of my favorite tools, which is often said to be deprecated by Microsoft, is the Microsoft Assessment and Planning Toolkit. However, I keep seeing it come back. The most recent update was in 2024, so it's not that far out of date. It's a simple executable that you download. It stores a ton of information into Excel spreadsheets, and it tells you everything about your different servers. So it even tells you about SharePoint servers and Exchange servers and Windows servers. But with SQL Server, you get a full inventory of everything that's happening inside of that SQL Server from patch and hotfix numbers all the way up to table designs, database options, and things like that. Very, very powerful. Still around. Might be a deprecated in favor of Azure Migrate Toolkit in the near future, but we can still use it. Another tool that I very much enjoy is called SQL PowerDocs. There's two versions. The original version written by my friend, Kendall Van Dyke at Microsoft. And this version is a little bit more updated by Sheep Reaper on GitHub. It's been updated as recently as three years ago. And so what this does is it runs PowerShell commandlets across your enterprise. We'll find all of the different SQL servers out there and collect that information for you in an inventory. With both of these, you need to be aware that it looks like a virus trying to attack your enterprise. So go and tell your network admins that you are going to run a scan to collect some inventory information. Otherwise, you could get in trouble once they trace it down to you. I also mentioned the DBA, dbatools.io, and there's even a full chapter in a book by Manning press called Learn DBA Tools in the Month of Lunches. And it shows you a lot of amazing information you can get out of DBA tools and it provides a kind of a scripted framework for you to do all of that. There's also just a couple of very easy PowerShell commandlets you could invoke yourself if you wanted to do that collection without using what's recommended in the book by Manning. The next thing we need to do is we need to implement and take advantage of all the built in alerts that are available to us in our database system. SQL Server is especially rich in this regard, but we also need to own and manage those alerts in the same way that we do other kinds of assets in our data estate. Because if we don't do that, we get noise. And remember what I mentioned earlier? Seventy five percent of DBAs said that they struggle with alert fatigue. It's a silent killer of DBA sanity. And if you get 200 emails a day from SQL Agent and you're ignoring all but 10 of them, that's 190 emails too many in a given day. We need to own and manage this. And alerts, you know, are like the boy who cried wolf. If he cries wolf too many times, he is ignored so that when the wolf finally does show up, we're not paying attention anymore. So in this circumstance, we do have the SQL Agent alerts that are built in. This is well documented. There are many blog posts about this. So in SQL Server Agent, you can enable it to send you an error message, an alert, anytime an error appears. Some people debate, some people say you should start at severity level 18. I actually go from 16 on up now. I used to always recommend 18, but now I've changed my mind. I actually recommend at least 17, and sometimes it's good to keep track of level 16 severity errors all the way up to level 25, which is the most extreme sorts of alerts. Now, I do different things with those. So I don't actually send myself emails unless they're a severe alert number or alert severity level. So a lot of those, I just want to know if they're happening and how much and when they're happening. And so I can correlate those back to individual processes inside of SQL Server. You can also do all of this with extended events. And I'll show you a little script I wrote in a moment that would enable you to live entirely without SQL Agent Alerts. It's all done out of SQL Extended Events. And you can also do some additional cool stuff with it too. Again, the idea is to script everything you can automate, everything you can. And so if you are going into SQL Agent and Management Studio and enabling all of those clicks, all of those severity level errors by clicking through the UI, you're probably doing it wrong because we make mistakes, we get tired, and we fat finger things. Whereas if you have a working script, once it works, pretty much always works even with new versions of SQL Server. Also, have so many alerts, as I said, where we want to know if they happen, but we don't need an email for them. So, how we respond to an alert is very, very important. Sometimes alerts should just be tracked in a table. For example, a SQL Server gives us MSDB database that has that kind of information that we can query. But also we may want to track for trends and things like that. So you can also do that in our monitoring tools. And when we see that we have a lot of these problems, we can also take it a step further and automate a response to those particular problems. If we see that we have a log filling up, well, the steps that it takes to clear that log are pretty consistent. For example, we truncate the log, or maybe we have some files on the file system that we can delete, empty the trash bin, for example. And that gives us enough room that the log is no longer full. And then, as I mentioned, we want to go through on a regular basis and make sure that we are eliminating those noisy alerts that don't provide us any actionable information. I like to point out that this is fireproofing rather than firefighting, right? And that changes the whole balance of how we approach our day. If we're always firefighting, we're always going, Oh gosh, what's going to blow up today? But if we're fireproofing, we're making things better for tomorrow, we know that next week and the week after, we'll have less alerts because we have figured out how to automate the response to those alerts. Get rid of all those alerts that have no action unless you want to do something like, you know, keep track of a tally. But the habit stops sending you messages if there's no actionable response to it. And we also want to audit those alerts on a regular, like I said, just like they were bugs, a P1 to a P3 sort of bug. Early in my career, I realized if I was judging these capabilities and success of my data estate by how much the phone rang, then I was making a big mistake. Don't depend on your users to tell you what is going wrong. What happens is your users have already had a degradation in trust for you when they have to call you and get you to fix something, or they have to call the ticketing system, you know, enter something into the ticketing system. They're already mad. Whereas if you can find those problems and fix them before they ever know about it, slowly but surely, you'll be good to build trust and credibility. So let me show you one example that I was mentioning a moment ago. One of our tools, SQL Sentry, has an extremely rich set of capabilities around alerting and alert responses. For example, we could take a look at all of the alerts from a log point of view inside of SQL Server. So this would keep track of all the different kinds of global alerts that are happening. We can also look at our advisory conditions. These are specially written alerts that focus on very detailed aspects of database performance. We have over 100 of them already pre written for you, in which you can see all of those specialized alerts, whether they're critical, high, medium, or low. It could be anything from long running open transactions to a great deal of latency on your disk drive. It could be a set of performance counters from Azure Data Factory that tell you a particular pipeline has been very, very slow for some reason. You can write your own as well as take advantage of the many hundreds that we have as well. We can also look at any of our servers from a kind of a, I guess you could say an outlook kind of metaphor. And so when we look at a server that has a lot of scheduled tasks, we see this little tiny icon here that looks like a clock on the wall. That means it's a SQL agent job. We can see that this SQL Agent job update measures was kicked off at about twenty five minutes after the hour. That little blue mark there tells us that's how much time it took to run-in that segment. If it's red, that means it's problematic. For example, this is a deadlock that happened. We have also in looking at the different icons here, lightning bolt means it's an advisory condition. This little symbol here indicates that a top SQL, threshold was exceeded. And so we can see all of these kinds of alerts happening, and we could tell that they're real and actionable. Something's going on that needs to be addressed here. So that's a very powerful aspect of it as well. Another thing to show you is that you can get all of this in the Microsoft Data Platform without using SQL Agent if you have to. So for example, you can use extended events. Here I've written a script for an RDS, Amazon RDS SQL Server. And I've said, capture all of the standard severity level of errors of 16 or greater. And then I also went in and added some tracing for a variety of very specific error message. For example, 824825 are known to be really big issues for IO. So I've added all of those different kinds as well. And this was based on something that a Microsoft program manager named Pedro Lopez wrote many years ago. So this is, you know, some very, very powerful capabilities. When we are using Database Performance Analyzer, we also get a variety of different built in metrics that we track and alert on. And you can also create your own custom alerts as well. The information that it stores is a little bit we get a little bit more information than what we're seeing here, but it is a bit less granular than what you would get with SQL Sentry. Now let's talk about automation. And for what it's worth, I do have a full one hour session just around automation and ways to really up your game in automation. And that particular slide deck and presentation did not include the newest things we can do with agentic. Ai, which is extremely powerful. Probably should do a new version of that that includes AgenTic AI. So when we're automating, that's a way we can take the toil out of different activities we have to do. If I have to do the same thing more than three times, I'm going to try to automate it. And by that, I mean, at least have a script available to me that all I have to do is provide a couple input parameters and it will do the rest of the work. However, there's a few that I really want to draw your attention to because they're just industry standards in the world of Microsoft databases, at least. For example, we have Ola Hallengren's management suite, and this does all of the preventative maintenance that you might want to do for a given SQL Server. Does your DBCC checks, looks for corruption, suspect pages. And then it will perform your backups. It will enable those to in a variety of ways. For example, if you need it to be a faster backup, it could spread that across multiple backup files. Very, very capable, beloved everywhere. Another, as I mentioned, is the DBA tools set of PowerShell scripts. So we have the copy database commandlet. We have the sync DBA availability group commandlet to make our AGs synchronized. Kind of stuff that would normally take us a long, long time to do by hand. It does in the click of a button. For example, one I love is that you can say, take this database maybe in production because the QA team wants to test against the production level of data and copy it over to a different file that they can attach and use themselves. And that used to take a long time for me to do. And now just click a button, the script runs and it takes care of all the rest for me. Also, we want to do as much scripting that will help us heal and self heal things. So I mentioned things like a transaction log filling up. There is a variety of scripts in commandlets in the DBA toolkit, DBA tools that will enable you to do lots and lots of things that are either self healing or grooming and care and care taking of your databases. For example, you could set up an automated rule for it to regularly, regularly store all of your certificates, your different tokens and things like that, and have that as part of a distinct part of your backup and recovery methodology. So really, really cool things that you can do. Also, there's some processes I, I encourage, especially if you're in a large multinational kind of company, you want to build a handover ritual, in which you're following the sun in terms of your support. And so you want to have a kind of a process by which, let's say here in The USA, we hand off to our colleagues in India, and it is a formal handing of the baton. And so we know that they have taken up what we have handed off. And, maybe they're going to continue to remediate some different activities, that we have started to troubleshoot and they're going to, carry on with that. Make sure that we have a good uptake on their side. Yes. But more importantly for us about to do the handoff, that knows, that lets us know that we have done everything properly and we can now relax, right? It gives us better work life balance because we know we've given it to someone who is going to handle it and we can now let go of that stress. And our colleagues in India will continue to work for their workday and then they'll hand off to our colleagues in Europe. Then again, they hand off and turn back to The USA. So you want to use that follow the, sun sort of strategy in a situation like this. Also, I like to use, some of the techniques that are, recommended for the airline industry. For example, I like to use a technique called the sterile cockpit. And the sterile cockpit says that, if you're in a really important, you know, you're in an urgent situation with your airplane, you don't let every flight attendant, open the door and say, Hey, is everything okay? Keeps you from being able to do your job of fixing this urgent fire. So this is one area where I encourage you to, make an agreement with your boss, the head of that team, that if I tell you we are in a heads down, very urgent firefighting moment, everybody comes to you for information and only you come to us. We don't have to answer 12 users calling us saying, when's that server going to be up? We tell you and then you in turn tell them. So that's one of those processes that really helps us get better at what we're doing. And so it helps us even go further in terms of things like capacity planning for humans. We kind of want to have a situation of the eightytwenty rule, except that it's inverted from what we normally think of as humans doing 80% of the work in automation, taking over 20%. We want to flip that. We want automation to take care of 80% of the stuff that we have to deal with. All of those urgent but not important things that keep happening, password requests or something like that. If that's not scripted, then you're wasting so much time, you know, updating passwords and permissions and things like that. So only 80% of your week should be scheduled as well. You should have 20% of your week or about, one day a week. Maybe it's sliced up into individual hours, but you should have one day a week in aggregate that you can use for enrichment. You can use for kind of technical improvements and education and so forth. Now, again, I mentioned all these different things that you can automate. Our tools have it. There's other tools that you can use as well as the built in features of SQL Agent. And finally, let's talk about streamlining your your health checks. Now, this is another thing that you should be doing on the regular. And one of the bedrock principles of being able to do health checks is that you have to do first a baseline. So you have to record the telemetry that comes out of your monitoring system and aggregate that in such a way that you know what your averages are, your median value is, what the high watermark is, the low watermark is for a variety of different performance metrics. This is extremely important. If you don't have those metrics in place, you're you're gonna be kinda eyeballing everything and just going on a hunch. Whereas if you know what your baseline is for normal CPU on Server Australia 76, then you can actually be very responsive and very scientific about how you do respond in time. Right? Changing a setting. If you don't have a baseline and you decide, oh, you know, I'm going to change max degrees of parallelism and cost threshold for parallelism, but you haven't created a baseline yet, you'll never know if that actually provided a benefit or not. You have to have that baseline to see and to compare anytime you make significant changes. Health checks also enable you to do capacity planning. Many times the first, option that DBAs and IT managers take is they choose, Oh, well, I'm getting complaints about performance. Our server has another slot or two slots for RAM. So let's throw some more RAM into that SQL Server. The problem is you have no idea whether memory constraints were the issue. And so by having a baseline, you can see how likely it is before you ever take that technique and take that approach to fix the problem to see if it will fix the problem. Many times, hardware at a problem won't fix it because the problem is either Suctacular SQL, SQL that sucks so bad it just consumes however much resources you have, Or SQL databases that are designed very poorly and will never give you good performance. And then also with your health checks, don't forget Query Store. I find a lot of folks forget that this is a built in feature now in more recent versions of SQL Server, and it gives you great information about queries that have regressed or aren't performing well. And then there's also a couple kind of community tools that are really strongly endorsed. SP Who is Active? Written by a friend of mine named Adam Mechanic. The First Responders Toolkit written by another friend, Brent Ozar, and his volunteers. And a person named Eric Darling also has some really great SP human and looking at CPU and some of those kinds of features and how they're behaving in your SQL servers. So let me show you a couple of capabilities with regards to streamlining your health checks. And one of the things that's most essential in order to have very conclusive and informative health checks is we absolutely need a baseline. And this is a set of features that is built into both DPA and SQL Sentry. They work a little differently. In DPA, what we do is we have this baseline dropdown that's provided in every even new instances that you've never touched before. You can compare to the previous period, which is the thirty minute time period we are looking at on the graphs right now. We can look at the previous day or the previous week and compare. When we do that, we're currently looking at the average here. And you can see a kind of a line that follows the average. I actually prefer to see the range, which shows us the upper and lower range of what is normal for this server. So we can see that here, if we look at the database IO and we're looking at log flushes, we can see that they are actually normally quite a bit higher for that particular period of time. And they drop down quite low later on this thirty minute time period that we see in this window here. So now let's look at a larger time period and I'm going to say 8AM. And so this is pretty much for the last twenty six hours. And we can say, all right, I need a new baseline. It turns out that this time period right here, which is fairly active, is a better representation for what is normal on our system. So we could say create baseline. We give it a name, a description, and what SQL Sentry does behind the scenes is it looks at all of the performance data. It collects the min and the max of those, the high and low watermark, as well as the average. And we can see what the standard deviation of our bell curve is. So in SQL Sentry, we can even have performance alerts that say not when CPU is 80% or higher, we could say when it's one standard deviation off of the average. That would also let us know if CPU was super low on the left side of the bell curve, because that's very abnormal too. Normally we're at 30 or 40% CPU consumption. So it gets both of those conditions in a very effective way. Now, if you never built your own, you might not know what PerfMon counters you would want to use to build a baseline. And we have those all available for you here. And you can look at the documentation too. Just go to the help documentation for SQL Sentry and look up creating custom baselines and you will see all of that information for you. It's also something you can freely query from the database as well. Okay, now that's on SQL Sentry. To streamline that on DPA is even easier. So let me switch over to DPA and show you that example. So DPA has some capabilities called tuning advisors, and these are basically health checks that are built right in and they're so powerful. So if we go to home on DPA, we can see overall all the different, broad metrics of our whole data estate. We can see, for example, we've got DB2 or SQL Server. We have different alarms and so forth going off. We can check those here. So if we look at a particular SQL Server and drilling down, We see the wait states that happened for a given day and over time. We can see that we have some blocking and so forth happening. And we also see that there's a red line under Tuning and Resources. This Tuning Advisor is basically a continuously running health check and it checks three kinds of things on your server. It checks indexes, it checks queries, and it checks tables. So maybe a table is having some problems because it has a strange data type in there. Maybe your SQL queries are performing poorly. This particular query right here, 98% of the wait time for memory and CPU is caused by this query. Over here, we see on our different tables and so forth that we have some indexes that would greatly benefit processing all the transactions that hit that dbo. Orders table. And so we can see that there's two queries that have accumulated an enormous amount of the weight statistics against that. And if we were to create this non clustered index it recommends, it would save us a lot of time. Over two hours and thirty eight minutes spent waiting, creating this index will help alleviate those weights. Same way, if we look at this query, we can say, Okay, well, I know that the query advisor will give me information if there's an obvious fix. Maybe we've got, say, a user defined function that's problematic. Another thing that we can use it for too is we can use our AI Query Assist in which we send a package of information to our LLM. The LLM will process through that query, explain the issues that the query has, and propose a better performing alternative. So there are some really amazing health checks built into the tool as well. So with that, I've shown you several different set of features and characteristics of our tools, as well as free and open source tools that you could use in a Microsoft SQL Server environment. There are also tools like that for Postgres, Oracle, MySQL. And if you have any recommendations that regard for free and public tools, I'd love to hear your response back on those. We've got a little bit of time for Q and A. So if you'd like to, feel free to ask your questions now and I will try to resolve those. And as we wait for those to come in, I want to express gratitude and say thank you so much for spending your time with me. I hope this was valuable to you and you learned. And if that's the case, please pay it forward. Share that knowledge with others. And with that, I hope to talk to you soon. Thank you.