Sunday, June 6, 2010

Planning a Maintenance Window

Do you feel the heat when you sense a maintenance window zooming your way? I don't think that is unnatural; sometimes I do too. Maintenance is inevitable. I hope you will all agree that 'the root cause of unplanned outage is an environment that is not up-to-date'. After a recent experience of mine, I thought I should share how this can be done better with some simple precautions.

I always feel that a production maintenance window, whether it is for Data Center network maintenance, a DB migration, or anything else that directly or indirectly affects your customers, should be meticulously planned well ahead. I personally feel it must be planned for off-hours, and it requires challenging coordination between different activities and often between people from different departments. An off-hours window always helps to buy some more time, which gives a feeling of having extra room to work!! If you have multiple offices, especially geographically distant ones, do try to draw a diagram of each office's working hours (something like a Venn diagram) and pick out the intersection: the time when no one is working.
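
The intersection of off-hours described above can be sketched in a few lines of Python. This is only an illustration: the office names, UTC offsets, and 09:00-18:00 working hours are made-up assumptions, and fixed offsets ignore daylight saving time and weekends.

```python
# Hypothetical offices: (name, UTC offset in hours, local start hour, local end hour).
# Fixed offsets are an assumption; they ignore daylight saving time.
OFFICES = [
    ("Bangalore", 5.5, 9, 18),
    ("London", 0.0, 9, 18),
    ("San Francisco", -8.0, 9, 18),
]

SLOT_MINUTES = 30
SLOTS_PER_DAY = 24 * 60 // SLOT_MINUTES


def is_working(slot, utc_offset_hours, start_hour, end_hour):
    """True if the UTC slot falls inside the office's local working hours."""
    utc_minutes = slot * SLOT_MINUTES
    local_minutes = (utc_minutes + int(utc_offset_hours * 60)) % (24 * 60)
    return start_hour * 60 <= local_minutes < end_hour * 60


def quiet_slots(offices):
    """UTC slots (by index) during which no office is working."""
    return [
        slot
        for slot in range(SLOTS_PER_DAY)
        if not any(is_working(slot, off, start, end) for _, off, start, end in offices)
    ]


def fmt(slot):
    """Format a slot index as HH:MM (UTC)."""
    minutes = slot * SLOT_MINUTES
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


def to_ranges(slots):
    """Merge consecutive slots into (start, end) pairs; does not merge across midnight."""
    ranges = []
    for slot in slots:
        if ranges and ranges[-1][1] == slot:
            ranges[-1][1] = slot + 1
        else:
            ranges.append([slot, slot + 1])
    return [(fmt(a), fmt(b)) for a, b in ranges]


if __name__ == "__main__":
    for start, end in to_ranges(quiet_slots(OFFICES)):
        print(f"Nobody is working from {start} to {end} UTC")
```

With these sample offices the only shared quiet period is 02:00 to 03:30 UTC, which shows how narrow the window can get once three continents are involved.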

Worries

Maintenance is an activity aimed at mitigating future risks, but we must also be aware of delicate, long-lingering issues that may in turn present some other threat and hamper the successful completion of the current activity. The idea is to draw a flow chart and expect everything to fail; believe me, you will cover most of the points that might need time-consuming attention and additional resources in the course of the event.

  1. Communication, clear communication, well in advance. Do loop in all the stakeholders, and send timely, gentle reminders until you actually start the activity.
  2. Do test your backup/fail-over to confirm that it actually works.
  3. Do send a reminder email just before it starts (at least 15-30 min. ahead).
  4. Minimize the downtime as much as possible. Do NOT involve tired resources (humans) to drive the activity.
  5. Make sure your maintenance plan includes all the pre/post activities and provides an estimated time for each step (allow more time than needed, and add risk time too). This will allow you to estimate the total time and set expectations.
  6. Test, test and test. Release a completion email only when you are sure things are working.
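
Point 5 above (an estimate per step, padded, plus risk time) can be worked out with a small script. This is a rough sketch: the step names, the 1.5x padding factor, and the 30-minute risk buffer are illustrative assumptions, not recommended values.

```python
from datetime import datetime, timedelta

# Hypothetical plan: (step name, best-guess duration in minutes).
STEPS = [
    ("Pre-checks and backup verification", 20),
    ("Stop application traffic", 10),
    ("DB migration", 60),
    ("Post-migration tests", 30),
]

PADDING = 1.5       # assumed factor: allow more time than needed for every step
RISK_MINUTES = 30   # assumed extra buffer added once at the end


def build_schedule(start, steps, padding=PADDING, risk_minutes=RISK_MINUTES):
    """Return ([(name, step_start, step_end), ...], window_end)."""
    schedule = []
    cursor = start
    for name, minutes in steps:
        padded = timedelta(minutes=round(minutes * padding))
        schedule.append((name, cursor, cursor + padded))
        cursor += padded
    window_end = cursor + timedelta(minutes=risk_minutes)
    return schedule, window_end


if __name__ == "__main__":
    schedule, window_end = build_schedule(datetime(2010, 6, 6, 2, 0), STEPS)
    for name, step_start, step_end in schedule:
        print(f"{step_start:%H:%M} - {step_end:%H:%M}  {name}")
    print(f"Announce the window as ending at {window_end:%H:%M}")
```

The end time you announce is the padded total plus the risk buffer, so stakeholders see one conservative number rather than the optimistic sum of the raw estimates.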

Once we had a network maintenance. The mail was pretty poetic: very well planned, lots of points, a perfect backup strategy. But one point that was NOT highlighted was that normal phone lines would also be shut down in this window (and I missed that point, which made the experience worse). There was also no mention of who was driving the activity, who the SPOC to call was in case of any issue, or what the war room phone number was, if any!! I was working on a DB migration and suddenly got kicked out of the meeting place, the VPN, and everything else, one after another. I had seen their communication well in advance, but the backup plan of connecting to another VPN did not work for me, and there was no contact information telling me whom to reach about it. Since the VPN was down, there was no way I could look up the office directory, and I faced a total blackout for a good amount of time. The point here is the importance of "contact information", of being responsive, and of quick follow-up.

The minimum information that needs to be provided in the email announcing the activity can be:
  1. A brief on the 'what' and 'why' of this activity
  2. BU details
  3. Start time (with time zone)
  4. End time (with time zone)
  5. Related ticket no. (if any)
  6. Maintenance contact person details
  7. Expected behavior/impact
  8. Fail-over details (for the full details you may like to provide a separate URL).
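
One way to make sure none of these fields is forgotten is to refuse to build the email until they are all filled in. The sketch below is only an illustration; the class, field names, and labels are hypothetical and simply mirror the list above, with the ticket number treated as optional.

```python
from dataclasses import dataclass, fields

# Hypothetical field names mirroring the checklist; the ticket is optional ("if any").
OPTIONAL_FIELDS = {"ticket"}

LABELS = {
    "what_and_why": "What and why",
    "business_unit": "BU",
    "start_time": "Start time",
    "end_time": "End time",
    "ticket": "Ticket",
    "contact": "Contact",
    "impact": "Expected impact",
    "failover": "Fail-over",
}


@dataclass
class MaintenanceNotice:
    what_and_why: str = ""
    business_unit: str = ""
    start_time: str = ""   # include the time zone
    end_time: str = ""     # include the time zone
    ticket: str = ""
    contact: str = ""      # who is driving, and how to reach them during the window
    impact: str = ""
    failover: str = ""


def missing_fields(notice):
    """Return the names of required fields that are still empty."""
    return [
        f.name
        for f in fields(notice)
        if f.name not in OPTIONAL_FIELDS and not getattr(notice, f.name).strip()
    ]


def render(notice):
    """Build the plain-text email body, refusing to render an incomplete notice."""
    missing = missing_fields(notice)
    if missing:
        raise ValueError("incomplete notice, missing: " + ", ".join(missing))
    return "\n".join(
        f"{label}: {getattr(notice, name) or 'n/a'}" for name, label in LABELS.items()
    )
```

A draft without contact details fails loudly instead of going out to everyone, which is exactly the gap that caused the blackout in the story above.
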

But I strongly feel that a maintenance window should not be that painful a process, especially nowadays with advanced cluster/cloud capabilities, improved storage availability, and resilient network architecture. Organizations can now plan a maintenance window as a relatively safe, low-risk process, given proactive and clear communication.


Cheers!

DEBU

