This is about the failure of an IT application system. I have come across situations in my career (I am not ashamed to admit it) where the system failed, the server broke down and the SLA was shattered. I expect many of you have come across such situations too.

When a system goes down unexpectedly, the business application is unavailable and the people responsible are summoned with the logs, along with an account of what they did and did not do. There is a lot of shouting, an exchange of feedback between the customer and the service provider, and sometimes the back line of the OEMs gets pulled in. Most of us have faced such situations (except the lucky few), and these incidents stay in our memory like milestones.

I feel that the life cycle of a system in the production phase (after the project team hands it over to the support team) falls clearly into three phases: the smooth running phase, the warning phase and the breakdown or hung phase. This is more or less like a traffic signal at a road crossing, with its green, yellow and red signal phases. After a lot of observation of my own incidents, as well as listening to my peers, I feel that the warning phase, or yellow signal phase, is the most significant one. Nowadays every system landscape throws warning and error messages for a considerable period of time, giving a chance to rectify the problem online, before it hangs or shuts down. This is very similar to slowing the car down gradually before stopping at a red signal. This is the phase where we can actually measure the effectiveness of the design, of the support team and of the IT process overall.

Any enterprise application comes into the production phase resting on three basic designs: the soft design or logic design, which essentially consists of programs and configuration; the hard design or infrastructure design, which is mainly responsible for service availability; and the security design, which acts as a gatekeeper between the user and the available service. All three designs go through an acid test during the yellow signal phase. For example, you may or may not have a workaround that avoids a system reboot or shutdown for a warning message you might face in the system landscape you support. Or your team members may not be very clear about their responsibilities when a system warning or error message occurs.

A support team must have a process appropriate for the yellow phase, with everyone's role clearly defined, and the ECAB (Emergency Change Advisory Board) members must obviously be part of this process. This process must have answers to the following questions:

  1. Are all the possible warning and error messages classified?
  2. Does a workaround exist that can be applied to avoid a system shutdown or reboot?
    • If yes, then
      1. Are the possible warning messages benchmarked, with thresholds defined in terms of value and frequency?
      2. Are the alerts configured in the automated or manual monitoring system?
      3. Is each alert assigned to a person/team?
      4. Does the person/team responsible have a checklist for attending to the alert?
      5. Is the person/team trained to perform the activities on the checklist?
    • Else
      1. Is the non-existence of a workaround mentioned and highlighted in the design documents?
      2. Has the associated risk been agreed upon by the business?
      3. Is the SLA calculation done on this basis?
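
To make the "If yes" questions a little more concrete, here is a minimal Python sketch of what a benchmarked warning might look like: a threshold on the value, a threshold on the frequency, an assigned owner and a checklist. Every name in it (`WarningRule`, `breach_limit`, `storage-team`, the disk usage figures) is a hypothetical illustration of mine, not something taken from a real monitoring tool, and a real landscape would normally configure this inside its monitoring system rather than in hand-written code.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List


@dataclass
class WarningRule:
    """One classified warning with its benchmark (all fields are hypothetical)."""
    name: str
    threshold: float        # value at or above which a reading counts as a breach
    breach_limit: int       # number of breaches in the window that raises the alert
    window_seconds: float   # length of the sliding time window
    owner: str              # person/team the alert is assigned to
    checklist: List[str]    # steps the owner follows when the alert fires
    _breaches: deque = field(default_factory=deque, init=False, repr=False)

    def observe(self, timestamp: float, value: float) -> str:
        """Record one reading and return 'yellow' if the alert fires, else 'green'."""
        if value >= self.threshold:
            self._breaches.append(timestamp)
        # Forget breaches that have fallen out of the sliding window.
        while self._breaches and timestamp - self._breaches[0] > self.window_seconds:
            self._breaches.popleft()
        return "yellow" if len(self._breaches) >= self.breach_limit else "green"


# Hypothetical rule: 3 readings of 85% or more within 5 minutes means yellow.
rule = WarningRule(
    name="disk_usage_pct",
    threshold=85.0,
    breach_limit=3,
    window_seconds=300.0,
    owner="storage-team",
    checklist=["Check log growth", "Archive old logs", "Escalate to ECAB if still rising"],
)

readings = [(0, 90.0), (60, 70.0), (120, 91.0), (180, 95.0), (600, 50.0)]
states = [rule.observe(t, v) for t, v in readings]
```

With these made-up readings the rule stays green until the third breach arrives inside the five-minute window, turns yellow at that point, and returns to green once the old breaches age out. The point of the sketch is only that value, frequency, owner and checklist are recorded together, so that nobody has to guess who acts, or how, when the yellow signal appears.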