Saturday, June 23, 2007

On Call, Day Five

All day Friday we worked on a complex issue that a department had reported the day before. We thought the issue was resolved, but the site manager reported back that it was not. We were given two contact names in addition to the site manager; both contacts could help us track down the problem.

We had enlisted the help of a tech in Manila (actually one of the very few people who could help...the primary support is in Europe and most of the techs are on Midsummer holiday for a month) who was attentive, but not overly helpful. He was available, which at this point was pretty helpful in its own right. The problem was confusing because end user "A" had originally reported that a problem that had cropped up two weeks ago had surfaced again. The troubling part was that this problem had been solved and everything had been running fine for two weeks, and now we had one person saying it was back. The site manager had reported that "the whole department" was down and that this had to be made a high priority.

During this time, end user "B" is designated as a contact for us. We call her and she says the issue has been stable since 11:00 AM. This contradicts what the site manager says, and we are now a bit confused. 5 PM rolls around and I am on call. Since this issue has been open all day, I have to follow up on it. The tech in Manila and end user "B" both agree that the system has been stable.

About 9 PM, I call end user "B" (since she is working overtime to get caught up on some data entry) to see how things are going. She tells me that the system has been stable for 10+ hours. I check with the tech in Manila and he confirms that the last hiccup came at around 11:00 (our time). I start asking end user "B" some questions, telling her what the site manager and end user "A" have reported, and she begins to tell me that end user "A" has a problem, but it is not the same one that end user "B" was having. Aha...two issues were married into one ticket, but only the one problem was understood. End user "B" goes on to tell me that the problem end user "A" is having could very well be an operator error...meaning she is entering invalid data that is causing the problem. However, "B" cannot verify this because "A" is not in the office and will not be back until Monday. "B" tells me that she will get with "A" on Monday to try to fix the error and that until then, the case can be put into pending. I thanked her and hung up.

I then put on my thinking cap and retraced the issue from the minute I got the original call until I put it in pending a few minutes ago:

End user "A" has a problem when she enters data in $Application. The app hangs, just like it did two weeks ago when $Application was down. She tells me the same problem has occurred. The team that worked on it the last time is called to work on it this time. "A" calls this in at 3 PM and leaves at 5 PM. Tech is willing to work on it during the night but wants "A" available for testing, she refuses so the ticket sits in pending overnight.

The next day the site manager hears about it and assumes it is the same issue as before (which has been fixed, but the fix is a workaround until the $Application vendor can make a patch). The workaround that is in place is a little quirky and causes the system to hang for short periods while it kicks off. This cannot be helped until $vendor delivers the patch. No one told us about this quirk; we figured it out on our own. About the time end user "A" has this problem again, end user "B" notices the quirk of the workaround, but not knowing about it, she assumes the system is down. She tells the site manager and "all hell breaks loose". I call the site manager and discuss things with her, and she is somewhat rude to me, telling me that we need to fix the issue. I tell her that a permanent fix has to come from $vendor, but she just pooh-poohs it, telling me that it is critical and we (my company) need to fix it. We hang up, neither of us in a great mood.

The site manager calls back (after talking to end user "A") saying the system has been up and down all afternoon. Our contact in Manila does not see that. The system has been up and running for 4+ hours at this point and all transactions are processing. The site manager tells us that the entire department has experienced the outage, so we must be wrong. This goes on all day until I go "on call".

After I wade through all the issues and the problems, I e-mail everyone involved (there is no way I am calling the site manager at 10:30 PM after the way she reacted to my 8:00 PM call), CC'ing a manager who had been in the loop earlier. I explain my findings and why I believe what I believe. I also mention how helpful end user "B" has been, and that since she is my contact and she is not having a problem, I am not going to pursue this case until end user "B" can verify what "A" is reporting. Right now only one person is having this problem, so it is looking very much like operator error. I would find it hilarious if all the extra work created by this "outage" was actually caused by end user "A" doing something incorrectly. I bet the site manager will have an aneurysm!
