Help your teams find random bugs

The customer is shouting because “it’s not working again”.

The account manager is shouting because “the customer is shouting”. Customer service has opened yet another clone ticket and increased the priority. Development can’t reproduce and asks to “do more troubleshooting before we can fix anything”. QA has “already spent endless hours on this” and more effort will delay the next release.

When this happens, people usually talk about a “random” or “ghost” bug.

As no one owns the problem, this usually ends up on the product manager’s desk.

Definition of random bugs, and other descriptions

The definition of a random or ghost bug is unclear by nature, but it can be summarized in two aspects:

– Someone has had some problem that is not there anymore.

– This has been reported several times, so it’s unlikely to be a glitch of the universe, but it cannot be nailed down.

I’ve also heard people using terms such as instability and unreliability when random things seem to happen.

From the customer viewpoint, we can hear things like “it’s not working” or “it’s always broken” – and that’s usually the message that reaches the CEO, and then everyone gets really nervous.

A random bug is often a real symptom of something

In 99% of cases I’ve experienced, there was a real and very fixable problem eventually identified, and the customer was right.

The correct attitude, in case of a “ghost bug” situation, is not to say “I can’t reproduce, therefore it does not exist” but “It exists, therefore I must find how to reproduce it”.

Here are a few real-life cases explaining what could be happening. If you are facing a ghost bug situation, read through to get some ideas of what could be going wrong here. I keep updating this list when I find a new root cause.

Possible causes of random bugs

Specific OS/browser configuration

This might sound trivial, but I’ve seen it too many times to discard it. The customer hits a bug on IE/Safari/Chrome and the developers cannot reproduce it… simply because they prefer to use Firefox with Firebug and other troubleshooting tools. Basic step: get as close as possible to the customer environment. Sometimes this means actually going on site.

Severe network bottleneck or latency

In this scenario, the customer is on a very slow network and the page does not behave correctly. Typically, a page loads and renders graphically, but not all scripts have finished loading. The customer will try to click around (open calendars, dropdown lists) and these won’t work. This behaviour can look very random, as everything might work nicely once and stop working 5 minutes later, depending on the proxy/cache configuration of all servers in the chain. In some cases part of a request might even time out, so the script/resource never makes it to the customer’s browser.

Note that this might also happen in a corporate environment if the customer is sitting behind corporate firewalls that add latency or other restrictions.

One solution we implemented to help troubleshooting was a small speed test page, with a script loading an image 10000 times and measuring the total time. Customer service could ask the customer to load the page and write down the total time once finished, to get a sense of the client’s network quality.
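The original speed test was a script inside a web page. As a rough illustration of the same measurement loop, here is a minimal Python sketch. The function name measure_total_seconds and the fake_fetch stand-in are hypothetical; in a real check, fetch would download the test image from your own server.

```python
import time
from typing import Callable


def measure_total_seconds(fetch: Callable[[], object], repetitions: int = 10000) -> float:
    """Call fetch() `repetitions` times and return the total elapsed time in seconds."""
    start = time.perf_counter()
    for _ in range(repetitions):
        fetch()
    return time.perf_counter() - start


def fake_fetch() -> bytes:
    # Hypothetical stand-in for "download the test image once".
    # A real version could wrap urllib.request.urlopen(<your image URL>).read().
    return b"\x00" * 1024


if __name__ == "__main__":
    total = measure_total_seconds(fake_fetch, repetitions=10000)
    print(f"Total time: {total:.3f} s")
```

The single number printed at the end is what the customer would read back to customer service.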

Your servers are close to saturation

Effects very similar to the ones described above can be observed when the server itself is not able to answer quickly, or to answer all requests. The random behaviour of ghost bugs is explained by the fact that server traffic comes and goes, so one user might see everything as expected and the next user will have problems. In case of server saturation, your operations teams should have alerts popping up everywhere (memory leaks, packet loss, high CPU…).

Remember to double check with Operations.

Different regional settings on user’s local machine

In one of our projects, some customers were reporting getting errors upon submitting a simple form. The submission was working perfectly for everyone else.

It turned out that all the users experiencing the problem were using a specific English regional setting (in Windows, this is typically found under Control Panel>Regional and Language Options). These settings were different from those used by the development/QA teams.

This can create problems with date or currency formats. In our case, a currency value was entered as “100”, transformed into “100,00” (with a comma) by the OS, and triggered an error because the server was expecting a numeric value like “100.00” (with a period).

Once identified, it was very easy to reproduce just by changing a QA machine’s regional settings to the same ones as the customer.
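To make the failure concrete, here is a minimal Python sketch of a strict parser that rejects “100,00”, next to a more tolerant one. Both function names are hypothetical, and the tolerant version assumes amounts never contain thousands separators; it is one possible approach, not the fix we actually shipped.

```python
from decimal import Decimal, InvalidOperation


def parse_amount_strict(text: str) -> Decimal:
    """Accept only a period as the decimal separator, like the server in the story."""
    return Decimal(text)


def parse_amount_tolerant(text: str) -> Decimal:
    """Accept a single comma or period as the decimal separator.

    Assumption: no thousands separators, so more than one separator is rejected.
    """
    cleaned = text.strip()
    if cleaned.count(",") + cleaned.count(".") > 1:
        raise ValueError(f"ambiguous amount: {text!r}")
    return Decimal(cleaned.replace(",", "."))


if __name__ == "__main__":
    for raw in ("100.00", "100,00"):
        try:
            print(raw, "->", parse_amount_strict(raw))
        except InvalidOperation:
            print(raw, "-> rejected by the strict parser")
        print(raw, "->", parse_amount_tolerant(raw), "(tolerant parser)")
```

A test suite that feeds both spellings to the form handler would have caught this without touching any machine’s regional settings.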

Session expiration

Most applications manage a session to hold user navigation data. For example, if a user enters a departure city on a travel site, it’s nice to remember that entry and prefill the departure city for the rest of the session. Another example is storing local filtering preferences so that when the user hits the browser’s “back” button, they find the same filtered state and don’t have to start over.

In some cases the session might expire abruptly, losing the previous state or other important data, and the application then behaves differently from what is expected in a normal session. Abrupt session expirations are not normal, and can trigger a “random” pattern.

How could a session expire?

1. timeout – user went to the bathroom, picked up a call, grabbed a coffee… whatever. Most web sites only keep a session for 20-30 minutes.

2. conflicts between different browser tabs or windows accessing the same web site at the same time.

3. server problem: a session object requires memory; the longer the sessions and the higher the number of concurrent users, the more memory is allocated. It can happen that memory management is not perfect, and some sessions are deleted or expired sooner than expected – especially in case of peak traffic.

4. load balancer issue (see below).
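The timeout case (cause 1 above) is the easiest to reproduce on purpose. Here is a minimal Python sketch of an in-memory session store with an idle timeout. The SessionStore class, its method names and the injectable clock are all hypothetical; real frameworks handle this differently, but the effect on the user is the same.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class SessionStore:
    """Toy session store: a session disappears after `idle_timeout_seconds` of inactivity."""

    def __init__(self, idle_timeout_seconds: float = 20 * 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = idle_timeout_seconds
        self._clock = clock  # injectable so a test can "fast-forward" time
        self._sessions: Dict[str, Tuple[float, dict]] = {}

    def save(self, session_id: str, data: dict) -> None:
        self._sessions[session_id] = (self._clock(), dict(data))

    def load(self, session_id: str) -> Optional[dict]:
        entry = self._sessions.get(session_id)
        if entry is None:
            return None
        last_seen, data = entry
        if self._clock() - last_seen > self._timeout:
            # Session expired: the prefilled city, filters, etc. are gone.
            del self._sessions[session_id]
            return None
        # Any access counts as activity and restarts the idle timer.
        self._sessions[session_id] = (self._clock(), data)
        return data
```

With a fake clock, QA can simulate the coffee break instead of waiting 20 minutes, and check what each page does when load returns nothing.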

Dynamic inventory

This cause also sounds trivial, but it happens all the time.

Inventory can be number of books in stock, available hotel rooms on a specific date, available flight seats at a specific fare. It can be extremely difficult to reproduce a bug in this environment, because the overall system status might be impossible to reproduce.

If for some reason there was a bug when buying book number 100 and leaving 99 available… by the time someone in CS or tech tries to reproduce there might just be 40 books available, or 300, in any case not the exact scenario.

Currency fluctuations

If your web site is managing multiple currencies, you could have discrepancies between amounts displayed to different users if your exchange rates are not updated at the same time on all servers or applications.

You have a load balancer architecture… probably working well

In a typical system architecture, browser queries are first received by a load balancer that will then dispatch them to one of multiple servers in a farm. This is a good idea, because if one server goes down or is slowed down with some heavy processing, the load balancer will simply dispatch traffic to the other servers. Another advantage is that when you have to deploy a new release you can do it server by server, therefore not stopping service to all of your customers.

A-ha. So now we can have a situation where user 1 is dispatched to server A running release 34.12, while user 2 is dispatched to server B running release 33.99, and the two users do not get the same behaviour.
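The effect can be sketched in a few lines of Python. This assumes a plain round-robin dispatch, which is only one of many balancing strategies; the dispatch function and the server names are illustrative.

```python
from itertools import cycle
from typing import Dict, List, Tuple


def dispatch(users: List[str], server_versions: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Assign each user to the next server in turn (round robin, an assumption).

    Returns (user, server, release) so the version mix is visible.
    """
    rotation = cycle(server_versions.items())
    return [(user, server, release) for user, (server, release) in zip(users, rotation)]


if __name__ == "__main__":
    # Mid-deployment: server A already upgraded, server B not yet.
    farm = {"A": "34.12", "B": "33.99"}
    for user, server, release in dispatch(["user 1", "user 2", "user 3"], farm):
        print(f"{user} -> server {server} (release {release})")
```

When investigating a ticket, it helps to know which server, and therefore which release, actually answered the request.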

Caching

This is the Mother of all problems when trying to find ghost bugs. You have released a new version of a page or script, but some of the users still have the old version in their browser’s cache.

They might be loading a new page, but then use an old cached javascript. Or they might see a new page with a mix of old and new icons.

Caching is tricky to manage, as it’s impossible to control where your elements could be cached: they could be cached by the user’s browser, by the corporate proxy, by the ISP proxy, by a content distribution network like Akamai, etc. Yes, there are rules and tags to specify how each element is supposed to be cached, but… have you implemented tagging on all elements? Have you not forgotten CSS files? Is all of the Internet (routers, firewalls, proxies) reading and respecting the caching parameters you’ve specified?

So far I’ve only heard of one fool-proof solution, which consists of appending a version and/or datestamp to any object name to ensure it is refreshed in all caches. So instead of a generic “logo.gif” icon or “calendar.js” script, you would have “logo_101225v1.gif” or “calendar_v13.js”.
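Renaming files by hand is easy to forget, so the renaming is often automated at build time. The Python sketch below uses a short hash of the file content in place of the manual version or datestamp described above; that swap is my assumption, and the versioned_name function is hypothetical.

```python
import hashlib
from pathlib import PurePosixPath


def versioned_name(filename: str, content: bytes, digest_chars: int = 8) -> str:
    """Return e.g. 'calendar_<hash>.js': the name changes whenever the content changes."""
    path = PurePosixPath(filename)
    digest = hashlib.sha256(content).hexdigest()[:digest_chars]
    return str(path.with_name(f"{path.stem}_{digest}{path.suffix}"))


if __name__ == "__main__":
    old = versioned_name("calendar.js", b"function openCalendar() { /* v1 */ }")
    new = versioned_name("calendar.js", b"function openCalendar() { /* v2 */ }")
    print(old)
    print(new)  # different name, so no cache can serve the old script for it
```

The pages must of course reference the new name, which is why this usually lives in the build or templating step.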

Data fields with specific values, characters or lengths

In one of my projects we had complaints from customer service teams that valid user credit cards were not accepted by our system once in a while (random behaviour). We asked customer service to track down all occurrences and related customer details. For some reason, about 1 out of 10 cards was rejected by our payment gateway provider, with the error “wrong CVC code”. But obviously, the CVC code entered was right.

The issue was that the CVC data type was declared as numeric. Any CVC code starting with 0 was simply stored and sent without its leading zero (code 076 was just sent as 76). 1 out of 10 cards was therefore failing validation in what seemed to be a random pattern.

Changing the field type to alphanumeric solved the problem.
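The bug and the “1 out of 10” figure are both easy to reproduce. In this Python sketch the function names are hypothetical, and the 10% estimate assumes 3-digit codes spread evenly from 000 to 999.

```python
def store_cvc_as_number(cvc: str) -> str:
    """The buggy path: a numeric field drops leading zeros."""
    return str(int(cvc))


def store_cvc_as_text(cvc: str) -> str:
    """The fixed path: keep the code as text, and only check that it looks like a CVC."""
    if not (cvc.isascii() and cvc.isdigit() and len(cvc) in (3, 4)):
        raise ValueError(f"not a valid CVC: {cvc!r}")
    return cvc


def share_with_leading_zero() -> float:
    """Fraction of 3-digit codes (000-999) that start with 0."""
    codes = [f"{n:03d}" for n in range(1000)]
    return sum(code.startswith("0") for code in codes) / len(codes)


if __name__ == "__main__":
    print(store_cvc_as_number("076"))  # the gateway receives a different code
    print(store_cvc_as_text("076"))
    print(share_with_leading_zero())
```

Seeing a failure rate that matches a simple fraction like this is a good hint that the bug depends on the data and not on chance.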

A similar situation can happen if an input value contains special characters (é, à, ü, …) that somehow are not processed/stored correctly across all modules. You’ll then experience “random” issues, which are in fact not random at all once you realize they only happen on a few specific records.
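One common way this happens is an encoding mismatch between two modules: one writes UTF-8, the other reads the same bytes as Latin-1. This is an assumed example of the mechanism, not a description of a specific case we had. A minimal Python sketch:

```python
def misread_as_latin1(text: str) -> str:
    """Simulate module A writing UTF-8 bytes and module B decoding them as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


if __name__ == "__main__":
    for name in ("Rene", "René"):
        stored = misread_as_latin1(name)
        status = "unchanged" if stored == name else "corrupted"
        print(f"{name!r} -> {stored!r} ({status})")
```

Plain ASCII records survive the round trip untouched, which is exactly why only a few records show the problem.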

Another similar situation was created by an input field truncating after a given length. The end user could type in the whole string (like a long email address or a long last name), but in those few cases where the string was too long, it somehow got truncated later on. We’ve had this case with email addresses being used as usernames: profile creation was successful, but trying to log out and log in again was failing randomly. In fact, it was only failing for a few very long emails.
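The login failure can be sketched as follows in Python. The AccountStore class, the 30-character limit and the method names are hypothetical; the point is only that creation truncates silently while login compares against the full string.

```python
from typing import Dict


class AccountStore:
    """Toy account store keyed by email address."""

    def __init__(self, max_username_length: int = 30) -> None:
        self._max = max_username_length  # hypothetical column width
        self._passwords: Dict[str, str] = {}

    def create_profile(self, email: str, password: str) -> None:
        # Bug: the email is silently cut to the column width.
        self._passwords[email[: self._max]] = password

    def create_profile_checked(self, email: str, password: str) -> None:
        # One possible fix: refuse over-long values up front rather than truncating.
        if len(email) > self._max:
            raise ValueError(f"email longer than {self._max} characters")
        self._passwords[email] = password

    def login(self, email: str, password: str) -> bool:
        # Looks up the full email as typed, so truncated accounts are never found.
        stored = self._passwords.get(email)
        return stored is not None and stored == password


if __name__ == "__main__":
    store = AccountStore()
    for email in ("ann@example.com", "a.very.long.firstname.lastname@example-company.com"):
        store.create_profile(email, "secret")
        print(email, "->", "login ok" if store.login(email, "secret") else "login FAILS")
```

Boundary-length values (exactly at the limit, one over, far over) belong in the standard QA data set for every text field.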

Unsavvy customer + unsavvy customer support

I should not even mention this root cause, because this is the preferred excuse for developers to avoid spending time on a bug. But unfortunately, it can happen.

Example. A user is at the end of a purchase process, and a friendly message tells them “To complete your purchase, please click the Pay Now button below”. But the “Pay now” button might be slightly different from what is expected: it might actually be labelled “Pay now and confirm”, or it might be a link instead of a button (this scenario is very frequent on translated web sites, where user messaging and buttons have been translated by different people and do not match perfectly). Whatever the glitch, the result is that an unsavvy user might call support saying “there is no Pay Now button”. An unsavvy rep will log a ticket with the topic “Missing Pay Now button”, obviously “high priority/urgent” because of potential missed sales, and the teams end up wasting hours trying to reproduce it on all possible environments.

The only measure against this type of situation is good troubleshooting skills in customer support. There are plenty of training courses and methods to ensure thorough troubleshooting, but in my experience it’s often a personal soft skill that some people have and some don’t.

Bugs in the 3rd party environment using your API

If you provide an API and your partners are handling the front end, it can happen that your teams receive calls for phantom bugs that are in fact due to bugs in the way the API is called. Typically, the end user first calls your partner’s support desk, which runs some troubleshooting routines and decides that the bug is related to your system. The partner then calls or logs a ticket at your support desk, claiming that your system behaves erratically.

We had a case where partner calls into our APIs were getting truncated to some maximum byte size, somewhere in the network stack. The truncation depended on the length of the field values, therefore causing “random bugs” in the sense that only a few, very long calls were failing.

Logs and trace analysis should provide you with clear answers here.

Have you experienced other causes for random bugs? Let me know and we’ll add to the list.