In the article "Learning From Your Bugs," I wrote about how I tracked some of the most interesting bugs I've encountered. Recently, I reviewed all 194 of my entries to see what lessons I could learn from them.
Below are the most important lessons I've summarized, covering three aspects: coding, testing, and debugging.
First, coding. These are some issues I've experienced that can lead to challenging bugs:
1. Event order. When dealing with events, asking the following questions can be very effective: Can events arrive in a different order? What happens if we don't receive this event? What happens if this event occurs twice in a row? Even if it doesn't usually happen, a bug in another part of the system (or interacting systems) may cause the event to occur.
2. Too early. This is a special case of "Event order" above, but it has caused enough troublesome bugs to deserve its own entry. For example, if signaling messages are received too early, before the configuration and startup procedures have finished, a lot of strange behavior can result. Another example: a connection was marked as down before it was even placed in the idle list. While debugging that problem, we kept assuming the connection had been set to down while it was in the idle list (but then why had it not been taken out of the list at that point?). It was a failure of imagination on our part not to consider that things sometimes happen too early.
3. Silent failure. Some of the hardest bugs to track down were caused in part by code that silently fails and carries on instead of throwing an error. For example, a system call (such as bind) returns an error code that is never checked. Another example: parsing code that simply returns, rather than throwing an error, when it encounters a faulty element. A call that continues for some time in a faulty state makes debugging much harder. It is better to report an error as soon as the fault is detected.
4. If. If-statements with several conditions, if (a or b), especially when chained, if (x) else if (y), have caused many bugs for me. Even though if-statements are conceptually simple, they are easy to get wrong when there are multiple conditions to keep track of. These days I try to rewrite the code to be simpler to avoid having to deal with complicated if-statements.
5. Else. Some bugs are caused by not properly considering what should happen when the condition is false. In almost all cases, every if-statement should have an else part. In addition, if you set a variable in one branch of an if-statement, you should probably set it in the other branch too. A related case is setting a flag. It is easy to add only the condition for setting the flag and to forget the condition for resetting it. A flag that is left permanently set is likely to cause bugs later.
6. Changing assumptions. Many of the bugs that were hardest to prevent in the first place were caused by changing assumptions. For example, at the beginning there might be only one customer event per day, so a lot of code is written under this assumption. Later, the design changes to allow multiple customer events per day. When this happens, it is hard to update every case affected by the new design. Finding the explicit dependencies on the change is not difficult; finding the code that implicitly depends on the old design is. For example, there may be code that fetches all customer events for a given day, with the implicit assumption that the result set will never be larger than the number of customers. I don't have a good strategy for this problem. If you have one, please let me know.
7. Logging. Visibility into what the program does is very important, especially when the logic is complex. Be sure to add enough (but not too much) logging so that you can tell why the program does what it does. When everything works, it doesn't matter, but as soon as there is a problem, you will be glad you added the logging.
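To make lesson 3 (silent failure) concrete, here is a minimal Python sketch. The function names, the ParseError class, and the "key=value" input format are all hypothetical, chosen only for illustration. The first parser skips a faulty element without any signal; the second raises as soon as the fault is detected, so the failure surfaces where it happens rather than much later.

```python
class ParseError(ValueError):
    """Hypothetical error type raised for a faulty element."""


def parse_silently(text):
    """Anti-pattern: a faulty element is dropped without any signal."""
    result = {}
    for item in text.split(","):
        if "=" not in item:
            continue  # silent failure: the caller never learns about it
        key, value = item.split("=", 1)
        result[key.strip()] = value.strip()
    return result


def parse_strictly(text):
    """Fail fast: raise as soon as a faulty element is detected."""
    result = {}
    for position, item in enumerate(text.split(",")):
        if "=" not in item:
            raise ParseError(f"element {position} is not key=value: {item!r}")
        key, value = item.split("=", 1)
        result[key.strip()] = value.strip()
    return result


if __name__ == "__main__":
    data = "host=example, port 5060, user=alice"
    print(parse_silently(data))  # the 'port' entry is quietly missing
    try:
        parse_strictly(data)
    except ParseError as err:
        print("rejected:", err)
```

With the silent version, the missing entry would probably only be noticed later, in whatever code expects it. With the strict version, the error message points at the faulty element directly.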
As a developer, I am not done with a feature until I have tested it. At a minimum, this means that every new or changed line of code has been executed at least once. In addition, unit tests and functional tests are good, but not enough. New features must also be tested and explored in a production-like environment. Only then can I say that I have completed a feature. The following are some important lessons about testing that I have learned from my bugs:
8. Zero and null. Whenever applicable, always test with zero and null. For strings, this means testing both the zero-length string and the null string. Another example is disconnecting a TCP connection before any data has been sent on it. Not testing these combinations is the number one reason bugs got past my testing.
9. Add and delete. A new feature often includes the ability to add new configurations to the system, for example a new configuration file for mobile number conversion. It is natural to test that new configuration files can be added. However, I have found that it is easy to forget to test that deleting them works as well.
10. Error handling. Code that handles errors is often difficult to test. It is best to have automated tests that exercise the error-handling code, but sometimes this is not possible. One trick I sometimes use is to temporarily modify the code so that the error-handling code runs. The easiest way to do this is to invert an if-statement, for example changing if error_count > 0 to if error_count == 0. Another example is misspelling a database column name so that the desired error-handling code runs.
11. Random input. One way of testing that often exposes bugs is to use random input. For example, the ASN.1 decoding in the H.323 protocol operates on binary data. By sending random bytes to the decoder, we found several bugs in it. Another example is using scripts that generate test calls, in which call duration, answer delay, which party hangs up first, and so on are randomly generated. These test scripts exposed many bugs, especially when events occurred close together and interfered with each other.
12. Check the actions that should not occur. Testing usually involves checking that the expected action has occurred. But it is easy to overlook the opposite case: checking that an action that should not happen really did not happen.
13. Own tools. I usually create my own small tools to make testing easier. For example, when I worked with the SIP protocol for VoIP, I wrote a small script that could reply with exactly the headers and values I wanted. This tool made it easy to test many corner cases. Another example is a command line tool that can make API calls. By starting small and gradually adding functionality as needed, I ended up with some very useful tools. The advantage of writing the tools myself is that I get exactly what I want.
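Lesson 11 (random input) can be sketched in a few lines of Python. The toy length-prefixed decoder below is a hypothetical stand-in, not the ASN.1 decoder mentioned above; the point is the test loop. It feeds random bytes to the decoder and accepts only two outcomes: a decoded result or the decoder's own DecodeError. Any other exception is recorded as a bug. The seed is fixed so that a failing input can be reproduced.

```python
import random


class DecodeError(Exception):
    """The only exception the (hypothetical) decoder is allowed to raise."""


def decode_fields(data: bytes) -> list:
    """Decode a sequence of fields, each a 1-byte length followed by payload."""
    fields = []
    pos = 0
    while pos < len(data):
        length = data[pos]
        pos += 1
        if pos + length > len(data):
            raise DecodeError(f"field at offset {pos - 1} is truncated")
        fields.append(data[pos:pos + length])
        pos += length
    return fields


def fuzz(decoder, rounds=10_000, max_len=64, seed=1234):
    """Feed random byte strings to decoder; return the inputs that crashed it."""
    rng = random.Random(seed)  # fixed seed: failures are reproducible
    crashes = []
    for _ in range(rounds):
        size = rng.randrange(max_len + 1)  # includes the zero-length input
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            decoder(data)
        except DecodeError:
            pass  # rejecting bad input cleanly is fine
        except Exception as exc:  # anything else is a bug in the decoder
            crashes.append((data, exc))
    return crashes


if __name__ == "__main__":
    print("unexpected crashes:", len(fuzz(decode_fields)))
```

A decoder that reads past the end of its buffer would show up in the returned list, for example as an IndexError, together with the exact bytes that triggered it.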
It is never possible to find all bugs during testing. In one case, I changed the handling of correlation numbers. Each number consists of two parts: a routing address prefix (usually unchanged) and a dynamically allocated number from 000 to 999. The problem was that when a correlation number was looked up, the first digit of the dynamically allocated number was removed by mistake before the lookup in the table. That is, 637 became 37. This meant that everything worked until the numbers reached 100. The first 100 calls were fine, but the next 900 failed. So unless I tested more than 100 times without restarting (which I did not), I would not find this problem during testing.
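The arithmetic behind this example can be checked with a short Python sketch. The function names are hypothetical, and the sketch assumes that the table is keyed by the numeric value of the dynamic part and that numbers are handed out in order starting at 000; the description above implies both but does not state them. Dropping the first digit leaves values below 100 unchanged (037 becomes 37, which is still 37) and breaks every other value.

```python
def lookup_key_correct(dynamic_part: str) -> int:
    """Key used for the table lookup: the full three-digit number."""
    return int(dynamic_part)


def lookup_key_buggy(dynamic_part: str) -> int:
    """Reproduces the bug: the first digit is dropped before the lookup."""
    return int(dynamic_part[1:])


def count_working_lookups() -> int:
    """Count how many of the 1000 allocated numbers still resolve correctly."""
    working = 0
    for n in range(1000):
        dynamic_part = f"{n:03d}"  # "000" .. "999"
        if lookup_key_buggy(dynamic_part) == lookup_key_correct(dynamic_part):
            working += 1
    return working


if __name__ == "__main__":
    print(lookup_key_buggy("637"))   # 37
    print(lookup_key_buggy("037"))   # 37
    print(count_working_lookups())   # 100
```

A test that loops over the whole allocation range, instead of making a handful of calls after a restart, would have caught this.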
14. Discussion. The debugging technique that helps me most is discussing the problem with a colleague. Often, just explaining the problem to a colleague makes me realize what is wrong. In addition, even if they are not very familiar with the code in question, they can often come up with good ideas. Discussions with colleagues are particularly effective when dealing with the most difficult bugs.
15. Pay close attention. When debugging a problem takes a long time, it is often because I made wrong assumptions. For example, I think the problem occurs in a certain method, when in fact the code never even reaches that method. Or the exception thrown is not the one I thought. Or I think the latest version of the software is running, but it is actually an old version. Therefore, be sure to verify the details rather than assume them. It is easy to see what you expect to see instead of what is actually there.
16. Recent changes. When something that used to work stops working, it is usually caused by something that changed recently. In one case, the most recent change was only to logging, but an error in the logging caused a bigger problem. To make regressions like this easier to find, it helps to put different changes in different commits and to describe each change clearly.
17. Trust users. Sometimes, when a user reports a problem, my instinctive response is, "This is impossible. They must have done something wrong." But I have learned not to respond this way. Time and again, it has turned out that what they reported is actually what happened. So these days I take what they report at face value. Of course, I still double-check that everything is set up correctly and so on. But I have seen so many cases where strange things happen because of unusual configurations or unexpected uses that my default assumption is that they are right and the program is wrong.
18. Test the fix. When a fix for a bug is ready, it must be tested. First, run the code without the fix and observe the bug. Then apply the fix and repeat the test case. Now the faulty behavior should be gone. Following these steps ensures that it really is a bug and that the fix really solves it. Simple but necessary.
In the 13 years that I have been tracking the most difficult bugs I have encountered, many things have changed. I have worked on small embedded systems, large telecommunications systems and web-based systems. I have used C++, Ruby, Java and Python. Several kinds of bugs I encountered when working in C++ have completely disappeared, such as stack overflow, memory corruption, string problems and some forms of memory leaks.
Other problems, such as loop errors and corner cases, I see much less often. However, this does not mean that there are no bugs. The lessons in this article are intended to help reduce bugs in the three stages of coding, testing, and debugging. If you have other useful techniques for preventing and finding bugs, I would be glad to hear about them.