The Code Works But The Code's Wrong
The most dangerous bug is not in the code but in the mind of the person writing the code. You can have code that works which hides a bug that exists in the design. This kind of bug can lay dormant for years. It’s only revealed when a benign change causes serious problems. This kind of bug can be prevented by treating the design of software as a thing apart and making sure that the code reflects the design as closely as possible. What follows is a story of how some code that worked turned out to be wrong.
- The code was hard to decode.
- The code had a bug
- The bug was fixed
- The code was made easier to decode
- The new code started making duplicate records
- The code was reverted
- Accidental Grace
- The Error of Modular Reasoning
- So what?
The code was hard to decode.
It handled the confirmation of subscription renewals on the app stores(Apple/Google Play/etc…). When a customer first subscribes, we create a subscriber record which had the renewal date. When the renewal date came up we create a new subscriber record for that month, put it in a Pending
state and then send a request to the app store to see if the customer renewed their subscription.
A subscriber was encoded in the following data structure:
type Subscriber = {
state: 'Active' | 'Inactive' | 'Pending',
invoice_month: number,
invoice_year: number,
// The above two fields answer questions like:
// How many active subscribers did you have in January 2023?
renewal_date: Date | nil,
type: 'Regular' | 'FreeTrial'
// Other fields that don't matter for this story are omitted
}
The specification for renewing a subscription went like this
RenewalService
- Given a
Subscriber
in theActive
state with arenewal_date
in the past- Create a new
Subscriber
with aPending
state and theinvoice_month
andinvoice_year
of the current month and year. - Send a request to the app store get the state of the subscriber on their end
- If the response comes back saying the subscriber is current
- Update the
Subscriber
record toActive
- Update the
renewal_date
to match theexpiration_date
in the app store response
- Update the
- If the response comes back saying the subscriber is not current
- Update the
state
toInactive
- Update the
- If the response comes back saying the subscriber is current
- Create a new
but the code did more.
The code handled free trial conversions. Converting a free trial is like a renewal, but not quite. A free trial has a date it ends like a regular renewal(we used renewal_date
for this) but that date is not usually the next month. A free trial often ends within the same month it was started.
Adding the specification for the free trial conversions inside of RenewalService
leads to following:
RenewalService
- Given a
Subscriber
in theActive
state with arenewal_date
in the past- If the
invoice_month
andinvoice_year
are the current month and year This is the check for free trial ending within the same month it started :- Change the
state
of theSubscriber
toPending
while keeping the sameinvoice_month
andinvoice_year
- Send a request to the app store to find out if the customer converted
- If the response comes back saying the subscriber is current
- Update the
state
toActive
- Update the
type
toRegular
- Update the
renewal_date
to match theexpiration_date
in the app store response
- Update the
- If the response comes back saying the subscriber is not current
- Update the
state
toInactive
- Update the
- If the response comes back saying the subscriber is current
- Send a request to the app store to find out if the customer converted
- Change the
- If the
invoice_month
andinvoice_year
are not the current month and year This means it's a regular renewal or a free trial that ended the month after it started :- Create a new
Subscriber
with aPending
state and theinvoice_month
andinvoice_year
of the current month and year. - Send a request to the app store to get the state of the subscriber on their end
- If the response comes back saying the subscriber is current
- Update the
state
toActive
- Update the
renewal_date
to match theexpiration_date
in the app store response
- Update the
- If the response comes back saying the subscriber is not current
- Update the
state
toInactive
- Update the
- If the response comes back saying the subscriber is current
- Create a new
- If the
It also did a couple more things but they don’t factor into the rest of the story. It restored subscriptions(again something like a renewal, but not quite). Under certain situations that happened with some other code somewhere else in the system it would change a subscriber from monthly to yearly or the reverse. And it created or updated records specific to each app store.
The code had a bug
Some renewals were skipping months. For example, a subscription started in July, that was suppose to get renewed for August, would come up as being renewed for September. The code was costing money.
The bug was fixed
This is not about the bug fix, this is about what was done as part of the bug fix and what happened after.
The code was made easier to decode
I’m a solid dev so I single-responsiblitied the code. I created a FreeTrialConversionService
, SubscriberRenewalService
, and SubscriberRestorationService
.
The specifications for the FreeTrialConversionService
and the SubscriberRenewalService
are outlined below. You’ll notice that it is just the RenewalService
broken up into two separate services. The tests were unchanged and they all passed.
FreeTrialConversionService
- Given a
Subscriber
in theActive
state with arenewal_date
in the past and atype
ofFreeTrial
- If the
invoice_month
andinvoice_year
are the current month and year- Change the
state
of theSubscriber
toPending
but keep the sameinvoice_month
andinvoice_year
. - Send a request to the app store to find out if the customer converted
- If the response comes back saying the subscriber is current
- Update the
state
toActive
- Update the
type
toRegular
- Update the
renewal_date
to match theexpiration_date
in the app store response
- Update the
- If the response comes back saying the subscriber is not current
- Update the state to
Inactive
- Update the state to
- If the response comes back saying the subscriber is current
- Change the
- If the
invoice_month
andinvoice_year
are not the current month and year- Pass the
Subscriber
to theSubscriberRenewalService
- Pass the
- If the
SubscriberRenewalService
- Given a
Subscriber
in theActive
state with arenewal_date
in the past- Create a new
Subscriber
with aPending
state and theinvoice_month
andinvoice_year
tied to therenewal_date
This is what fixed the bug. The background job we use to run the renewal process could get backed up at the end of the month and spill over into the following month. This lead to subscribers who were actually renewed in March showing up as being renewed in April. Theirinvoice_month
value should have been March but it was April. This was the source of skipped months. . - Send a request to app store get the state of the subscriber on their end
- If the response comes back as saying the subscriber is current
- Update the
Subscriber
record toActive
- Update the
renewal_date
to match theexpiration_date
in the app store response
- Update the
- If response comes back saying the subscriber is not current
- Update the
state
toInactive
- Update the
- If the response comes back as saying the subscriber is current
- Create a new
No behavior was changed.
The new code started making duplicate records
It was tens of thousands when someone on the support team brought it up on Slack. It was hundreds of thousands by the end of the day. I could not figure out how I fucked up so bad. This is much worse than the original bug. The duplicate records were breaking a lot more functionality across the system and costing a lot more money. But as far as I knew the new code did the same thing.
The code was reverted
To stop the bleeding and figure out what happened.
Accidental Grace
Let’s talk about the app store response. The responses from the app store were parsed and made to conform to a single interface
interface AppStoreResponse {
active: boolean,
expiration_date: Date,
}
The thing that really mattered was the expiration_date
. An expiration_date
in the past meant that the subscriber’s access to the product had expired so active
would be false
.
One of the app stores had a different rule.
This app store required us to offer a grace period. This meant that active
would be true
while the expiration_date
was in the past.
This meant that the app store response had this specification:
AppStoreResponse
- Given a request for subscriber information
- If the subscriber is current
active
istrue
expiration_date
is in the future
- If the subscriber is in a grace period
active
istrue
expiration_date
is in the past
- If the subscriber is not current
active
isfalse
expiration_date
is in the past.
- If the subscriber is current
The specifications for renewing a subscriber and the app store response never matched. Subscribers who were in a grace periods were continually being picked up by the query that picked up subscribers up for renewal. In the initial implementation, since this line
If the
invoice_month
andinvoice_year
are the current month and year.
was true they were treated like free trial conversions. The code “worked”
I put the word, worked, in quotes above because the code did a half-assed job. Grace periods were never dealt with so we did not keep track of them or have specific logic to deal with them. This left a significant hole in how we dealt with subscribers. We didn’t even know to ask certain questions or do certain things, for example: How many subscribers are currently in a grace period? Is there a huge jump in subscribers in grace periods after the holidays? We also could have only checked the status of subscribers in a grace period after the grace period was over which would avoid unnecessary requests to the app stores.
by accident. The new implementation revealed the mismatch in specifications. The SubscriberRenewalService
, made a new subscriber record each time. And each the record ended up in an Active
state with a renewal_date
in the past which were immediately picked up to go back into the SubscriberRenewalService
. Which is what happened in the initial implementation(RenewalService
) with the exception that new Subscriber
records were not created.
The Error of Modular Reasoning
In his blog post, The Three Levels of Software: Why Code that Never goes Wrong Can Still Be Wrong James Koppel summarizes the three levels of software as:
- Level 1: Runtime. The runtime level deals with specific values and a specific environment from a single execution of the program. A lot of debugging is done at the runtime level.
- Level 2: Concrete Implementation/Code. At the level of the concrete implementation, we think about what the current implementation could do when given arbitrary inputs and an arbitrary environment. Behaviors that cannot happen are not considered, even if it requires global reasoning to rule them out. A lot of implementation work is done at the code level.
- Level 3: Design/Logic: At the level of logic, we consider the abstract specification of each unit of a program. When using other units, we only consider the guarantees made by the spec, and assume they may be replaced at any time with a different implementation. Many programs which are correct when viewed at the concrete implementation level are not correct when viewed at the logical level, because they rely on behavior which is not guaranteed to hold in all future versions. We call this an error of modular reasoning, because functions with such errors lack a desirable property: the ability to argue that the function is correct only from the function’s code and from the contracts of the function’s dependencies, without need to even glance at the dependencies. Most software design is done at the level of logic.
The code worked at the runtime level. The code “worked” at the concrete level. The code did not work at the design level.
He puts it another way:
A program is wrong if the reasoning for why it should be correct is flawed.
You’ll notice that I mentioned the code but never show any. This is because this is isn’t really about a specific implementation(“the code”). It’s about the reasoning that produced the code. The code was written with the reasoning that it only had to deal with regular subscribers and free trials. It could have been implemented in many different ways. It would been wrong no matter how it was written.
The new implementation revealed the error of modular reasoning in the initial implementation.
So what?
Talk is cheap. Show me the code.
I started my programming career believing that code was king but code is getting cheaper to produce everyday. A talk with the person who initially wrote the code would have saved me from a lot of confusion, frustration, and about five hundred thousand duplicate records. So actually,
Code is cheap. Let’s talk about it.
The code comes at the end of a long series of steps. In rough outline the steps are, the request, the requirements, turning requirements to acceptance criteria, coming up with a plan, executing that plan. I’ve merged the last two parts in my programming career. I’ve started executing, expecting to come up with a plan as I go aided with the scaffolding of writing tests first. It’s served me reasonably well but in the last couple of years I’ve realized that the “coming up with a plan” part is its own process. The design of software is a thing apart.
What does it mean that the design of software is a thing apart? It means that the information about the design of piece of a software is not in the code. It means that there are assumptions the programmer makes about the inputs to the code and the outputs of the code that are not in the code. The initial implementation of the RenewalService
always expected the AppStoreResponse
to have an expiration_date
in the future if the the subscription was still active. This was nowhere in the code. It was in the mind of the person writing the code.
After you have a design, the next thing to do is to make sure the code reflects the design as much as possible
When it can’t be in the code, it can be in a comment.
. If the initial implementation of RenewalService
threw an error if the AppStoreResponse
returned an active subscription with an expiration_date
in the past, the design flaw would have been discovered much sooner.
I urge you to spend time designing your program and then make sure your code reflects the design.
Published: 2023-01-13
Last Edited: 2023-01-29