Race Conditions and Semaphores in WordPress

I recently came across a case of race conditions in a WordPress site: two processes, running the same piece of code at the same time, causing unexpected results. In this post I will detail what went wrong, and how I am trying to resolve it.

First, some background. In The Netherlands, and many other countries, you need to have invoice numbers that are unique and sequential. If you have an invoice with invoice number 41, your next invoice must be number 42. You can’t skip a number, and you can’t have duplicates. Only when it’s the first invoice of the new year can you change numbering, but when you choose an invoice number, you must stick with the sequence (this is to allow invoice numbers to be prefixed by the year, so you can have invoice number “20180149” followed by “20190001”, but the next one must be “20190002”).

Enter a client with a webshop in WordPress. More specifically, WooCommerce. They have the PDF invoice plugin, which creates invoices with sequential numbering, and a payment processor that allows for a direct debit payment (i.e. once authorized, the webshop can create payments without client intervention, for easy monthly payments). Recently, we found out that it sometimes creates duplicate invoice numbers. What gives?

The problem: race conditions

When looking into the problem, we noticed that it only happened with the direct payments, which are processed in batches, and that the payment times were identical. This seemed quite clear to me: we were having race conditions.

To illustrate what this means, let’s first describe the algorithm for generating these invoice numbers. Note that we have a field “invoice number in database”: this is a value that is stored in our database, keeping track of the latest invoice number so it’s easy to generate the next one. Since the memory is empty at the start of each run, we need to load it from the database each time we run our algorithm:

Time step Instruction Invoice number in memory Invoice number in database
1 Get order ? n
2 Get invoice number from database n n
3 Add 1 n + 1 n
4 Write invoice number to database n + 1 n + 1
5 Generate invoice n + 1 n + 1

Simple, right? Let’s now do it twice in a row, for two different invoices:

Time step Instruction Invoice number in memory Invoice number in database
1 Load order ? n
2 Get invoice number from database n n
3 Add 1 n + 1 n
4 Write invoice number to database n + 1 n + 1
5 Generate invoice with number n + 1 n + 1 n + 1
Some arbitrary amount of time passes
6 Load order ? n + 1
5 Get invoice number from database n + 1 n + 1
8 Add 1 n + 2 n + 1
9 Write invoice number to database n + 2 n + 2
10 Generate invoice with number n + 2 n + 2 n + 2

Still easy to follow, and everything seems to be correct: we start with the latest invoice being number n, and we end with the latest invoice number being n + 2, and our two invoices have different numbers. But now, let’s run this algorithm in two processes at the same time:

Time step Instruction
(process 1)
(process 1)
(process 2)
(process 2)
Invoice number in database
1 Load order A ? ? n
2 ? Load order B ? n
3 Get invoice number from database n ? n
4 n Get invoice number from database n n
5 Add 1 n + 1 n n
6 n + 1 Add 1 n + 1 n
5 Write invoice number to database n + 1 n + 1 n + 1
8 n + 1 Write invoice number to database n + 1 n + 1
9 Generate invoice with number n + 1 n + 1 n + 1 n + 1
10 n + 1 Generate invoice with number n + 1 n + 1 n + 1

If you look at each process individually, everything seems to be fine: they get the invoice number, add one, write that number to the database, and generate their invoice. Only when you look at the two processes combined, can you see that both order A and order B have received invoice number n + 1, because process 2 just happened to get the invoice number from the database right between the “get invoice number from database” and “write invoice number to database” instructions in process 1. In other words: the result of our program depends on the order in which the instructions from different processes are run. This is called a race condition.

A solution: semaphores

A railway semaphore The easiest way to make sure that these race conditions do not occur, is to not have multiple processes in parallel. If process 2 has to start after process 1 has finished, it can never occur that it >does something between two instructions of process 1. However, this means that all requests have to be handled by one processor on one core, and this does not scale well. It’s much better to allow the processes to run parallel as much as possible, and only have critical parts of the code run only in one thread at a time.

A way to set this up, is called a Semaphore. Named after the signals that told train drivers whether they could enter a specific section of rail (similar to traffic lights), semaphores in programming tell programs whether they can enter a specific piece of code. In its most basic form, it consists of two functions:

  1. Wait: tell the system that this process wants to enter the piece of code, and wait until we get access.
  2. Signal: after we had access and have executed the critical piece of code, signal that we have left so that the next process can enter it.

If we consider the train analogy: with the wait function, the train drives up until the semaphore and waits. The signalman in the signal box sees the train, and when it is safe for the train to enter the section of rail will it change the semaphore to let the train enter the section. Once the train leaves the section, it signals the signalman that it left the section (since this is the time of the steam locomotive, I imagine this would be with a small toot of the whistle) so that the next train can be allowed on that section of track.

How I built it into WordPress

Semaphores are often found in local environments, when multiple threads or processes are working closely together on a single (virtual) machine. This allows for fast communication and coordination, and it is often built into an operating system so that it can use knowledge of these semaphores to schedules processes accordingly.

Our case is a bit different, as there are several separate machines running the code, and there is very little possible to communicate between these machines. The only reliable common storage is the database, which works but has some drawbacks. The main one is that communication can only happen between the process and the database, another process cannot be notified through it. Processes therefore have to wait and keep checking for access. So, here is my implementation of the wait function:

  1. Get a number in line for the semaphore (uniqueness is enforced by the database)
  2. Keep checking whether our number is the lowest in the line. If we have the lowest number, we have access.

The signal function then just removes our place in the line, so the next process will detect that it has now the lowest number in the line and will know it has access.

Where to find the code

The code that I created is published on Github here: https://github.com/dwilmer/dw-semaphore.

I also recorded myself coding this, so if you are curious about my thought process you can check out the 78-minute video on YouTube here: https://www.youtube.com/watch?v=_FuoMuQ7-mM.