A Possible Approach to Designing and Utilising a Multi-Tiered Caching System with a Central Backend

Posted by: Chise Hachiroku


This is a lengthy technical article with considerable coverage.

This article has a word count of over 4,500. It may take 25 to 30 minutes to finish for those with some prior knowledge, and longer for those new to these concepts.

Disclaimer

This article documents the design and considerations behind the build of Eruthyll Wiki, my personal homepage, as well as this showcase website. Although this set of techniques has served me well, your mileage may vary, especially when requirements differ. Always do your own research before fully accepting one person's claims. This article was written between May and June 2023; my later publications may overturn arguments laid out in this one.

It has been a long-standing problem: how do we design a web-based service that ensures all the qualities visitors may reasonably ask of a system – availability/reliability, consistency, and performance?

For us, the website maintainers and administrators, this is an expansion of the elements of the famous CAP theorem – consistency, availability, and partition tolerance – with the addition of fast responses from the server.

In this article, I will discuss how we can design a multi-tiered caching system that maximises these goals when there is only one central backend and one central data store. I will also give a brief introduction to each technique and how it works, so that by the end of this article you should have a good grasp of the basic concepts involved.

But before we start, one must ask themselves…

Is this the right thing for me?

It probably is, if the following describe your situation:

  • General visitors do not get to make changes that take immediate effect.
  • I am able to run one central backend, which has good connectivity to the edge nodes.
  • Timing is not that critical for me – I am not running a news service or anything similar.
  • Requests may come from anywhere on Earth, and my server would be overloaded if all of them were processed raw.

It probably is not, if these sound more like you:

  • Users must log in to use the service, and the data served differs per user.
  • Raw enquiries that need to write to a database make up a considerable proportion of all requests.
  • I have enough resources to make, build, and maintain a distributed system of multiple backend nodes.

Checked? Let’s go.

Before we start designing, we must understand what a cache is and why it is useful. A cache is basically a memory space that allows quicker access to needed information at a smaller cost. It lets us improve availability (staying ‘always online’), performance (faster responses), and scalability (catering for more visitors and requests). Usually, caching is a technique that spends memory space to gain speed, but at times, using caches correctly can lead to decreased use of both.

We must understand that the ultimate target of caching is to ease the workload of our backend and prevent any part of the server from being overloaded. Overloading any part of the system could be catastrophic, as it could cause starvation, where requests in the queue are never fulfilled because processing does not start before the receiving end has already timed out.

Make the System Suitable for Caching.

When the system has a single backend, there is rarely any case where we need to consider partition tolerance. This is because all write requests are sent to the same database, where they can be serialised and executed in an order that causes no interference.

However, although modern systems are usually fully capable of handling such requests, we must consider edge cases where two users send individual requests and one removes the prerequisites of the other – but that is a problem for software engineering and database systems.

The system must be able to separate read and write requests, preferably through reliable automation or configuration by someone who knows what all these requests could be. For a web service, it is always important to note that only READ requests are truly cacheable, while WRITE requests should always be passed to the backend.

A good way to implement this structure is to distinguish by how requests are made. If things are properly configured, you can cache all GET requests and make pass-through requests only via POST. However, this may be risky when some frequent read requests are sent as POST, as happened on the first day of Eruthyll Wiki.

Another way to implement proper cache rules is to create a whitelist or blacklist enumerating all URIs that should or should not be cached. Although this can sometimes be handled by cache solutions, remember that these are not fully reliable, so the outcome must be validated and exceptions added where necessary. Even with this method, most POST requests still need priority to reach the backend.
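
To make both rules concrete, here is a minimal sketch of a cacheability check in PHP 8: only GET requests are candidates, and a manually maintained blacklist keeps sensitive URIs out of the cache. The URI prefixes are purely illustrative assumptions, not part of any real configuration.

```php
<?php
// Decide whether a request may be served from cache at all.
// Only GET is cacheable; everything on the blacklist always reaches the backend.
function is_cacheable(string $method, string $uri): bool
{
    if ($method !== 'GET') {
        return false;                                     // WRITE-style requests pass through
    }

    $never_cache = ['/login', '/admin/', '/api/session']; // hypothetical URI prefixes
    foreach ($never_cache as $prefix) {
        if (str_starts_with($uri, $prefix)) {
            return false;
        }
    }
    return true;
}

var_dump(is_cacheable('GET',  '/characters/eupheria'));   // bool(true)
var_dump(is_cacheable('POST', '/characters/eupheria'));   // bool(false)
var_dump(is_cacheable('GET',  '/admin/settings'));        // bool(false)
```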

The system also needs to handle resource locks correctly and ensure the ACID properties of all database transactions. Using an existing, mature solution is much preferred, especially when the reliability of custom-made systems cannot be guaranteed.

Cache as Precompiled Library.

Please note, if you are using a modern version of PHP (5.5 or later, which bundles OPcache), or are not using a scripting language at all, you do not need to do anything to get this. But as an introduction to what caching does, this is a good place to start.

Only very few people still write machine code these days, and even those who do something similar mostly write assembly. The vast majority of us are accustomed to writing structured, imperative programs that require compilation. Although just-in-time compilation is mature technology, what will not change in the near future is that compilation remains expensive – several of its optimisation steps, such as optimal register allocation, are NP-hard – so it is not work we want to repeat needlessly.

For code to be compiled into something executable, there are two commonly used approaches. One is to compile directly to machine code, as commonly seen in native applications; the other is to compile to bytecode or operation codes (opcodes) that can be easily translated into machine-specific code. Scripting languages like PHP take the latter route: the PHP executable parses and compiles the script into opcodes on demand, and the runtime then executes those opcodes.

As previously mentioned, compilation is expensive, and if we repeat it for every request there will without doubt be a lot of unnecessary computation, or overhead. Since compilers are deterministic, the opcodes will not change until the source code changes (given the same compiler), and they are therefore cacheable.

Of course, it is not a simple task at all, and the implementation varies between PHP variants and versions. But consider what caching does for us here: it saves and frees up all the time and resources we would otherwise spend, when we could instead do a table lookup in logarithmic or constant time.

Properly configured, we can also compile and cache opcodes for scripts while the server is idle, so we do not have to compile the heavy ones at peak times, and things work as if PHP were a precompiled language. This can be a big saver when we are handling a very considerable number of requests!

Although PHP now comes with OPcache, it is still better to have the option to purge the opcode cache manually, especially when the contents of some scripts have changed, or a CMS or dependency has been updated. In development environments, it is worth considering turning it off, to aid debugging and prevent cases where new code does not work with cached old snippets, or changes simply do not get reflected.
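
For reference, a minimal sketch of such manual maintenance using PHP's built-in OPcache functions is shown below; the paths are illustrative, and the calls only have an effect where the OPcache extension is enabled.

```php
<?php
// After deploying a changed script, drop its stale opcodes immediately:
opcache_invalidate('/var/www/wiki/index.php', true);   // true = force, ignore timestamps

// After a larger CMS or dependency update, clear the whole opcode cache:
opcache_reset();

// Optionally warm the cache while the server is idle, so peak-time requests
// never pay the compilation cost:
foreach (glob('/var/www/wiki/*.php') as $script) {
    opcache_compile_file($script);
}
```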

Now that we know what caching is capable of, let us move on to the realms beyond the standard issue.

Cache for Internal Queries.

One of the most frequent bottlenecks of a web system is the performance of the underlying database. Although we have come a long way in reducing the time and memory complexity of database operations, queries – especially complicated ones generated by filtering and searching, or by pages gathering information from various tables – still take considerable computation power to complete and incur many seemingly random reads from slow media such as SSDs or HDDs.

However, for most applications and systems, users’ requests tend to generate similar database queries very frequently. This is usually caused by a summary page that is shown to users by default, filters that are of particular interest to users, and other situations where a page or a certain section of a page is checked frequently. Such requests can constitute more than half of all queries at times, and if we run the query every time it is requested, we can imagine the workload placed on the database.

So, the reasonable solution here is to cache the results of queries, especially the common ones. The ideal outcome is that frequent queries hit the cache and return a seemingly correct result without being fired at the database at all. This is called an object cache. Usually, it is implemented in conjunction with Memcached or Redis; which one suits your website largely depends on your needs. Here is a comparison laid out by Amazon AWS.

Feature | Memcached | Redis
Sub-millisecond latency | Yes | Yes
Developer ease of use | Yes | Yes
Data partitioning | Yes | Yes
Support for a broad set of programming languages | Yes | Yes
Advanced data structures | – | Yes
Multithreaded architecture | Yes | –
Snapshots | – | Yes
Replication | – | Yes
Transactions | – | Yes
Pub/Sub | – | Yes
Lua scripting | – | Yes
Geospatial support | – | Yes

Choosing between Redis and Memcached, Amazon AWS
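
To make the idea concrete, here is a minimal cache-aside sketch using the phpredis extension and PDO. The key scheme, TTL, table, and query are illustrative assumptions, not a prescription.

```php
<?php
// Cache-aside object cache: check Redis first, fall back to MySQL on a miss,
// then store the result so the next identical query never reaches the database.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$pdo = new PDO('mysql:host=127.0.0.1;dbname=wiki', 'user', 'secret');

function cached_query(Redis $redis, PDO $pdo, string $sql, array $params, int $ttl = 300): array
{
    $key = 'sql:' . sha1($sql . '|' . serialize($params));   // key derived from the query itself

    $hit = $redis->get($key);
    if ($hit !== false) {
        return unserialize($hit);                            // served from memory
    }

    $stmt = $pdo->prepare($sql);
    $stmt->execute($params);
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

    $redis->setex($key, $ttl, serialize($rows));             // keep the result for $ttl seconds
    return $rows;
}

// A frequent listing query that now only hits MySQL on a cache miss.
$rows = cached_query($redis, $pdo, 'SELECT id, name FROM characters WHERE element = ?', ['Water']);
```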

But this also poses challenges in how we construct our queries. Because of the high overhead that SQL queries usually carry, many applications and plugins try to minimise the number of queries, which in many cases leads to quite long and complicated queries being designed and generated. This sometimes means that parts of a query which are actually cacheable get executed again because of the parts that are constantly changing.

We also need to think about which queries to cache, as well as which cached values to keep and which to drop, as memory space is not unlimited. Some common strategies include: maintaining a cyclic (first-in, first-out) cache (which ignores frequency entirely); keeping the most recently accessed entries, as in LRU (but some frequent entries may still be purged); keeping the entries accessed most frequently in a given period, as in LFU (but some of those may have been inactive for quite a while); or manually maintaining a list of queries identified as worth caching.

Before we conclude this section, we must be fully aware that we should only cache queries that make no permanent change to the database. Statements such as ALTER, CREATE INDEX, DELETE, DROP, TRUNCATE, INSERT, and UPDATE must never be cached. Why? Because they are not cacheable at all!

Cache as Store for Outcomes.

As we have discussed before, a page may need several requests to complete, and each of them will need the PHP runtime to execute opcodes to compute the response. However, as previously mentioned, a web service will surely have pages that are visited more frequently than others, such as the homepage, listing pages, or detail pages of particular interest to users. Caching is the technique of spending memory space to avoid unnecessary, duplicated computation, and in this case we store the outputs of that computation.

This is why we need the system to be designed for anonymous visitors most of the time, as logged-out visitors do not get personalised responses in such systems. In theory we could also cache separate components, but we must weigh the advantages against the cost – too many cache entries lead to slower lookups and deteriorating performance for visitors. If we are to construct pages from fragments, it might be better to have some edge computing in place instead of simple caching techniques.

Outcome caches can be generated as part of the publishing process for articles, pages, or items, so that we do not need to compute the result when it is requested for the first time, which may happen at a moment of high server load. We must also keep an eye on concurrency problems: multiple clients might ask for a non-cached page, all miss, and all enter the generation phase at once. One way to overcome this is to make regeneration asynchronous and place it in a queue with TTL control so that only one job runs, while serving the obsolete version one last time.
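
A minimal stale-while-revalidate sketch of this idea follows, assuming phpredis and a hypothetical render_page() function that performs the expensive generation; the lock and TTL values are illustrative.

```php
<?php
// Serve the cached page; when it is close to expiry, let exactly one client
// regenerate it (guarded by a short Redis lock) while everyone else keeps
// receiving the possibly stale copy.
function get_page(Redis $redis, string $uri, int $ttl = 600): string
{
    $key  = 'page:' . md5($uri);
    $page = $redis->get($key);

    if ($page !== false && $redis->ttl($key) > 30) {
        return $page;                                   // fresh enough, serve directly
    }

    $gotLock = $redis->set($key . ':lock', 1, ['nx', 'ex' => 60]);
    if ($gotLock) {
        $page = render_page($uri);                      // hypothetical expensive generation
        $redis->setex($key, $ttl, $page);
        $redis->del($key . ':lock');
        return $page;
    }

    // Another client is already regenerating: serve stale, or generate on a cold start.
    return $page !== false ? $page : render_page($uri);
}
```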

It is an option to store a copy of the computed outcome of every page served locally, with some exceptions, allowing much of the website to operate as if it were static and achieve near-static performance. This store does not have to be in memory – it could sit on an SSD or Optane drive as well – but it would be ideal for the lookup table to remain in memory, so that both cached results and missed requests get optimal response times.

For maximum performance, we can additionally send this outcome data to multiple edge servers at different locations using a content delivery network (CDN), aiming to bring data closer to users. The most common approach is for an edge server to save a copy of the served content when a request is made and keep it in storage for a given time. This is easy to reason about, but inevitably there may be some lag between the origin copy and the copy served by the edge servers or points of presence.

Cached copies also make the most critical functions of the service more fault tolerant, as they remove the single point of failure for content serving, and the website can return seemingly correct responses during incidents at the origin. If an existing commercial CDN service is used, we must understand its basic caching strategy at the edge and optimise our requests accordingly. We can optionally use workers at the edge servers to accommodate more complex caching strategies, such as constructing a page from common headers and footers plus content that differs elsewhere.

Cache for Updated Contents.

The lag just described could damage users’ trust in the information they retrieve from the website, and hence must be mitigated. A more resource-intensive option is to have the edge servers regularly send requests back to the origin to ask whether anything has changed, with the origin responding to unchanged content with 304 Not Modified or another suitable HTTP status code of your choice. But this means every such request waits for a confirmation from the origin, and we cannot assume the connection between the edges and the origin is always stable.

In the method above, we are letting the edge servers constantly check with the origin; although each check is small, it is still somewhat resource intensive, especially when there is a large number of points of presence. A better way is to actively push changes to the edge servers, and depending on what the service provider or infrastructure offers, there are two ways of doing it. One is to send a purge request for the changed URLs and have the edge servers immediately fetch the newer copy (which creates a surge of traffic to the origin); the other is to purge and only fetch a new copy when someone next requests the page (so that visitor may see a slow response). It is always possible to mix these – serving a stale copy while a new one is fetched, accompanied by active fetches triggered by regular checks – but why be so complicated?

On both the origin server and the edge servers, we must correctly identify the pages that a change affects and invalidate the caches related to them. If it is just an article, then commonly that is the page itself, the archives, and possibly the homepage. However, if it is a widget, a menu, or anything else constantly reused across a range of pages, it is necessary to correctly identify every page that needs to be purged. In case we cannot identify the range, a site-wide purge may be worth considering. After a page is regenerated, we need to notify the edge servers where necessary.
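
As an illustration, the sketch below sends an active purge for each affected URL, assuming the cache in front of the origin (for example Varnish with a matching VCL rule) accepts the non-standard PURGE method; the URLs are hypothetical.

```php
<?php
// Ask the cache to drop one object by sending a PURGE request for its URL.
function purge_url(string $url): bool
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_CUSTOMREQUEST  => 'PURGE',
        CURLOPT_RETURNTRANSFER => true,
    ]);
    curl_exec($ch);
    $ok = curl_getinfo($ch, CURLINFO_HTTP_CODE) === 200;
    curl_close($ch);
    return $ok;
}

// After an article changes, purge every page it affects.
foreach (['https://wiki.example/characters/eupheria', 'https://wiki.example/'] as $url) {
    purge_url($url);
}
```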

When a large number of pages needs to be purged, it might not be feasible to recalculate all of them at once due to performance bottlenecks. One method is to queue the regeneration requests and process them in small batches through cron jobs. The downside of this is obvious – updates reach users more slowly, or some users may have to wait longer for a response.

We also need to identify pages whose contents change periodically, like a page with a calendar, or an event page whose contents differ over time. We can handle these through a combination of regular cron jobs and an action scheduler, where pages are purged and submitted for regeneration at the times needed.

Without doubt, to accommodate these change requirements we need to develop the whole system around the caching mechanisms, which is not always possible when many components are supplied by third parties. An easier and more reasonable way to handle updates for all pages, for both manually introduced and automatically triggered changes, is to introduce a time-to-live (TTL) limit for the cached copies. Pages are then regularly submitted for synchronous or asynchronous regeneration (which could be a simple renewal of the TTL if nothing has changed) through requests arriving after the current cached copy has expired.

Tuning TTL for Best Outcome

Although with all of the best practices above we would barely need to tune TTLs, in reality caching with a TTL is the most common way of handling content changes. However, using TTLs also brings an exceptionally tricky problem: how long should a cached copy of content remain valid before being re-evaluated?

Although the most suitable values will surely differ between pages and categories, the following are the factors to consider when setting appropriate TTL values:

  • The frequency of content updates: If the contents of a page or category are updated frequently, a shorter TTL should be configured. This ensures that visitors always see the latest version of your content, and reduces the chance of users getting an obsolete copy and losing trust in the content served.
  • The validity of cached content: In some cases, we have a clear idea of how long the cached content will remain valid – for example, a calendar page becomes invalid at 0000 hours each day, and an ‘event of the day’ section may need updating on the hour. For these we can determine the right validity period and set the TTL accordingly.
  • The importance of the content: If the content in question is important or frequently visited, a longer TTL could be configured. This helps to improve the performance of the website by reducing the number of requests submitted for computation.
  • The cost of computation: If a page requires extensive computation, such as a considerable number of database reads or an algorithm with high time complexity, a longer TTL should be configured to reduce the server load and spare the server these burdens.
  • The memory usage: The TTL configuration also affects the memory usage of your website. If we set a long TTL, we may end up caching a lot of content that is not frequently accessed, which wastes memory.
  • Overall server load: If the origin server is struggling to fulfil current requests, a longer TTL should be considered. Weighing obsolete data against degraded service availability, denial of service is clearly the greater harm, and increasing TTLs reduces the number of requests sent for processing.

Importance may shift once content becomes obsolete, update frequency may increase or decrease due to factors no one can predict, and computation cost may rise as more entries are added to the system. Constant changes in these factors and in visitor behaviour make the TTL a value that needs continual tuning to achieve the goals we are striving for, and we should monitor analytical data to guide that tuning.
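
As a concrete illustration of how those factors might condense into configuration, here is a small sketch of a per-category TTL map; the categories, URI prefixes, and values are entirely hypothetical starting points to be revisited against real analytics.

```php
<?php
// Map page categories to TTLs (in seconds) that reflect update frequency,
// importance, and computation cost.
const TTL_BY_CATEGORY = [
    'homepage'  => 300,      // updated often and must feel fresh
    'event'     => 600,
    'calendar'  => 3600,     // plus an explicit purge at 0000 hours via cron
    'character' => 86400,    // expensive to build, rarely changes
    'archive'   => 604800,   // essentially static
];

function ttl_for(string $uri): int
{
    if ($uri === '/')                          return TTL_BY_CATEGORY['homepage'];
    if (str_starts_with($uri, '/events/'))     return TTL_BY_CATEGORY['event'];
    if (str_starts_with($uri, '/calendar'))    return TTL_BY_CATEGORY['calendar'];
    if (str_starts_with($uri, '/characters/')) return TTL_BY_CATEGORY['character'];
    return TTL_BY_CATEGORY['archive'];
}

header('Cache-Control: public, max-age=' . ttl_for($_SERVER['REQUEST_URI']));
```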

Optimisation of Unchanging Contents

In all the sections above we have been discussing the caching of dynamic content, but we also need to take care of static content that usually does not change. This includes image, CSS, and JS files, as well as long-lived JSON and configuration files that apply at a broad level covering a wide range of pages or services.

These files usually need no computation to serve at all, so they have no need for an object cache or an outcome cache, and their TTL is (to some extent) infinite; the only thing left to do is duplicate the data and bring it physically closer to visitors, improving response time and distributing bandwidth.
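
One way to express that ‘effectively infinite’ TTL in practice is to put a content hash in the asset URL and mark responses as immutable; the helper below is a sketch with an assumed /static/ layout, not an API from any particular framework.

```php
<?php
// Versioned asset URLs: a changed file gets a brand-new URL, so the old one
// can be cached for a year without ever serving stale content.
function asset_url(string $path): string
{
    $hash = substr(md5_file(__DIR__ . '/static/' . $path), 0, 8);
    return '/static/' . $path . '?v=' . $hash;
}

// In a template:
echo '<link rel="stylesheet" href="' . asset_url('css/main.css') . '">';

// And when the static file itself is served:
header('Cache-Control: public, max-age=31536000, immutable');
```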

However, static resources have some characteristics that differ from dynamic ones and may require a different approach. Dynamic responses tend to be smaller, prioritise response time over throughput (the smallest possible latency for visitors), do not allow lossy compression, and contain little or no redundant data. In contrast, static resources tend to be larger, require high throughput, can often be compressed lossily, and contain redundant parts.

Different file sizes and network requirements mean we need different edge-server configurations for these static files, where the main task is to deliver them in the shortest time possible (achieving the highest connection speed), without worrying much about validity over time. This means we may want to serve them from a different domain and IP, and it is therefore better to ask the client to do a DNS prefetch for that domain.
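
For instance, a template could emit resource hints for a hypothetical static-assets domain so the browser resolves and connects to it early:

```php
<?php
// Resource hints for the (assumed) separate static domain, printed in <head>.
echo '<link rel="dns-prefetch" href="//static.wiki.example">' . "\n";
echo '<link rel="preconnect" href="https://static.wiki.example" crossorigin>' . "\n";
```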

There are some optimisations we can apply to this unchanging content, but the options differ for each content type.

For images, we can serve formats that are smaller in size. Currently the most widespread optimised format is WebP, and in recent years AVIF has also gained wide support across browsers. But to serve anything beyond the very old formats, we can either provide them only to compatible browsers by recognising the HTTP Accept header, or create fallback functionality through HTML annotations or JavaScript.
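
A minimal sketch of the Accept-header route follows, assuming .avif and .webp variants were generated ahead of time next to the original JPEG; the paths are illustrative, and the Vary header is what keeps caches from serving the wrong format.

```php
<?php
// Pick the best pre-generated image variant the client says it accepts.
function best_image(string $base): string
{
    $accept = $_SERVER['HTTP_ACCEPT'] ?? '';

    if (str_contains($accept, 'image/avif') && is_file($base . '.avif')) {
        return $base . '.avif';
    }
    if (str_contains($accept, 'image/webp') && is_file($base . '.webp')) {
        return $base . '.webp';
    }
    return $base . '.jpg';                        // safe fallback for older browsers
}

$file = best_image(__DIR__ . '/static/banner');

header('Vary: Accept');                            // caches must keep one copy per format
header('Content-Type: ' . mime_content_type($file));
readfile($file);
```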

To avoid large content shifts during loading, we can also create small placeholders for images and deliver them along with the dynamic response – for example a solid-colour block drawn in SVG, or a corresponding low-quality image placeholder (LQIP) embedded inline in the HTML as base64. However, generating LQIPs takes some resources, and embedding them enlarges the dynamic content served to some extent.
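
A rough LQIP sketch using PHP's GD extension is shown below; the width and JPEG quality are arbitrary assumptions to tune, and the generated data URI would normally be cached rather than rebuilt on every request.

```php
<?php
// Shrink the source image to a tiny, low-quality JPEG and inline it as a
// base64 data URI that can be embedded straight into the HTML.
function lqip(string $source, int $width = 24): string
{
    [$w, $h] = getimagesize($source);
    $src  = imagecreatefromjpeg($source);
    $tiny = imagecreatetruecolor($width, (int) round($h * $width / $w));
    imagecopyresampled($tiny, $src, 0, 0, 0, 0, imagesx($tiny), imagesy($tiny), $w, $h);

    ob_start();
    imagejpeg($tiny, null, 40);                    // low quality is the whole point
    return 'data:image/jpeg;base64,' . base64_encode(ob_get_clean());
}

// The placeholder ships with the page; the real image replaces it once loaded.
echo '<img src="' . lqip(__DIR__ . '/static/banner.jpg') . '" alt="" aria-hidden="true">';
```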

For CSS files, we can apply minification to all of them. CSS minification is the process of removing unnecessary formatting such as indentation and redundant newlines. This is a fairly safe technique, because removing these things does not affect how the file is parsed by browsers, so it behaves exactly as before. CSS minification can also be performed at the edge, as it is a service offered by many CDNs. Additionally, when the situation permits, we can shorten variable names to shave off a few more bytes.
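
To show what that ‘fairly safe’ subset amounts to, here is a deliberately crude minifier sketch that only strips comments and collapses whitespace; real minifiers (or the CDN's built-in feature) handle edge cases far more carefully, and the file names are placeholders.

```php
<?php
// Strip comments and collapse whitespace: the safe part of CSS minification.
function minify_css(string $css): string
{
    $css = preg_replace('!/\*.*?\*/!s', '', $css);          // drop comments
    $css = preg_replace('/\s+/', ' ', $css);                 // collapse whitespace
    $css = preg_replace('/\s*([{}:;,>])\s*/', '$1', $css);   // trim around punctuation
    return trim($css);
}

file_put_contents('style.min.css', minify_css(file_get_contents('style.css')));
```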

Yet it is true that not all of a CSS file's contents are needed on every page, as not all the selectors in each file exist on all pages, so there is room for further improvement, though it is not as safe as the above. We can generate minimised, reduced copies for a specific category of pages, or for some immensely popular pages where we know for sure that certain elements will never be shown; but if we accidentally delete something that is used, or is returned through indirect calls, it will cause problems. We can also inline CSS to place only the relevant rules next to the elements they style, but when multiple conflicting CSS files affect the same elements, the rendered outcome may differ from before. So, any further optimisation requires thorough testing before implementation – or things might break.

For JavaScript files, the first thing to do is likewise to remove unnecessary spaces and newlines that do not affect parsing, but what comes after is tricky. Unlike declarative CSS, JavaScript mixes imperative and declarative syntax, and the code is full of user-defined symbols such as functions and variables, not to mention that one JavaScript file may interact with other files unknown to the optimiser. Again, this is an area where we need to choose the optimiser wisely and tune it to exclude the parts of the code where it breaks things.

For audio and video files, we can consider instructing clients to fetch them in segments, and keeping the first segments of frequently requested files in memory. It is advisable to prepare multiple versions with different bitrates and sizes, so we can serve the appropriate version to different users: the best possible quality to those on high-speed connections, lower quality to those on cellular networks, and the smallest size to those with poor connections. These files should be fetched at a lower priority than the critical CSS and JS files, as they may take a long time to process and occupy a substantial proportion of the bandwidth.

And finally, there are files that can hardly be optimised beyond being served from a different edge server: binary files, already-compressed files, and those with integrity requirements. But just as with the object cache, we can always keep some popular files and an index in memory for faster access.

Summary and Conclusion

Since we have covered a lot of topics, let us go through all the things we have discussed.

  • In terms of processing and computation, in tiers:
    Object Cache -> Page Cache (Outcome Cache) -> Copies at Edge Servers
  • In terms of different kinds of resources, in tiers:
    Dynamic Cache, Static Resource Cache, and Cache of Optimised Static Resources
  • We have also discussed what we should do to deal with content updates.

Basically, the multi-tiered structure is built by reducing the need for duplicate computation and bringing data closer to users at various levels through caching, aiming to improve performance, availability, reliability, and consistency to the greatest extent possible.

Although we have covered a lot of concepts and material, there will always be many varying factors affecting how caching techniques are applied, and it will surely take a lot of practice to get a fine grip on them.
