Eruthyll Wiki April 6th post-incident report

Posted by: Chise Hachiroku - Posted on:

At 04:00 on 6 April 2023 (BST), Higan: Eruthyll was officially launched worldwide. From that moment, Eruthyll Wiki, the official wiki built in the name of C86 Academic, went through a very rough period and immediately entered a state of degraded performance. This report summarises what happened and what could be improved in the future. In this report, I will refer to the server opening time as T+0.

Part 1: What happened

Simply put, the wiki was overloaded: it ‘suffered’ a wave of requests from genuine visitors, and some requests were denied due to a lack of free processing power. This started at around T-15min. Initially the situation was not too bad, with server load at around 85% while Google Analytics reported about 900 users visiting within 30 minutes.

However, as I continued to monitor the server status, some services hosted on the same web server but unrelated to Eruthyll Wiki began failing their availability checks from T+10min. This was a bad sign, so I checked the server load again, only to find that the server was now returning 503 Service Unavailable responses to some dynamic requests (such as filters).

The situation worsened from T+30min onwards, even with some measures already taken. Serious overloading lasted until T+2hr40min, and concerning levels of overloading lasted until T+3hr40min. As I draft this report, we are back to normal operations, and editors have resumed maintaining content, writing guides, and adding new features.

Our layered caching system kicked into action and successfully maintained basic functionality for most users. Basic functionality means that any page would still open as it should, but further interactions might see delays or occasional failures. Distribution services for images and static resources maintained full capacity throughout the incident.
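
To picture how the layering behaves under load, here is a minimal Python sketch of the lookup order described above (edge cache, then object cache, then the origin). The class and function names are placeholders of my own, not the Wiki's actual code.

```python
# Minimal sketch of a layered cache lookup, purely illustrative.
# render_page() stands in for the expensive origin work (database + templating).

def render_page(path: str) -> str:
    return f"<html>rendered {path}</html>"

class LayeredCache:
    def __init__(self):
        self.edge = {}    # CDN edge: images, static resources, cached HTML
        self.object = {}  # object cache: expensive query/render results

    def get(self, path: str) -> str:
        # 1. The edge cache keeps pages opening even while the origin struggles.
        if path in self.edge:
            return self.edge[path]
        # 2. The object cache avoids recomputing recent dynamic responses.
        if path in self.object:
            return self.object[path]
        # 3. Only a miss on every layer costs origin processing power.
        body = render_page(path)
        self.object[path] = body
        self.edge[path] = body
        return body
```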

Part 2: Measures taken during the incident

Mitigating measures were taken immediately to keep the situation from worsening. Image and static CDN edge servers were instructed not to fetch new copies from the origin server, and their TTL was set to one week; a contingency resource boost was also requested from the host to allow resource overdrive for the next 24 hours. This first batch of actions prevented further service denials and kept the Wiki online.
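
In HTTP terms, that emergency edge policy roughly amounts to the headers sketched below. The real change was made in the CDN provider's control panel, so the directive names and values here are assumptions for illustration only (one week = 604,800 seconds).

```python
# Illustrative only: the actual CDN configuration is not reproduced here.
ONE_WEEK = 7 * 24 * 60 * 60  # 604800 seconds

emergency_cache_headers = {
    # Edge servers may keep serving the copy they already hold for a week
    # instead of revalidating against the overloaded origin, and may keep
    # serving a stale copy if the origin errors out.
    "Cache-Control": f"public, s-maxage={ONE_WEEK}, stale-if-error={ONE_WEEK}",
}

print(emergency_cache_headers["Cache-Control"])
```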

Later, I accessed the administrative panels and changed how some of the listing pages work, so that content that would usually be fetched through a separate API request is embedded in the first response instead. This was extremely important, as analytics showed these pages were being visited very frequently while the follow-up API requests were not being cached even though they could be.
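
The sketch below shows the shape of that change: the old listing page triggered a second, uncached API request, while the changed page embeds the list in the first (cacheable) response. The data and function names are placeholders, not the Wiki's real templates.

```python
# Hypothetical before/after of the listing-page change; placeholder data.
ITEMS = [{"name": "Example Character 1"}, {"name": "Example Character 2"}]

def listing_page_before() -> str:
    # Old behaviour: return an HTML shell and let the browser issue a
    # second API request (which was not being cached) to fill in the list.
    return "<div id='list' data-src='/api/list'></div>"

def listing_page_after() -> str:
    # New behaviour: embed the list in the first response, so each visit to
    # this heavily viewed page costs the origin at most one cacheable request.
    rows = "".join(f"<li>{item['name']}</li>" for item in ITEMS)
    return f"<ul id='list'>{rows}</ul>"
```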

Commands to stop all processes and discard pending requests were issued more than 20 times throughout the incident to clear the backlog of piled-up incoming requests.

Some configuration values were altered to limit the impact of a single stalled process or request by shortening the maximum processing time allowed. This temporary configuration change also included a more aggressive session garbage collection behaviour, which made users more likely to be signed out. These measures have been reverted as of the writing of this report.
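
The report above does not list the exact values, so the sketch below only shows the shape of that temporary change; every key name and number is an assumption of mine.

```python
# Assumed values for illustration only; not the real configuration.
normal_config = {
    "max_request_seconds": 60,         # generous limit for slow dynamic pages
    "session_gc_max_lifetime": 86400,  # sessions kept for roughly a day
}

incident_config = {
    "max_request_seconds": 10,         # cut stalled requests quickly so one
                                       # slow process cannot hold a worker
    "session_gc_max_lifetime": 1800,   # aggressive session GC; users are
                                       # more likely to be signed out
}

active_config = normal_config  # the temporary values have since been reverted
```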

Some points of presence were also shut down during the incident. All servers in Russia and Africa were pulled out of the array to maximise the use of cached contents without needing to send further requests to the origin server. This has not been reverted, but I am closely monitoring the analytical data being sent from clients. (Update: they will not be enabled again. Traffic is marginal.)

Part 3: Why this happened

The most important contributing factor is, without doubt, the unexpected surge of visits. The earlier maximum estimate, made at T-1day, was 75k page views (PVs) per day; it was revised to 100k PVs at T-1hr, and the server was prepared to take a surge of requests equivalent to 150k PVs per day if sustained. The server was also expected to handle about 1,000 unique visitors (UVs) in any given 30 minutes and was prepared to tackle a sustained surge equivalent to 1,500 UVs.

But Google Analytics’ real-time data showed these values were far exceeded. During the most significant downtime the surge was equivalent to approximately 200k to 250k PVs per day if sustained, with 1,800-2,400 UVs constantly being reported per 30 minutes. This far exceeded the capacity of our server and caused the overload. It also means that I wrongly estimated the visitor load for that day, and I must take full responsibility for this.
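
As a quick back-of-the-envelope check: only the PV and UV figures below come from the paragraphs above; the percentage arithmetic is mine.

```python
# Comparing observed load against the surge capacity the server was sized for.
planned_pv_per_day = 150_000
observed_pv_per_day = (200_000, 250_000)

planned_uv_per_30min = 1_500
observed_uv_per_30min = (1_800, 2_400)

pv_ratio = tuple(v / planned_pv_per_day for v in observed_pv_per_day)
uv_ratio = tuple(v / planned_uv_per_30min for v in observed_uv_per_30min)

print(f"PVs at {pv_ratio[0]:.0%}-{pv_ratio[1]:.0%} of planned surge capacity")
print(f"UVs at {uv_ratio[0]:.0%}-{uv_ratio[1]:.0%} of planned surge capacity")
# Roughly 133%-167% for PVs and 120%-160% for UVs: well past capacity.
```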

When an overload happens, every request becomes slower. Requests received by the server pile up into a long queue, which can eventually cause all of them to time out. Once this happens, it can easily turn into a worsening loop, where any intervention may not last.
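
The loop is easy to reproduce with a toy model. The sketch below uses arbitrary rates (not measurements from the incident) just to show that once requests arrive faster than they are served, the backlog grows without bound until every new request waits longer than the timeout.

```python
# Minimal simulation of the worsening loop; rates are assumed, not measured.
ARRIVALS_PER_SEC = 12   # incoming requests
SERVED_PER_SEC = 10     # what the overloaded server can actually process
TIMEOUT_SECONDS = 30    # client/gateway timeout

queue = 0
for second in range(1, 301):  # five minutes
    queue += ARRIVALS_PER_SEC
    queue -= min(queue, SERVED_PER_SEC)
    wait = queue / SERVED_PER_SEC  # rough wait time for a newly arrived request
    if wait > TIMEOUT_SECONDS:
        print(f"After {second}s the backlog is {queue} requests "
              f"(~{wait:.1f}s wait): every new request now times out.")
        break
```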

For the best first impression, the website also utilised a technique called ‘Lazy Load’, which splits the rendering of a webpage into two parts: load the framework first, then load the contents. For some reason, the API requests used for content loading were not being properly cached, which was a major contributing factor to the increased server load. This feature has now been disabled until the underlying problem is identified.

Part 4: The aftermath

For starters, a purchase has been made to upgrade the server to twice the processing power, twice the memory quota, and twice the multithreading capability. This has brought down the server load, and I shall continue to monitor its effectiveness.

An investigation into the Lazy Load-related API behaviours was conducted to find out why these requests were not being cached when they should have been. It turns out the plugin we use has changed its behaviour and now sends POST requests instead of GET. This matters a great deal: POST requests bypass every cache layer before the final object cache, so the server recalculates every response on demand, draining a lot of processing power.
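
The effect of that switch is easy to demonstrate. The sketch below is a generic stand-in (not our actual caching stack) for the common rule that shared caches only store safe methods such as GET and HEAD, so every POST falls through to the origin and is never stored.

```python
# Why POST instead of GET hurts: POST responses are neither served from
# nor written to the cache, so each one costs origin processing power.
CACHEABLE_METHODS = {"GET", "HEAD"}

def handle(method: str, url: str, cache: dict, origin_calls: list) -> str:
    key = (method, url)
    if method in CACHEABLE_METHODS and key in cache:
        return cache[key]            # served from cache, origin untouched
    origin_calls.append(key)         # every POST lands here...
    response = f"origin response for {url}"
    if method in CACHEABLE_METHODS:
        cache[key] = response        # ...and is never stored either
    return response

cache, origin_calls = {}, []
for _ in range(3):
    handle("GET", "/api/list", cache, origin_calls)   # 1 origin hit, then cached
for _ in range(3):
    handle("POST", "/api/list", cache, origin_calls)  # 3 origin hits
print(len(origin_calls))  # -> 4 origin hits for 6 requests
```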

As a result, Lazy Load will stay turned off for the foreseeable future. I have also changed filters to be applied on request instead of on change: this saves a lot of traffic, and the resulting GET requests can be cached properly.

An optimisation pass across the website was carried out to identify and address performance-impacting problems. These included an object cache TTL that was far too short, a caching pool that was too small, unnecessary table entries in the database that should have been removed earlier, and an overly broad auto-purge on page and post modification and publication, whose scope has been narrowed to essential pages only.
