The Good, The Bad, and the League: 11/2 - 11/15
_Your semi-weekly dose of server problem-os, NA League news, and other! (Moved to Mondays for easier updates)_
Worlds is over, True Damage (and Giants) launched, TFT Set 2 is live and ranked is near, and the end of the season is basically tonight, with a lot of last-minute grinding. Buckle up, the year is almost over!
* **Mac voice still not working properly** Bug grouped with other Mac bugs. Something to do with .plists.
* **Project Akali Rose Quartz Chroma disabled(10/25 - 11/8)** The NOC is notified via email that Project Akali’s Rose Quartz Chroma has missing assets, which cause invisible animations in game. Considering this has some game-breaking issues, the NOC disables that particular Chroma until the assets are fixed.
This bug is fixed by adding the chroma assets back in patch 9.22.
**Server Stuff:**
* **Impacted Match History(11/02, ~9 hours)** Automated alerting notices that Match History values aren’t loading properly. NOC disables Match History while triage teams start to diagnose why the system isn’t uploading properly. Live Producers are called, and the root cause is solved after several hours of analysis.
* **Delayed game stat processing(11/2, ~45 minutes)** Automated alerting notifies the NOC that game stat processing is starting to fall behind due to a high volume of incoming stats. The NOC doesn’t escalate and instead watches to see if the lag reaches a critical state. It does not, and the system catches up to real time in an acceptable time frame.
* **Delayed game stat processing(11/3, ~270 minutes)** Automated alerting notifies the NOC that data volume for match history is too high. Volume exceeds the ‘wait it out’ expectation, and teams are notified as the NOC disables Match History and enables tickers in the client. This backlog exists until after peak time, at which point the system is caught up and Match History is re-enabled.
* **Network Outage causes Leagues issues(11/6, ~4 hours)** An automated alert from a network deploy notifies the NOC and other parties that a specific program is broken. The network deploy causes a set of interconnected systems to fail, requiring a full triage of all related services on NA. Various queues (Logins and ranked) are disabled while the systems undergo recovery.
* **Honor system causes messaging spike(11/18, ~6 hours)** Automated alerts identify a problem with Honor services. Engineers notify the NOC that they want to perform a rolling restart on the servers hosting the services. The rolling restart is completed, and the Honor system recovers.
* **Error with Freljord Tribe bundle(11/08, ~30 minutes)** Rioters ping the NOC and store teams with reports of arena skin bundle purchases timing out. Investigation shows the bundle purchases aren’t completing, and re-initiating the bundle in the database solves the problem.
* **Games failing to upload, impacting Match History(11/1, ~125 minutes)** Match History stops working as the data being transmitted/stored is overloading the allocated bandwidth. Adjustments are made to the service to fix the backlog of games waiting to be stored.
* **Change to a service prevents confirmation email when buying RP(11/11, ~64 minutes)** NOC is notified that a deploy taking place is causing issues with emails. In response, a rollback of the deploy is made, which fixes the email problem.
* **Loot disabled due to insufficient RAM total(11/13, ~3 hours)** NOC is notified that Loot is having problems. The store and loot tabs are disabled, and engineering teams start to diagnose why the systems are having problems. Investigation points to Loot and not the store, which leads to a DB change approved earlier in the day. The DB change wasn’t made to the correct specs, and has since been fixed.
* **Store disabled due to players being unable to purchase content(11/13, ~2 hours)** NOC is notified that players are unable to purchase content within the store. The store is quickly disabled, and the investigation shows the store isn’t the problem (Loot was; this issue is related to the Loot problem in the previous item). Store is re-enabled.
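For the curious, here is a rough Python sketch of the "watch it or escalate" call described in the two delayed game stat processing items above. The thresholds and names are made up for illustration; the post doesn't say what the NOC's actual limits or tooling look like.

```python
from enum import Enum


class Action(Enum):
    WATCH = "watch"          # keep monitoring, no escalation
    ESCALATE = "escalate"    # disable Match History, enable tickers, notify teams
    RECOVERED = "recovered"  # processing has caught up to real time


# Hypothetical thresholds, chosen only for illustration.
CRITICAL_LAG_MIN = 60   # lag behind real time (minutes) treated as critical
WAIT_IT_OUT_MIN = 90    # how long a non-critical backlog is allowed to persist


def triage_backlog(lag_min: float, backlog_age_min: float) -> Action:
    """Decide what to do with a stat-processing backlog.

    lag_min: how far processing is behind real time, in minutes.
    backlog_age_min: how long the backlog has existed, in minutes.
    """
    if lag_min <= 0:
        return Action.RECOVERED
    if lag_min >= CRITICAL_LAG_MIN or backlog_age_min > WAIT_IT_OUT_MIN:
        return Action.ESCALATE
    return Action.WATCH
```

With these made-up numbers, a small backlog that is 45 minutes old stays in "watch", while one that has dragged on for 270 minutes gets escalated.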
**Game Stuff:**
* **One Game server host unreachable, causing issues(11/9, ~4 hours)** The NOC is notified that automated comp mode isn’t running on a specific game host. Player Support reports ghost games on that specific host as well. The NOC investigation shows the game server crashed and never fully recovered. The server is restarted and re-initialized, after which it recovers and services are restarted on that host.
* **One Game server host causing higher reconnects(11/11, ~70 minutes)** The NOC is notified that one game server host has higher than normal numbers of reconnects. The NOC disables future games on that server to allow the server to drain. Once drained, the server is disabled and added to the check-list of stuff to look at for the next hardware maintenance cycle.
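The "drain" step in the last item can be pictured with a small Python sketch: stop placing new games on the host, let the running ones finish, then take the host out of rotation. Everything here (the `GameHost` class, its fields, the helper names) is hypothetical and only illustrates the idea, not how the actual game server fleet is managed.

```python
class GameHost:
    """Hypothetical stand-in for a game server host."""

    def __init__(self, name, active_games=0):
        self.name = name
        self.active_games = active_games
        self.accepting_new_games = True
        self.enabled = True


def pick_host(hosts):
    """Pick the least-loaded host that still accepts new games, or None."""
    candidates = [h for h in hosts if h.enabled and h.accepting_new_games]
    if not candidates:
        return None
    return min(candidates, key=lambda h: h.active_games)


def start_drain(host):
    """Stop scheduling new games; games already running keep going."""
    host.accepting_new_games = False


def finish_drain_if_empty(host):
    """Disable a draining host once its last game has ended."""
    if not host.accepting_new_games and host.active_games == 0:
        host.enabled = False
        return True
    return False
```

Once `start_drain` is called on a host, `pick_host` skips it, and `finish_drain_if_empty` only disables it after its game count reaches zero.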