A few days ago we got featured in LifeHacker. It was very exciting. We are very grateful to them for talking about us, but it was also kind of scary: we had been reading about other people's services getting knocked down by this kind of traffic and definitely did not want that to happen to us or to our users. Luckily, they warned us a few days in advance so we could get ready. This is the story of how we didn’t go down after being hit by a tsunami of lifehackers.
All the pieces can be upgraded and decoupled independently: we can move the messaging queue to a new, beefier server if the need arises, or serve the browser app from a different server than the API. The applications are completely stateless, allowing us to scale horizontally to as many API servers as we need. Customers’ information is held in the MongoDB cluster, safe from data loss if one server dies.
We thought most people would visit just our landing page, maybe browse the static site a bit, and that a smaller percentage of visitors would actually go to the app and set up an account. The first order of business was keeping the static site up. We also considered that taking some load off the API server would be good, and that we could achieve it by offloading the serving of the web app to another web server. This didn’t satisfy us completely in terms of API stability, so we also thought of bringing up some new web servers for the API. Finally, the DNS for the site was being served off an old shared hosting server, and we thought that might also prove troublesome.
So, the first thing we had to do was select a DNS provider. We had previous experience with Zerigo (the DNS provider for our technical partner, Tecnilogica). They run a very nice, low-latency service with a user interface that could come in handy if we needed to make changes quickly. We moved the hightrack.me zone to them and set low TTLs on all the host records so we could react to unforeseen circumstances.
After setting up the DNS, we started surveying CDN providers so we could offload the static site and the blog to them. Using a CDN was a priority since we expected the heaviest load on our weakest server. We decided to go with Incapsula. The setup proved to be more eventful than we wanted, but with the great help of their support team we had it up and running in just a few hours.
With all of this, we felt confident that everything was ready to hold the load, but we made even more adjustments. We prepared a landing page with a friendly “Hi Lifehacker, you’ve got us on our knees” message in case the influx of visitors took our application down, and readied a kill switch for new user signups. These were last-resort measures we didn’t want to use.
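A signup kill switch can be as small as a flag checked at the top of the signup handler. The post does not describe how ours was built, so the following is only a minimal sketch under assumptions: the environment variable name `SIGNUPS_DISABLED`, the `handle_signup` function, and the response messages are all hypothetical.

```python
import os

# Hypothetical flag name; the post does not say how the real switch was implemented.
SIGNUP_KILL_SWITCH_ENV = "SIGNUPS_DISABLED"


def signups_disabled(environ=os.environ):
    """Return True when the operator has flipped the kill switch."""
    value = environ.get(SIGNUP_KILL_SWITCH_ENV, "")
    return value.strip().lower() in {"1", "true", "yes"}


def handle_signup(email, environ=os.environ):
    """Return an (HTTP status, message) pair for a signup request."""
    if signups_disabled(environ):
        # 503 signals a temporary refusal; logins for existing users are untouched.
        return 503, "Signups are paused for a moment, please try again soon."
    # Real account creation would happen here (assumed, not shown).
    return 201, f"Account created for {email}"
```

The point of keeping it this simple is that it can be flipped in seconds without a deploy, which is what you want from a last-resort measure.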
Once we had tested Incapsula with the static site, we decided to use it to serve the web application, too. It also works well with SSL, either using your own certificate or letting them issue one on your behalf. Finally, we set up a new frontend server for the API and used DNS round-robin to split traffic roughly 2:1 in favor of our existing one (two requests would go to the existing server for each one going to the new one). We then sat in a room full of screens and waited.
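To make the 2-to-1 split concrete, here is a small simulation of an idealized round-robin rotation. It is a sketch, not our actual zone: the IP addresses are placeholders from the documentation ranges, and it assumes the weighting comes from the existing server appearing behind two of the three records. Real resolvers cache answers and pick records in their own ways, so the observed split is only approximately this.

```python
from collections import Counter
from itertools import cycle, islice

# Hypothetical A records: the existing server is listed twice, the new one once.
A_RECORDS = [
    ("198.51.100.10", "existing"),
    ("198.51.100.11", "existing"),
    ("203.0.113.20", "new"),
]


def simulate_round_robin(records, requests):
    """Rotate through the records in order and count hits per server."""
    rotation = islice(cycle(records), requests)
    return Counter(server for _ip, server in rotation)


counts = simulate_round_robin(A_RECORDS, 300)
print(counts["existing"], counts["new"])  # 200 100
```

Out of 300 simulated requests, 200 land on the existing server and 100 on the new one. Taking a server out of rotation means removing its record and waiting for caches to expire, which is where the low TTLs come in.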
We were running Google Analytics (real time), the Incapsula dashboard, our own KPI dashboard on Ducksboard, and several SSH sessions to the servers, showing load and logs. Having real-time information is always nice, but in these situations it becomes essential. We are also spoiled by our continuous use of Ducksboard: having all the business-related information up to date in one place is a must.
Everything went smoothly save for one thing: during the first minutes, some users complained about not being able to log in with their newly created accounts. We traced it to the new API server, so we took it out of the DNS rotation and everything went back to normal in a few minutes. Later we found out the problem came from betraying ourselves. While provisioning that API server (using Chef on an AWS instance) we ran into trouble with the Apache recipe: our recipe is custom-written for Apache 2.2, and the Apache on the server was 2.4. We decided to drop the recipe and install and configure Apache manually. Naturally, we forgot a key extension, and the application was unable to authenticate users.
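One way to avoid this trap is to make the provisioning step fail loudly on a version mismatch instead of inviting a manual workaround. The sketch below is an illustration in Python rather than part of our Chef setup; the function names are hypothetical, and it assumes the version string has the usual `Server version: Apache/x.y.z` shape printed by `apachectl -v`.

```python
import re

# The post says the custom recipe targets Apache 2.2.
SUPPORTED_SERIES = (2, 2)


def parse_apache_version(version_output):
    """Extract the version tuple from text like 'Server version: Apache/2.4.7 (Ubuntu)'."""
    match = re.search(r"Apache/(\d+)\.(\d+)(?:\.(\d+))?", version_output)
    if match is None:
        raise ValueError("could not find an Apache version string")
    return tuple(int(part) for part in match.groups() if part is not None)


def check_recipe_compatible(version_output, supported=SUPPORTED_SERIES):
    """Return the parsed version, or raise if the recipe does not target this series."""
    version = parse_apache_version(version_output)
    if version[:2] != supported:
        found = ".".join(str(part) for part in version)
        wanted = ".".join(str(part) for part in supported)
        # Stop here: fix the recipe or the image instead of configuring by hand.
        raise RuntimeError(f"recipe targets Apache {wanted}.x but found {found}")
    return version
```

A failure at this point is annoying, but it pushes the fix back into the automation, where a forgotten extension cannot slip through.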
A CDN can save your life for cheap. Incapsula service is going to cost us around and kept us alive while getting thousands of new users. Cost per user is really low.
Never, ever, betray your tools. If you have gone (as you should) down the path of automation, do not stray from it.
A good selection of partners will save you a lot of headaches later on. Our mix of Hetzner (bare metal hosting), Zerigo (DNS), Ducksboard (KPI monitoring), and AWS (spot instances) has proven to be reliable over the years. The last-minute addition of Incapsula (CDN) was a nice pick as well.
Being warned of these events in advance is nice. Everything can be set up in a few hours, but setting all this up under pressure would be a nightmare. Low TTL on your DNS is a necessary evil.