With BFCM 2025 ~48 days away, it looks like it is on track to break the previous year’s ceiling again. I am sure all of us in this ecosystem are running through some checklist or the other to maximize it for our merchants. This will be my 10th BFCM, and it is no longer just the weekend; it feels like BFCM week, or BFCM month, or even BFCM quarter for some of our merchants 😅
Perhaps a good time to revisit some of my earliest lessons on the path to scaling Swym for our merchants and partners. Hope it helps, or at least lets you share in the fun we had early on.
It was Nov 9, 2016. Daily merchant growth at Swym was exploding like never before. It was a day of SaaS-y unlearning that I cherish and look back on. Some background context I usually don’t spend time talking/writing about, but that will be useful for sharing my learnings - prior to Swym, I was at CloudPact for a good chunk of time, where we built Mowbly (acquired by ASG, which was acquired by Rocket Software). Mowbly was an end-to-end cross-platform MADP. The entire team size was sub-10 folks for most of my tenure 😄, stretched to the limits (by choice) to the point that even I have done solo sales prospecting meets, complete with exchanging visiting cards in the traditional sales-y etiquette. Yup, one doesn’t hand a card off just like that, which of course I did, and for some reason that lead never went anywhere 🤔 - clearly no Dwight Schrute even if I tried.
Coming back to my comfort zone, i.e. technology - it was a PaaS offering, not a SaaS: build, run, host and manage mobile/cross-platform apps - sort of an amalgamation of MDM, MAM and RAD tooling with OTA updates. Each customer/enterprise was a full tenant, i.e. each tenant occupied the whole rental unit: separate, not shared, infra. There was a cloud offering hosted on Google Cloud’s App Engine (for which we also won/featured in a few early YourStory Cloud Conclave events back in 2011 and pre-SaaSBoomi ones like iSPIRT). Some enterprises chose the cloud, but most deployed on their own air-gapped in-house bare metal servers with their DMZs and other fun stuff. Every tenant was managed with almost entirely their own resources, and more specifically served by their own dedicated database, compute, etc. We probably had 100s to 1000s of such tenants running. (Fun fact for the unfamiliar - DMZ (Demilitarized Zone) is also the terminology used for a buffer zone in areas with military conflict 😳.)
Now back to Swym — in 2016, we were building the early pieces of our SaaS stack, aimed at enabling merchants to give their shoppers a great experience across channels. Today we power over 46K brands across the world, but in 2016 we were in the order of the early 1000s, a few months before we joined the Techstars Seattle class of 2017. Drawing on my past expertise in designing scalable solutions, I went about making certain decisions, and one of those had to be unwound on Nov 9, 2016. We were an extremely tiny team back then, ~5 — some of us were “part-time” but committed full-time 🙇♂️. Out of those 5, most were up until ~3:57 am on Nov 8. And for some reason, some of those same folks were up again at ~7 am. Clearly not for the faint-hearted, and clarifying upfront — I wasn’t one of those people.
I’ll try and run through the events based on our Slack messages from that day (and the days leading up to it), to avoid misremembering anything beyond the exchanges that happened over calls.
Nov 8 2016 7:16 am
A merchant pinged us stating that they were getting this error:
502 - Web server received an invalid response while acting as a gateway or proxy server. There is a problem with the page you are looking for, and it cannot be displayed. When the Web server (while acting as a gateway or proxy) contacted the upstream content server, it received an invalid response from the content server.
Context - We had an unrelated DNS outage the previous night, hence the 3:57 am messages, but it was resolved. So we let it settle in for a bit longer since the errors were intermittent.
Then we discussed our sync-up plans - even back in 2016 getting all of us in one place at the same time was an accomplishment in itself. Remote FTW!
Things stayed quiet in terms of errors and whatever alarms we had running for the rest of the day, until the night.
Nov 8 2016 10:07 pm
Merchant drop-offs were trending upwards - something was fishy. We tried to repro and thankfully found that something was clogging up the service resources very, very quickly, leading to the same 502 above.
At that point, our options were to maybe increase the VM size? Or maybe add another node? But applying a resolution like that would just be a mask; we needed a reasonable hypothesis.
We dug deeper into each functional step, and fortunately we spotted the resource crunch quickly. The picture wasn’t completely clear yet; we had isolated the erring service, and then we began sifting even deeper through its resource consumption patterns.
Perhaps the traffic was spiking up? No, no evidence for that, especially not concurrently.
Was it a bot? No, it couldn’t be. But there was a bot running at that same moment. Distracting us from the real cause? Perhaps.
Nov 8 2016 10:43 pm (faster typing, words > sentences)
502 again. CPU at 100% - bingo!
Memory leaks? No - CPU.
How about the processes? DB? No, that’s under 22%
No wait, CPU dropped to 0 on all of them. Fishy...
Okay let’s work on what we know - our logging was excessive, let’s cut that down
The drop-off rates matched the CPU charts we were seeing. That was good-ish.
We were narrowing in on the culprit - the db process - and building up a hypothesis that supported it. It was blocking everything.
I was against any hypothesis around the database, i.e. functional/feature problems, and kept looking for any alternative theory other than the db. Why? Because that would mean fundamental design mistakes had been made, i.e. I had made bad choices. So I wanted to be super certain there could be no other reason.
At that point, we switched from English words to just direct Clojure on Slack comms.
(if existing_site? (go to dashboard) (continue here with setup))
Along with the above modification, we upgraded the node capacity and reduced logging — let’s see 🤞
Nov 8 2016 11:50 pm
What routes were suspect? Which method or webhook? We landed on a couple.
At the same time — 74% CPU on the upgraded nodes.
This was going to be a long night. (nevertheless we were prepped to bring our best 🙌)
Nov 8 2016 11:54 pm
Should I upload the new code?
Your call.
*Gulps* (not visible on slack :P)
Amidst all that, one of us connected to the dev instance and proclaimed “wait a min...”. But no lucky breaks that night.
Nov 9 2016 12:06 am
mysqld is taking more than 100% CPU
There it was, plain and simple.
A lot of queries with “Waiting for table level lock”
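For the curious, here is the kind of check that surfaces this. It is a minimal sketch and not our actual tooling - it assumes clojure.java.jdbc and a MySQL db-spec called db, reads the processlist, and keeps only the connections stuck in that lock-wait state.

(require '[clojure.java.jdbc :as jdbc])

;; Illustrative only: list the id, wait time and SQL of every connection
;; currently in the "Waiting for table level lock" state.
(defn lock-waiters [db]
  (->> (jdbc/query db ["SHOW FULL PROCESSLIST"])
       (filter #(= "Waiting for table level lock" (:state %)))
       (map #(select-keys % [:id :time :info]))))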
At this stage, we jumped on a quick call to decide whether to fight it now or patch and punt it for later. We applied a couple of patches and decided to declare an official downtime to give ourselves space to analyze further.
We pretty much shut shop for new merchants. It was a gutting feeling 😞
We needed a break and some distance from the problem, so there were quite a few unsure moments till sunrise.
Nov 9 2016 9:31 am
Our rockstar dev (who is this, you wonder? Not me, for sure. IYKYK 😄) had figured out the methods/routes behind the overload and was adding more diagnostics.
We wanted to move the MySQL tables to the InnoDB engine, but come on, table-level locks still can’t be a thing in 2016, no? All the relevant indexes had already been added for the major queries as part of our first deployment - a real “what is going on” feeling.
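For context, table-level lock waits point at a table-locking storage engine (typically MyISAM, though our Slack notes from that night don’t name it), whereas InnoDB takes row-level locks. A hedged sketch of the check and the conversion, with illustrative names rather than our real schema:

(require '[clojure.java.jdbc :as jdbc])

;; List each table's storage engine so the table-locking ones stand out.
(defn table-engines [db db-name]
  (jdbc/query db
    ["SELECT table_name, engine FROM information_schema.tables WHERE table_schema = ?"
     db-name]))

;; Rebuild a table on InnoDB. ALTER TABLE copies the table as it converts,
;; so large tables need a maintenance window.
(defn convert-to-innodb! [db table-name]
  (jdbc/execute! db [(str "ALTER TABLE " table-name " ENGINE=InnoDB")]))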
Over the sync-up call, we argued — well, it was mostly me arguing why it wasn’t a problem even though it most definitely was the problem. It was a challenging call for all of us. We had no new merchants at a time when we were just starting to gain momentum.
The crux of our problem: when we implemented that system, I had insisted on a design that generated a new table namespace for each new merchant we signed up - clearly that made sense to me based on how I had seen systems scale (i.e. the CloudPact/Mowbly context from earlier). That was pretty much the blunder that caused the downtime - loading more table definitions for every merchant we acquired was disastrous at our scale. The scale was tens of thousands of merchants, not the 100s/1000s of tenants I had been thinking in terms of earlier.
The realization and conclusion above came about as I was cycling my usual 5K to work (i.e. our 4-seater space at the NASSCOM 10K Startup Warehouse in Diamond District). It was clear — I had to unlearn that path to scale. The costs and the value were very, very different from what I was familiar with. It wasn’t something I needed to be told; I had to adapt to the problem at hand. Not a great feeling, but the correct one to put myself through. The cycling to office was part of the process as well.
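To make that unlearning concrete, here is a rough sketch of the two shapes, in the same Clojure-and-MySQL spirit. It is illustrative only: the wishlist table, the function names and the db-spec are stand-ins, not Swym’s actual schema or code.

(require '[clojure.java.jdbc :as jdbc])

;; The shape I had insisted on: a fresh table namespace per merchant.
;; At tens of thousands of merchants this multiplies table definitions,
;; open file handles and table-cache pressure on a single MySQL instance.
(defn onboard-merchant-with-own-tables! [db merchant-id]
  (jdbc/execute! db
    [(str "CREATE TABLE IF NOT EXISTS wishlist_" merchant-id
          " (shopper_id VARCHAR(64), product_id VARCHAR(64),"
          " added_at TIMESTAMP, PRIMARY KEY (shopper_id, product_id))")]))

;; The shared shape we moved toward: one table for all merchants, tenant
;; keyed by a merchant_id column, so a new merchant is just new rows.
(defn add-to-wishlist! [db merchant-id shopper-id product-id]
  (jdbc/insert! db :wishlist {:merchant_id merchant-id
                              :shopper_id  shopper-id
                              :product_id  product-id}))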
Nov 9 2016 10:11 am
When I stepped in to restart the discussion, we were all in agreement somehow, even though I had argued for the opposite just 30 mins earlier.
That was another moment I have come to cherish — despite how wrong I was and the heated debate, all of us moved on to what was important - getting us back LIVE. Super grateful to be surrounded by such folks, both at Swym and CloudPact before that.
Getting back to next steps - we quickly converged on a clear 7-step plan including the table namespace changes, logging changes, archival of spam namespaces, and deployment. 7 steps = at least 1 step per hour, no? 😁
Nov 9 2016 3:37 pm
Pushing to prod now
Any objections?
Nope
All done, CPU at <5%
Alright, alright, alright - We were back. A lesson unlearnt, a new lesson learnt, a lesson implemented.
It was time to get back on the mountain of other things that we were ignoring 🏃♂️.
The last Slack message for Nov 9 2016 was at 9:29 pm. But then again, the first channel message on Nov 10 came at 3:52 am - back at it ⏲️
Now, back to 2025: we have completely eliminated the system above, including the self-hosted MySQL instance in question. That system ran our Shopify integration and the extensions connecting back to our platform. We have moved all of that to a unified design that takes advantage of the latest Shopify extensibility and TOML-based deployments, including the new dev dashboard + CLI.
Speaking of November, here is the latest from our team to enable you to maximize the rest of 2025 with the holiday season imminent
May the force be with all of us this BFCM 🙌✌️