He is the CTO of a big Ukraine fashion marketplace. 10-20k orders per day. So the talk is about them surviving load spikes and so.
In 2016 they had a clojure/clojurescript/react single page app. They saw 30% more requests per second, which caused 3x the processor load. Bad news… One of the things he used was clojure.cache and picked the fast memory cache option. After finally reading the documentation, he discovered it was the cause of their problem. A cache call would fail, which would end up in a retry loop which would in effect cause almost an infite loop. Oh, and his son was only two weeks old and he was sleep-deprived. He managed to replace clojure.cache by memcached, which solved the problem.
Halloween 2017. Wife in hospital. They started losing TCP packets… The
main.js was barely loading which is bad in a single page web application
:-) The processor load on the load balancers just kept increasing. One of the
problems was the marketing department that recently added a fourth level to
the menu structure of the website. Which resulted in a 3MB json file with the
full menu. To compensate a bit, they increased the gzip level to “9” which
made it a little bit smaller. But that also meant a huge increase in the load
on the (bad) load balancer that had to do the compressing. Putting it back at
“5” solved the issue…
A regular non-busy day in july. Son is in hospital after a vaccine shot. He was also in the hospital. What can go wrong? Well, the site can go down in the night due to a DDOS attack. The solved it by doing a quick if/else on the attacker’s user agent string in the UI code…
2018, they did a pre-shopping-season load test. It turned out their database
was hit quite hard. So they used
pg_stat_statements to check all their
queries. The table with products was the one being hit hard. Which was
strange, because they cached it really well. Only… the cache wasn’t
working. They missed a key in their cache setting…
Black friday 2018 went without a hitch.
16 november 2020. Black friday just around the corner. But after a new release, the app suddenly starts eating CPU like crazy. Deploying the old release helped. But… the changes between the old and new version didn’t look suspicious. What to do? The took a profiler and started looking at the performance. It turned out some date parsing function was mostly to blame. Parsing dates? Yes, they just started a marketing promotion with a credit card provider with an offer limited to a specific date. So they added the date to the config file. And their was some tooltip showing the date. And there was some library they used that tried some 20 date formats every time… The solution? Parse the config’ed date once upon application startup…
Later they had a problem talking to the database. JVM problem? Can it be the network? Postgres driver problem? PGbouncer? Postgres itself? No. Everything seemed to be working well, only it didn’t work. 20 hours later they stopped everything and started manually executiong SQL select statements in desperation. …. and many of them stayed stuck without an error message??? In the end it was one corrupted file in postgres. So even the super-reliable postgres isn’t always perfect.
My name is Reinout van Rees and I work a lot with Python (programming language) and Django (website framework). I live in The Netherlands and I'm happily married to Annie van Rees-Kooiman.
Most of my website content is in my weblog. You can keep up to date by subscribing to the automatic feeds (for instance with Google reader):