Last week, we asked for feedback on Reddit. We got a great response from ReinH with many detailed questions regarding Shelly Cloud. This post will hopefully answer most of them.
Our general approach is to shard early to avoid scaling problems. We create small clusters, each capable of handling 80 to 120 large instances. These clusters share essentially nothing.
How does (say) your git repository arch scale with new users and new repositories? There are plenty of systems like gitolite that can handle hundreds or thousands of users but absolutely fall over once you get past that point.
We use gitolite, which works great for now. We looked into storing SSH keys in LDAP, but it's too early for that, as we don't have problems with the number of keys or repositories yet.
Has your system's security been audited by a reputable third party? Systems like this are extremely difficult to secure even for experts.
No, we haven't had our system audited yet.
What are your single points of failure and how do you mitigate them? For instance, looking at your architecture diagram, I can see at least four: nginx, varnish, haproxy, and the shared file system.
Regarding single points of failure: our whole front-end is replicated. The architecture diagram is actually wrong in showing only one front-end. Each front-end machine has the same setup: nginx, varnish and haproxy. Floating IPs are used to redirect traffic in case one of the machines is down.
Are you aware that you are using three different load-balancers? Are you aware that nginx can serve static files and that varnish can act as a reverse-proxy load balancer? Why do you need all three when each duplicates the efforts of the others while introducing a new SPoF?
We wanted to use the right tool for the job. Nginx terminates SSL connections and compresses data with gzip. Varnish caches static files. We run a separate varnish process per cloud, so that one high-traffic application can't take the whole cache for itself. Behind varnish we put haproxy. We needed something fast, robust and able to limit connections per backend. This was important because we didn't want requests to queue on a single thin while other thins were available. At the time, Varnish couldn't limit connections per backend. Now that Varnish can do that, skipping haproxy is an interesting idea and we will definitely look into it.
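To illustrate why a per-backend connection limit matters, here is a minimal Python sketch of the idea, not our actual configuration. The `Balancer` class and the names `thin1` and `thin2` are hypothetical; in haproxy the equivalent knob is the per-server `maxconn` setting. With a limit of one connection per thin, a request waits at the balancer only when every thin is busy, and it goes to whichever thin frees up first.

```python
from collections import deque


class Balancer:
    """Toy model of per-backend connection limits (hypothetical names)."""

    def __init__(self, backends, maxconn=1):
        self.maxconn = maxconn
        self.active = {name: 0 for name in backends}  # open connections per backend
        self.queue = deque()  # requests waiting at the balancer, not at a backend

    def dispatch(self, request):
        """Send the request to the least-loaded backend, or queue it if all are full.

        Returns the chosen backend name, or None when the request was queued.
        """
        name = min(self.active, key=self.active.get)
        if self.active[name] >= self.maxconn:
            self.queue.append(request)
            return None
        self.active[name] += 1
        return name

    def finish(self, name):
        """Release one slot on `name` and hand it to the oldest queued request.

        Returns the request that took over the slot, or None if the queue was empty.
        """
        self.active[name] -= 1
        if self.queue:
            request = self.queue.popleft()
            self.active[name] += 1
            return request
        return None
```

For example, with two thins and `maxconn=1`, the third concurrent request is held in `queue` and is picked up by whichever thin calls `finish` first, instead of waiting behind one slow request on a fixed thin.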
If your instances are sitting on top of EC2 or some other IaaS provider, are they distributed across multiple availability zones? If not, is the metal geographically distributed?
We currently have only one geographical location, in Germany. When we grow, we will set up more locations and let users choose between them.
How is contention in your backend job processing system handled? If the job before mine stalls, will mine be locked out? How robust is your queueing or messaging system? Will it handle network partitions? Unevenly distributed arrival rates with large spikes? Unevenly distributed processing times with large spikes? Etc?
For backend job processing we use Resque. Our level of complexity doesn't require anything more. Queues are monitored, and jobs that run for too long are timed out. We make sure that queues are not overloaded.
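The timeout check can be pictured with a small Python sketch. This is only an illustration under assumptions: `RunningJob`, `overdue_jobs` and the `max_runtime` parameter are hypothetical names, and none of this is Resque's API. A monitor would call something like this periodically and kill or requeue whatever it returns.

```python
from dataclasses import dataclass


@dataclass
class RunningJob:
    job_id: str
    started_at: float  # seconds since the epoch


def overdue_jobs(running, now, max_runtime):
    """Return the ids of jobs that have been running longer than max_runtime seconds."""
    return [job.job_id for job in running if now - job.started_at > max_runtime]
```

Because a stalled job is removed after a bounded time, the jobs behind it are delayed by at most `max_runtime` rather than being locked out indefinitely.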
What kind of latency and throughput does your shared disk solution offer? How well will it scale out? Virtualized IO is notoriously slow and shared disk on top of virtualized IO is ime a recipe for disaster.
The only thing shared among instances is volumes with application files (e.g. user uploads, static files). Those volumes are served from storage servers, which are also redundant. The applications we currently host for our clients generate little disk IO, because most files get cached on the front-ends. We haven't encountered any scaling problems here because, as mentioned above, we shard early. We avoid virtualized IO for the reasons you presented.
How can you guarantee that resources are shared equitably? Can a single rogue process take over an entire shared CPU?
We use Xen for virtualization with the credit scheduler, and we set an upper limit on how much of a shared CPU an instance can take.
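As a rough illustration of what the upper limit does, the Python sketch below models a simplified version of the credit scheduler's two per-domain settings: a relative weight and a cap in percent of one physical CPU, where 0 means no cap. It assumes every instance is CPU-bound all the time, and the function name and the example numbers are hypothetical, not our production values.

```python
def cpu_shares(weights, caps, total=100.0):
    """Split `total` percent of one CPU by weight, never exceeding a cap.

    weights: {instance: relative weight}
    caps:    {instance: maximum percent, or 0 for no cap}
    """
    shares = {name: 0.0 for name in weights}
    active = set(weights)  # instances that can still accept more CPU
    remaining = total
    while active and remaining > 1e-9:
        weight_sum = sum(weights[n] for n in active)
        capped = set()
        handed_out = 0.0
        for name in active:
            # Offer each instance its weighted part of what is left.
            offer = remaining * weights[name] / weight_sum
            room = caps[name] - shares[name] if caps[name] else offer
            grant = min(offer, room)
            shares[name] += grant
            handed_out += grant
            if caps[name] and shares[name] >= caps[name] - 1e-9:
                capped.add(name)
        remaining -= handed_out
        if not capped:
            break
        # Capped instances drop out; the leftover is re-split among the rest.
        active -= capped
    return shares
```

With two equally weighted instances where one is capped at 30, the capped one gets 30% and the other can use the remaining 70%. If both are capped at 30, the rest of the CPU stays idle, so a rogue process cannot exceed its cap even on an otherwise idle machine.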
Why don't you offer a free tier? Is it because your costs scale linearly with the number of instances you have deployed? What does this say about the viability of your cost model? How can you claim that you're cheaper than Heroku or AppFog if you don't have a free tier?
We don't have a free tier because we chose this pricing model while we're bootstrapping. Once we reach a certain scale, we'll probably offer one. At the moment you can try our service with €20 of free credit and decide whether it is worth the price.