CLOUD COMPUTING: A VERY BRIEF REVIEW – Part 1
By Frank Kane
This article is an excerpt from our “Mastering the System Design Interview” course. For this section we will dive into a brief review of the various cloud computing technologies out there, and how they connect to the system design interview.
One thing that’s changed in system design interviews is that it’s not always necessary to design things from scratch. We don’t always have to assume that you’re going to be designing your own layout of servers in your own data center. Oftentimes, you can just use an existing technology within one of those cloud service providers like Amazon Web Services or Google Cloud, or Microsoft Azure. And sometimes, that might be a perfectly appropriate thing to invoke, and it can save you some time and trouble. So, let’s get started!
Again, these are just tools in your toolbox that you can draw on during a given system design problem. I’m not going into a lot of depth here; I could spend hundreds of hours talking about each one of these services if I wanted to. The objective here is to know these services exist and you can call upon them as needed as part of your design.
I’ve made three columns in the chart above, one for Amazon Web Services, one for Google Cloud, and one for Microsoft Azure. They all have their own offerings for these basic general classes of services.
Let’s start with storage. You have to put your raw data somewhere, right? If you’re being asked to process a massive amount of data, that must start in some location. These storage services can store pretty much anything (technically they are “object stores”.) Unlike a database, they are not limited to structured data.
AWS’s storage solution is S3, the Simple Storage Service. S3 is just a place where you can store objects across the cloud within AWS. You pay based on usage and the prices are cheap. If you need to store a massive dataset, you can throw it in S3 and then use additional AWS services to process that data and impart structure to it.
Google Cloud offers cloud storage of its own, and Azure has different flavors of storage services. You can ask it for disk, blob, or data lake Storage, depending on what you’re trying to do. There’s that “data lake” term again. That is the concept of storing a massive amount of unstructured data somewhere, imparting structure to that data, and querying it as if it were structured. A data lake needs a massive storage solution like S3, Google Cloud Storage, or Microsoft Azure Data Lake Storage to store that data in the first place.
Let’s also talk about compute resources. If you need to provision individual servers and you want to have complete control over what those servers are doing, they all have solutions for that as well. Amazon offers EC2 which allows you to rent virtual machines as small or large as you want. That can even include different flavors of boxes that might focus more on GPUs than CPUs or might focus more on memory or storage speed. Whatever it is you need to optimize for, they have a specific server type you can choose from. If you’re doing deep learning, you might want to choose one of their big GPU instances to throw the most muscle you can at a big deep learning problem (they won’t be cheap, though).
Similarly, Google has Compute Engine, which is the same idea. And Microsoft Azure just calls their offering virtual machines. Every cloud provider has a solution for renting virtual machines on an as-needed basis and being charged by the hour for how much you’re using them.
If you need a big NoSQL distributed database, we can do that too. DynamoDB is the go-to solution for that on AWS. Google Cloud still calls it BigTable, and they have some more specific services for more refined use cases. Azure has something called CosmosDB or Table Storage. All three providers offer a distributed NoSQL data store that will allow massive scaling of key/value lookups.
Containers are also a big deal. If you want to deploy code to the outside world, putting that within a container is a modern operational practice. These days, Kubernetes is winning the battle versus Docker for what’s popular on the cloud services. All three services offer some sort of Kubernetes service. On AWS, they call that Kubernetes on ECR or ECS. Google Cloud also offers Kubernetes, and Azure as well.
They each offer solutions for data streaming as well. You can always just run Kafka or Spark Streaming on a compute instance or on Amazon’s Elastic MapReduce (EMR) service. But there are also managed, purpose-built services for streaming. AWS has something called Kinesis that’s used for data streaming, which integrates tightly with other AWS services. That’s just used for getting data from one place to another, and maybe transforming it and analyzing it along the way. Google Cloud calls the same thing DataFlow, and Microsoft Azure offers Stream Analytics.
We can also briefly discuss Spark and Hadoop. How would I deploy them in the public cloud? On AWS, they have something called EMR, which stands for Elastic MapReduce. The name is a bit of an anachronism because you can use it for much more than MapReduce these days. Specifically, you can also deploy Apache Spark on it, as well as other streaming technologies and analytics technologies. But the nice thing about EMR is that it manages the cluster for you. You just say, “Hey, I want a Spark cluster with this many nodes. Go create it for me”. And EMR says, “Yup, here you go. Here’s your master node. Go run your driver script here. And it’s all set up and ready for you.” EMR saves you a ton of hassle in provisioning and configuring those servers. You just get a Spark cluster that’s ready to go.
Similarly, Google Cloud has something called Dataproc, and on Azure, they have an implementation of Databricks. Databricks is a very influential company in the world of Apache Spark and a big contributor to Spark itself. If you’re a fan of Databricks, Microsoft Azure might be your platform of choice.
For larger-scale data warehousing, they all offer solutions for that as well. On AWS, we have something called Redshift. Again, you just tell it, “I want to provision a data warehouse that has this much storage capacity,” and it says, “Okay, here you go, go to town.” It also has a variant called Redshift Spectrum, which can sit on top of an S3 data lake and issue queries on unstructured data as well. Google Cloud still offers BigQuery, its original technology for distributing SQL queries or queries in general, across a massive dataset. And on Azure, we have Azure SQL or Azure Database.
Finally, let’s talk about caching. On AWS, we have something called ElastiCache, which is just a wrapper on top of Redis. And on Google Cloud, they call it Memorystore, which can be Redis or Memcached under the hood. Azure offers a Redis solution as well. It seems like Redis is winning the battle against Memcached in the public cloud. All three platforms allow you to deploy your own Redis server fleet and manage it for you.
No matter the system the goal is always the same. If you’d like to learn more about any of these cloud computing platforms before you’re systems design interview. Enroll in our courses at www.sundog-education.com
Or get all of them in our 10-Course Mega Bundle for only $50! Click here to learn more.