DevOps Interview: System design of Netflix

Let's understand the infrastructure of Netflix. Anything that involves streaming of video in this infrastructure is handled by OC (Open Connect), Netflix's own CDN.

A CDN (content delivery network) is a distributed set of servers sitting across the globe, also called caching servers.

For example, a user sitting in India wants to watch a movie, and let's say the Netflix server is hosted in the US (the origin server). Packets travelling from India to the US will obviously add latency. So Netflix has multiple servers sitting across the globe (copies of the origin server, called cache servers); when a user requests a video from India, the cache server nearest to that user's location responds. Thousands of cache servers are placed in countries around the world just to give users a better experience.


Every operation except the streaming of video is done entirely on the AWS cloud.

Netflix uses AWS Elastic Load Balancers. In this infrastructure, load balancing happens first across the availability zones using the round-robin algorithm, and then across the EC2 instances/servers within each zone using the same round-robin algorithm.
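
To illustrate the idea (this is only a sketch, not Netflix's or ELB's actual implementation), here is a minimal two-level round-robin in Java: requests are rotated across zones first, then across the instances inside the chosen zone. The zone and instance names are made up.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal two-level round-robin: pick a zone, then an instance inside that zone.
public class TwoLevelRoundRobin {
    private final List<List<String>> zones;              // each inner list = instances in one zone
    private final AtomicInteger zoneCounter = new AtomicInteger();
    private final AtomicInteger instanceCounter = new AtomicInteger();

    public TwoLevelRoundRobin(List<List<String>> zones) {
        this.zones = zones;
    }

    public String nextInstance() {
        List<String> zone = zones.get(Math.floorMod(zoneCounter.getAndIncrement(), zones.size()));
        return zone.get(Math.floorMod(instanceCounter.getAndIncrement(), zone.size()));
    }

    public static void main(String[] args) {
        TwoLevelRoundRobin lb = new TwoLevelRoundRobin(List.of(
                List.of("zone-a-ec2-1", "zone-a-ec2-2"),
                List.of("zone-b-ec2-1", "zone-b-ec2-2")));
        for (int i = 0; i < 4; i++) {
            System.out.println(lb.nextInstance());        // alternates zones, then instances within them
        }
    }
}
```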


Netflix onboards a video in a very interesting way. It does a lot of processing on the movie before delivering it to viewers, which involves transcoding (decoding and re-encoding) the videos.

First, it converts the video into different resolutions. For example, an iPhone has a different resolution than a desktop, an Android phone, or a TV, so Netflix has to make sure a video file is available that matches the client's resolution.


Also, Netflix makes some file optimizations: if my internet connection is slow, I may see some grain on the video while watching a movie, and if I am on a very fast connection, the movie should play in 1080p or 4K. To make this happen, Netflix has to create approximately 1,000 copies of the same video file (144p, 240p, 360p, etc.).
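
As a rough illustration of that idea, the sketch below builds one transcode job per rung of a resolution/bitrate ladder for a single source file. The ladder values, file name, and job runner are hypothetical, not Netflix's real encoding pipeline.

```java
import java.util.List;

// Hypothetical encoding ladder: one output file per (resolution, bitrate) pair.
public class EncodingLadder {
    record Rendition(String name, int height, int kbps) {}

    static final List<Rendition> LADDER = List.of(
            new Rendition("144p", 144, 200),
            new Rendition("240p", 240, 400),
            new Rendition("360p", 360, 800),
            new Rendition("720p", 720, 2500),
            new Rendition("1080p", 1080, 5000),
            new Rendition("4k", 2160, 16000));

    public static void main(String[] args) {
        String source = "movie-master.mp4";              // hypothetical source file
        for (Rendition r : LADDER) {
            // A real pipeline would submit a transcode job here; we just print what would be produced.
            System.out.printf("transcode %s -> %s at %dp / %d kbps%n",
                    source, r.name(), r.height(), r.kbps());
        }
    }
}
```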


If a movie file of 1 TB comes in for upload, that file is first broken into chunks of data; each chunk is processed from a queue, and then workers (here, workers are simply AWS EC2 servers) merge those chunks and push the result to AWS S3. The workers push roughly 1,000 files for the same movie (one file per resolution, e.g. one file for 144p, another for 240p, and so on).
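
Here is a minimal sketch of that chunk-and-merge flow, assuming a hypothetical uploadToS3 helper in place of the real AWS SDK: chunks go onto a queue, worker threads drain it, and the merged result is pushed to storage.

```java
import java.util.concurrent.*;

// Sketch of the chunk -> queue -> workers -> merge -> S3 flow described above.
public class ChunkPipeline {
    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < 10; i++) {
            queue.add("chunk-" + i);                            // chunks of the uploaded movie file
        }

        ExecutorService workers = Executors.newFixedThreadPool(4);     // "workers" = EC2 servers in the article
        ConcurrentLinkedQueue<String> processed = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < 4; i++) {
            workers.submit(() -> {
                String chunk;
                while ((chunk = queue.poll()) != null) {
                    processed.add(chunk.toUpperCase());          // stand-in for real per-chunk processing
                }
            });
        }
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.MINUTES);

        String merged = String.join(",", processed);             // merge the processed chunks
        uploadToS3("movies/film-720p.mp4", merged);              // hypothetical upload helper
    }

    // Placeholder for a real S3 upload (e.g. via the AWS SDK's S3 client).
    static void uploadToS3(String key, String body) {
        System.out.println("uploading " + body.length() + " bytes to s3://bucket/" + key);
    }
}
```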


Now that the movie's data is present on AWS S3, it next has to go onto the Open Connect servers, which are present in different locations across the globe. Once it is uploaded to those OC servers and a user logged into Netflix clicks on a movie to stream, the client figures out the best Open Connect server to stream the video from. It doesn't end there: the client constantly keeps checking for another Open Connect server that can give the viewer a better streaming experience.
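
A toy sketch of that client-side steering idea, with made-up server names and throughput measurements: the client periodically re-ranks the candidate OC servers and switches if a better one shows up.

```java
import java.util.Map;

// Toy client-side steering: always stream from the OC server with the best measured throughput.
public class OcServerPicker {
    public static String pickBest(Map<String, Double> throughputMbps) {
        return throughputMbps.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow();
    }

    public static void main(String[] args) {
        // Hypothetical measurements the client would refresh periodically while playing.
        Map<String, Double> measured = Map.of(
                "oc-mumbai-1", 42.0,
                "oc-chennai-2", 55.5,
                "oc-singapore-1", 31.2);
        System.out.println("stream from " + pickBest(measured));   // prints oc-chennai-2
    }
}
```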


Depending on your searches and your watch history, Netflix has machine learning algorithms running on AWS that analyse this data and suggest the content you are most likely to want to watch, based on your past watch history and the keywords you have typed.


The next component used in the infrastructure is called Zuul. Zuul is an API gateway application. It handles all requests and performs dynamic routing to the microservice applications, working as a front door for every request. It is also known as the Edge Server. Zuul is built to enable dynamic routing, monitoring, resiliency, and security.


So basically, the Zuul gateway is used by Netflix for many things, like sharding traffic, canary deployments, and defining filter-based request handling.
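
For a flavour of what filter-based handling looks like, here is a small sketch using the open-source Zuul 1 filter API (ZuulFilter and RequestContext). Netflix's current gateway is Zuul 2 on Netty, so treat this only as an illustration; the header names are made up.

```java
import com.netflix.zuul.ZuulFilter;
import com.netflix.zuul.context.RequestContext;

// A "pre" filter runs before routing; it can inspect, tag, or reject the request.
public class DeviceHeaderFilter extends ZuulFilter {
    @Override public String filterType() { return "pre"; }   // pre, route, post, or error
    @Override public int filterOrder() { return 1; }
    @Override public boolean shouldFilter() { return true; }

    @Override public Object run() {
        RequestContext ctx = RequestContext.getCurrentContext();
        String device = ctx.getRequest().getHeader("X-Device-Type");   // hypothetical header
        ctx.addZuulRequestHeader("X-Routed-By", "zuul-sketch");        // tag the request forwarded to the origin
        System.out.println("routing request for device: " + device);
        return null;   // the return value is ignored by Zuul 1
    }
}
```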


The first request will come to the Netty proxy server (Netty is a networking library for Java, designed to support developers who want to build applications such as web servers, chat servers, message brokers, or any other networking application).


What is the Netty proxy server?

For example, an HttpDecoder reads binary data from the inbound side and decodes it into an HttpRequest, while an HttpEncoder receives an HttpResponse on the outbound side and encodes it into binary. So if we want an HTTP server, we can define an HttpCodec that contains the HttpDecoder and HttpEncoder, plus a ServerHandler that reads the HttpRequest from the inbound side and writes the HttpResponse on the outbound side.
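
A minimal Netty HTTP server along those lines is sketched below (HttpServerCodec bundles the HTTP decoder and encoder, and the inner handler plays the ServerHandler role). This is a generic Netty example, not Netflix's Zuul configuration; the port and response body are arbitrary.

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.buffer.Unpooled;
import io.netty.channel.*;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.codec.http.*;
import java.nio.charset.StandardCharsets;

public class MiniHttpServer {
    public static void main(String[] args) throws InterruptedException {
        NioEventLoopGroup group = new NioEventLoopGroup();
        try {
            new ServerBootstrap()
                .group(group)
                .channel(NioServerSocketChannel.class)
                .childHandler(new ChannelInitializer<SocketChannel>() {
                    @Override protected void initChannel(SocketChannel ch) {
                        ch.pipeline().addLast(
                            new HttpServerCodec(),               // HttpRequestDecoder + HttpResponseEncoder
                            new HttpObjectAggregator(1048576),   // assemble a FullHttpRequest
                            new SimpleChannelInboundHandler<FullHttpRequest>() {
                                @Override protected void channelRead0(ChannelHandlerContext ctx, FullHttpRequest req) {
                                    FullHttpResponse res = new DefaultFullHttpResponse(
                                        HttpVersion.HTTP_1_1, HttpResponseStatus.OK,
                                        Unpooled.copiedBuffer("hello\n", StandardCharsets.UTF_8));
                                    res.headers().set(HttpHeaderNames.CONTENT_LENGTH, res.content().readableBytes());
                                    ctx.writeAndFlush(res);      // the encoder turns this back into bytes
                                }
                            });
                    }
                })
                .bind(8080).sync().channel().closeFuture().sync();
        } finally {
            group.shutdownGracefully();
        }
    }
}
```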


The components used for communication between the Netty server and the microservices are:

Inbound filter

Endpoint filter

Outbound filter


The first request comes to the Netty server, which passes it to the inbound filters; these are used for authentication, routing, and so on. The inbound filters pass the request on to the endpoint filter. Based on its rules, the endpoint filter either sends a static response straight back to the outbound filter or forwards the request to the microservices to get a response. Finally, the outbound filter performs operations such as compressing (zipping) the response and hands it back to the Netty server, which in turn returns the response to the client.
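
Reduced to plain Java, that flow reads roughly like the sketch below; these are not Zuul's real interfaces, the method names simply mirror the three filters listed above.

```java
// Sketch of the inbound -> endpoint -> outbound flow described above.
public class FilterChainSketch {
    static String inboundFilter(String request) {
        return request + " [authenticated, routed]";             // authentication, routing, etc.
    }

    static String endpointFilter(String request) {
        if (request.contains("/health")) {
            return "static: OK";                                 // static response, no microservice call
        }
        return callMicroservice(request);                        // otherwise proxy to a microservice
    }

    static String outboundFilter(String response) {
        return "gzip(" + response + ")";                         // e.g. compress the response
    }

    static String callMicroservice(String request) {
        return "response for " + request;                        // stand-in for the real service call
    }

    public static void main(String[] args) {
        String request = "GET /titles/popular";
        String response = outboundFilter(endpointFilter(inboundFilter(request)));
        System.out.println(response);                            // what the Netty server returns to the client
    }
}
```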


When I have multiple microservices, some of which depend on others, and one of those microservices is crashing or not responding, the whole response to the calling microservice will fail or time out. To tackle this scenario, Netflix uses Hystrix.


Let's start with an example. Here, an enterprise portal service calls another service, the employee information service, to fetch employee-related information from an employee directory based on search terms. For the portal service, the employee information service is a point of access to a remote system, which creates a possible point of failure and can cause a cascading failure of the portal service as well. Preventing that scenario, so that the portal service stays resilient even when the employee information service is down, is something we can achieve via Hystrix.


Hystrix achieves this by adding a few features: it gracefully kills an API request after a few seconds if the API does not respond within the specified time, it does not send requests to services whose thread pool is already full, and it lets you set up a default fallback response.
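
Here is a minimal HystrixCommand along those lines using the open-source Hystrix library; the timeout value and the body of the service call are placeholders.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;

// Wraps the remote "employee information service" call with a timeout and a fallback.
public class EmployeeInfoCommand extends HystrixCommand<String> {
    private final String searchTerm;

    public EmployeeInfoCommand(String searchTerm) {
        super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("EmployeeInfo"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        .withExecutionTimeoutInMilliseconds(2000)));   // kill the call after 2s (placeholder value)
        this.searchTerm = searchTerm;
    }

    @Override
    protected String run() {
        // Real code would call the employee information service over the network here.
        return "employees matching " + searchTerm;
    }

    @Override
    protected String getFallback() {
        // Returned when the call times out, errors, or the thread pool / circuit is saturated.
        return "employee directory temporarily unavailable";
    }

    public static void main(String[] args) {
        System.out.println(new EmployeeInfoCommand("smith").execute());
    }
}
```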


Netflix uses a microservices architecture in which there are many microservices communicating with their dependent microservices, and the list goes on. So what they do here is separate the microservices into two groups: normal-endpoint microservices and critical-endpoint microservices. For example, when a user logs in to Netflix, he or she should at least see the home screen and a few videos; the services behind this are the critical-endpoint microservices, which are mandatory and critical, and they are also scaled accordingly.


For caching, Netflix uses something called EVCache, its custom caching tool built on top of Memcached. The interesting part here is that they cache on SSDs rather than entirely in memory. Although SSD reads are not as fast as memory, they are much faster than a normal HDD, so this way they save a lot of money while keeping good performance.
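
EVCache itself is Netflix's own tooling, but since it builds on the Memcached protocol, the basic read-through pattern can be illustrated with a plain Memcached client (spymemcached below, standing in for EVCache); the key, TTL, and endpoint are made up.

```java
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

// Read-through cache sketch against a local Memcached; EVCache adds zone-aware replication on top of this idea.
public class CacheSketch {
    public static void main(String[] args) throws Exception {
        MemcachedClient cache = new MemcachedClient(new InetSocketAddress("localhost", 11211));

        String key = "user:42:homepage";                 // hypothetical cache key
        Object cached = cache.get(key);
        if (cached == null) {
            String fresh = loadFromService();            // fall back to the backing service / database
            cache.set(key, 300, fresh);                  // cache it for 5 minutes
            cached = fresh;
        }
        System.out.println(cached);
        cache.shutdown();
    }

    static String loadFromService() {
        return "home screen rows for user 42";           // stand-in for the real call
    }
}
```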


Netflix uses MySQL as its RDBMS and Cassandra as its NoSQL database. All transaction-related information is stored in MySQL databases that run on the InnoDB engine, in a master-master setup: a write query to the main master updates the secondary master, and both masters also have many read replicas, which is how high availability is maintained. Everything other than billing details and user information, which is big-data scale, is stored in Cassandra.
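
To make the read-replica idea concrete, here is a tiny JDBC sketch that sends writes to a master and reads to a replica. The hostnames, credentials, and table are placeholders, and a real setup would use a connection pool and a proper routing layer rather than hard-coded URLs.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;

// Toy read/write split: writes go to a master, reads go to a read replica.
public class BillingDao {
    private static final String MASTER_URL  = "jdbc:mysql://mysql-master:3306/billing";   // placeholder host
    private static final String REPLICA_URL = "jdbc:mysql://mysql-replica:3306/billing";  // placeholder host

    static Connection writeConnection() throws SQLException {
        return DriverManager.getConnection(MASTER_URL, "app", "secret");
    }

    static Connection readConnection() throws SQLException {
        return DriverManager.getConnection(REPLICA_URL, "app", "secret");
    }

    public static void main(String[] args) throws SQLException {
        try (Connection write = writeConnection()) {
            write.createStatement().executeUpdate(
                    "INSERT INTO invoices(user_id, amount) VALUES (42, 9.99)");
        }
        try (Connection read = readConnection()) {
            ResultSet rs = read.createStatement().executeQuery(
                    "SELECT amount FROM invoices WHERE user_id = 42");
            while (rs.next()) {
                System.out.println(rs.getBigDecimal("amount"));
            }
        }
    }
}
```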


All the logs and events from the different parts of the system are pushed to Apache Chukwa (an open-source data collection system for monitoring large distributed systems, built on top of the Hadoop Distributed File System (HDFS) and the Map/Reduce framework). Chukwa provides a dashboard for log analytics, and it also forwards the same data to S3, with another copy sent to Kafka topics. From Kafka, the data can be filtered and pushed to Elasticsearch.
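
As a small illustration of the Kafka leg of that pipeline, here is a plain Kafka producer publishing one log event to a topic, from where a consumer could filter it and index it into Elasticsearch. The broker address, topic name, and payload are made up, and the Chukwa and Elasticsearch sides are omitted.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Publish one log event to a Kafka topic.
public class LogEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String event = "{\"level\":\"ERROR\",\"service\":\"playback\",\"msg\":\"stream stalled\"}";
            producer.send(new ProducerRecord<>("playback-logs", "device-123", event));   // topic, key, value
        }
    }
}
```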


Kibana and Grafana are used to visualise the data in Elasticsearch; they provide nice, clean dashboards. Netflix runs approximately 3,500 instances across roughly 150 Elasticsearch clusters. When a user faces an issue or error while playing a video, or anything else, the team can debug and troubleshoot it just by looking at these dashboards.


A big thank you to the Tech Dummies YouTube channel for explaining this architecture.
