Scaling WebRTC applications is important if you want to use them for large events with heavy traffic. This article, contributed by Anton Venma, covers how to implement scalable WebRTC solutions.
“How do I scale this?”
Everyone who builds a successful web-based software product asks this question eventually, often too late but never too early.
The importance of a plan for scalability is hard to overstate, but unfortunately, it’s often left until the last moment when something has already started to fail. It’s even more important for applications that incorporate Web Real-Time Communication, or any other real-time communications technology, especially if video is an integral component.
So how do we scale WebRTC-based applications? Thankfully, it’s not that hard. You just need the right pieces in place.
How to Scale WebRTC Applications
To get started, take a step back and look at how your application uses Web Real-Time Communication. Ask yourself these questions:
- What is the maximum number of users I need to support in a session?
- What is the average number of users I expect to support in a session?
- How many users will send media within a session?
- How many users will receive media within a session?
- Will my users typically connect from residential or corporate networks?
- Will my users connect from high-security networks (e.g. hospitals)?
- What do I need to record, and with what guarantees?
These questions require some forethought and research, but they will help you decide on the technical architecture that most closely matches your needs. Once you know the answers, you can choose between four architectures: mesh, forwarding, mixing, and hybrid.
What is a Mesh Architecture?
Mesh depends almost completely on the client’s capabilities. Each client connects directly to every other client, resulting in each client managing (n-1) bidirectional connections, where n is the total number of clients. With 3 clients in a session, each client has to manage 2 separate connections. Since each connection is encrypted uniquely, each client is responsible for encrypting and uploading their media 2 separate times.
For the math-oriented, this type of architecture can be represented as a complete graph, where each vertex is a client, and each edge is a connection. The total number of connections grows quadratically, equal to n * (n-1) / 2 bidirectional connections (or n * (n-1) if each direction is counted separately), which starts out small but adds up in a hurry.
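The growth is easy to see with a few lines of arithmetic. The sketch below is only an illustration (the function names are made up for this example): it counts the connections each client manages, the number of edges in the complete graph (each bidirectional connection counted once), and the number of one-way media flows (each direction counted separately).

```python
def mesh_connections_per_client(n: int) -> int:
    """Connections a single client manages in a full mesh of n clients."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n - 1


def mesh_total_connections(n: int) -> int:
    """Edges of a complete graph on n vertices (bidirectional, counted once)."""
    return n * mesh_connections_per_client(n) // 2


def mesh_total_media_flows(n: int) -> int:
    """One-way media flows: every client uploads to every other client."""
    return n * mesh_connections_per_client(n)


for n in (2, 3, 5, 10, 20):
    print(
        f"n={n:2d}  per-client={mesh_connections_per_client(n):2d}  "
        f"connections={mesh_total_connections(n):3d}  "
        f"one-way flows={mesh_total_media_flows(n):3d}"
    )
```

Doubling the number of clients roughly quadruples both totals, which is why mesh stops being practical long before the per-client number looks alarming.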
Since most Internet connections are asymmetric (download bandwidth > upload bandwidth), upload bandwidth can become a limiting factor, especially if we’re using data-heavy video. Having to encrypt each media stream once for every connection doesn’t help, especially on mobile.
That said, mesh requires no server infrastructure beyond the requisite signaling and TURN server(s), so for applications where sessions typically involve 2-3 clients, mesh is an excellent, low-cost approach.
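To see why upload bandwidth caps the size of a mesh session, here is a rough back-of-the-envelope sketch. The bitrate and uplink figures are assumptions chosen for illustration, not measurements, and the function names are hypothetical.

```python
def mesh_upload_kbps(n_clients: int, stream_kbps: float) -> float:
    """Upload bandwidth one client needs: one copy of its stream per peer."""
    if n_clients < 1:
        raise ValueError("n_clients must be at least 1")
    return (n_clients - 1) * stream_kbps


def max_mesh_clients(uplink_kbps: float, stream_kbps: float) -> int:
    """Largest session size whose per-client upload fits within the uplink."""
    if stream_kbps <= 0:
        raise ValueError("stream_kbps must be positive")
    # (n - 1) * stream_kbps <= uplink_kbps  =>  n <= uplink // stream + 1
    return int(uplink_kbps // stream_kbps) + 1


# Assumed values: a 1500 kbps video stream and a 5000 kbps residential uplink.
STREAM_KBPS = 1500
UPLINK_KBPS = 5000

for n in range(2, 7):
    needed = mesh_upload_kbps(n, STREAM_KBPS)
    verdict = "ok" if needed <= UPLINK_KBPS else "exceeds uplink"
    print(f"{n} clients: {needed:.0f} kbps upload per client ({verdict})")

print("largest mesh that fits:", max_mesh_clients(UPLINK_KBPS, STREAM_KBPS))
```

Under these assumed numbers the uplink is exhausted at a handful of participants, which lines up with the 2-3 client sweet spot described above.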
What is a Forwarding Architecture?
Forwarding depends on a selective forwarding unit, or SFU, which acts as an intelligent media relay in the middle of a session. Every client connects to the SFU once to send media, and then once more for every other client, resulting in each client managing n unidirectional connections, where n is the total number of clients. With 5 clients in a session, each client has to manage 5 separate connections. One of those connections is reserved for sending media, while the others are exclusively for receiving media.
The total number of connections in this architecture is n². While this is more than a mesh architecture, it is far more scalable for your clients. It plays well into the asymmetric nature of most Internet connections by requiring each client to upload only once. Related to this, you are also only encrypting once, which alleviates pressure on the device itself, especially on mobile.
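The connection count for forwarding can be worked out the same way. This is a minimal sketch with hypothetical function names; it simply splits each client's connections into the one that sends and the ones that receive, then totals them across the session.

```python
from typing import Tuple


def sfu_client_connections(n: int) -> Tuple[int, int]:
    """Return (send, receive) unidirectional connections for one client."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1, n - 1


def sfu_total_connections(n: int) -> int:
    """Total unidirectional connections terminating at the SFU: n * n."""
    send, receive = sfu_client_connections(n)
    return n * (send + receive)


for n in (2, 5, 10):
    send, receive = sfu_client_connections(n)
    print(f"n={n:2d}  send={send}  receive={receive}  total={sfu_total_connections(n)}")
```

The total is larger than in a mesh, but the send side stays at one connection per client no matter how large the session gets, and the send side is the one constrained by upload bandwidth.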
One of the big advantages forwarding has over a mesh architecture is that it can employ various scaling techniques. Because it acts as a proxy between the sender and receiver, it can monitor bandwidth capabilities of each leg and selectively apply temporal (frame-rate) and spatial (resolution) scaling to the packet stream as it moves through the server.
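As a rough illustration of the selection step, the sketch below assumes the sender's stream is available to the SFU as a set of layers, each with a resolution, frame rate, and bitrate. The layer table, the `Layer` type, and `select_layer` are all hypothetical; a real SFU would also rely on its own bandwidth estimation and would avoid switching layers too often.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Layer:
    height: int  # spatial scale (vertical resolution in pixels)
    fps: int     # temporal scale (frames per second)
    kbps: int    # approximate bitrate of this layer


# Assumed layer ladder, for illustration only.
LAYERS = (
    Layer(height=180, fps=15, kbps=150),
    Layer(height=360, fps=15, kbps=350),
    Layer(height=360, fps=30, kbps=500),
    Layer(height=720, fps=30, kbps=1500),
)


def select_layer(layers: Sequence[Layer], available_kbps: float) -> Layer:
    """Pick the highest-bitrate layer that fits the receiver's estimated bandwidth.

    Falls back to the lowest-bitrate layer when nothing fits.
    """
    if not layers:
        raise ValueError("at least one layer is required")
    fitting = [layer for layer in layers if layer.kbps <= available_kbps]
    if not fitting:
        return min(layers, key=lambda layer: layer.kbps)
    return max(fitting, key=lambda layer: layer.kbps)


# Each receiving leg is evaluated independently of the others.
for receiver, estimate in {"laptop": 4000, "phone": 600, "congested": 100}.items():
    print(receiver, select_layer(LAYERS, estimate))
```

Because the decision is made per receiving leg, one participant on a poor connection does not drag down the quality delivered to everyone else.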
Forwarding requires additional server infrastructure (the SFUs themselves) but is highly efficient. An SFU doesn’t attempt to depacketize the stream (unless recording is activated) or decode the data.
What is a Mixing Architecture?
Mixing depends on a multipoint control unit, or MCU, which acts as a high-powered media mixer in the middle of a session. Every client connects to the MCU once, resulting in each client managing just a single bidirectional connection, regardless of the number of other clients present. The connection is used to send media to and receive a mix of media from the server.
Like forwarding, we are only encrypting and uploading once, but now we are only downloading and decrypting once as well. From the perspective of the clients, this is the most efficient approach. From the perspective of the server, this is the least efficient approach. The burden of depacketizing, decoding, mixing, encoding, and packetizing is borne entirely by the server. While there are many things that can be done to optimize this pipeline, it is still a fairly heavy set of operations that require significant server resources to complete in real-time. In addition, since video layout control is centralized, each participant sees the same view. This limits the ability to customize individual user experiences or separate out the view of yourself in the mix.
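The client-side difference between the three approaches can be summarized by how many media streams each client uploads and downloads. The sketch below is a simplified model that assumes every client both sends and receives; the function name and architecture labels are chosen for this example only.

```python
from typing import Tuple


def per_client_streams(architecture: str, n: int) -> Tuple[int, int]:
    """Return (uploads, downloads) per client in a session of n clients."""
    if n < 2:
        raise ValueError("a session needs at least 2 clients")
    if architecture == "mesh":
        return n - 1, n - 1   # one copy sent to, and one received from, each peer
    if architecture == "forwarding":
        return 1, n - 1       # upload once; the SFU relays everyone else's stream
    if architecture == "mixing":
        return 1, 1           # upload once; download a single mixed stream
    raise ValueError(f"unknown architecture: {architecture}")


for arch in ("mesh", "forwarding", "mixing"):
    uploads, downloads = per_client_streams(arch, 10)
    print(f"{arch:10s} uploads={uploads}  downloads={downloads}")
```

The model says nothing about server cost, which moves in the opposite direction: the fewer streams a client handles, the more work the server is doing on its behalf.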
As with forwarding, mixing can take advantage of scaling techniques. By applying temporal and spatial scaling to the output of the audio and/or video mixer(s), we can adapt quickly to changing conditions along the network routes to individual clients, keeping the quality as high as possible and without affecting other clients in the session.
For applications with large numbers of active participants, like virtual classrooms, or cases where devices are particularly bandwidth- and resource-constrained, this is a great approach, albeit a potentially expensive one in terms of server cost.
What is a Hybrid Architecture?
Hybrid architectures are, as their name implies, a combination of mesh, forwarding, and/or mixing. In a hybrid environment, participants can join a session based on whatever makes the most sense for the session or even for a particular endpoint. For simple two-party calls, a mesh setup is simple and requires minimal server resources. For small group sessions, broadcasts, and live events, forwarding will better meet your needs. For larger group sessions or telephony integrations, mixing is often the only practical option.
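One way a hybrid system might pick a mode per session is sketched below. The thresholds are assumptions: the mesh limit of 3 follows the 2-3 client guidance given earlier, while the forwarding limit of 20 is a placeholder you would tune against your own devices and networks. The function and parameter names are hypothetical.

```python
MESH_MAX_PARTICIPANTS = 3         # from the "2-3 clients" guidance for mesh
FORWARDING_MAX_PARTICIPANTS = 20  # assumed cutoff; tune for your application


def choose_architecture(active_participants: int, needs_telephony: bool = False) -> str:
    """Suggest mesh, forwarding, or mixing for a session.

    active_participants counts clients that send media; passive viewers of a
    broadcast are not counted, since they only receive.
    """
    if active_participants < 1:
        raise ValueError("a session needs at least one active participant")
    if needs_telephony:
        return "mixing"           # telephony integrations need a mixed stream
    if active_participants <= MESH_MAX_PARTICIPANTS:
        return "mesh"
    if active_participants <= FORWARDING_MAX_PARTICIPANTS:
        return "forwarding"
    return "mixing"


print(choose_architecture(2))                        # two-party call
print(choose_architecture(8))                        # small group session
print(choose_architecture(40))                       # large group session
print(choose_architecture(2, needs_telephony=True))  # call bridged to telephony
```

A production system would likely weigh more signals than head count, such as device capabilities and recording requirements, but the decision usually reduces to a small set of rules like these.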
Hardware Requirements to Scale Your WebRTC Application
No matter which architecture you choose, keep in mind that you will always need:
- A signaling server
- A TURN server
With a couple of signaling servers and a couple of TURN servers to support high availability, you’re looking at a minimum of 4 servers to start. You’re up to 6 servers once you throw in a load balancer and a back-end database or caching service.
When selecting your server hardware, we recommend a general-purpose or compute-optimized server for signaling and a network-optimized server for TURN. If you’re using Amazon Web Services, the M- and C-class servers are excellent choices. If you are using Azure, the A-class servers that are compute- or network-optimized are the way to go.
Now that you know the four different architectures, which one will you choose to scale your WebRTC-based application?