Some of you already know that I am working on a streaming platform, and I have yet another load-balancing question. You surely hate me for it by now.
How could I load balance, let's say, 100 Gbit/s without a URL change? It would probably need multiple machines. What I need is simply something that says: server A has 50 people, server B has 160 (so no more users can go to it), server C has 10 users, so send people to that one. So not even a reverse proxy.
I could do it in PHP, but that seems like a lot of work just for this tiny thing.
So how do Cloudflare and Route 53 load balance so much bandwidth?
And what should I do to limit connections per server and always select the least loaded one?
How does Twitch manage that, for example?
Balancing 100 Gbit/s+ is not an easy task. Keep in mind that such bandwidth requirements outgrow many networks, so saying "I need to deliver 100 Gbit/s" is a bit tricky. In the end, if you want to build it on your own, you need to know exactly where this traffic is coming from and where it is going, so that you can do proper load balancing across multiple regions and within each one.
A simple solution, as @ac_fan said, is geo-routing: e.g. Germany goes to Frankfurt, France to Paris, etc.
You could also use a classic load balancer to distribute the load across multiple servers within one geo region; AWS/Azure/Google have a ton of products for this, just not for cheap.
Changing the video link will be needed in most cases. If you do video streaming, you might want to host the video on another "link" so that you can change it on the fly, like Google does.
Filling up one server after another is not a good idea IMHO. You can do that within a specific region, but you would end up pushing people to non-ideal locations if you do it on a global scale.
Cloudflare just uses raw power/bandwidth in most cases, since it's mostly anycasted; they also publish a lot on their blog about how they do things.
Akamai, another large video CDN, does a lot of region-specific routing and a lot of GeoDNS as well, to guide traffic from certain regions to their PoPs.
Real-time detection of traffic requires a URL change, directly or indirectly. On the other hand, simple load balancing via DNS can be much easier with Route 53 or FlexBalancer.
You might split that into 100×1 Gbit/s, 10×10 Gbit/s, 5×20 Gbit/s, etc. At that scale, going with a CDN provider will be the better option, as not every data center can give you 100 Gbit/s in one go, especially for live streaming, without spending thousands of dollars or more.
Thanks a lot for all your replies.
I was thinking of using multiple 10 Gbit/s or even 1 Gbit/s nodes. I looked around a bit, and sadly the existing streaming services are, I think, too expensive for what they offer.
The thing is: how can I limit the connections per server (let's say I have a limit of 140 people in 1080p per server)? How can I make the load balancer say "use another machine" when it hits, for example, 900 Mbit/s on a 1 Gbit/s machine?
FlexBalancer really looks nice; I had never heard of it. I think I simply need a load balancer that stops sending clients to a machine once it reaches 1 or 10 Gbit/s. In Cloudflare you could maybe do this with health checks and reply codes on GET requests, but in my eyes that's really not a good solution.
Is it maybe possible to rate-limit machines in a load balancer? I have never tested anything other than Cloudflare, which is really limited, so maybe I can do it in Route 53 or FlexBalancer.
1. You can code it yourself, as you mentioned, with something like PHP.
2. Use a DNS service that allows you to load balance your services.
There are multiple ways to implement each solution; here are some of them:
Cron a PHP script that runs every 5 seconds or so and fetches each edge video server's bandwidth consumption. Throw those numbers into MySQL.
Let's say you are using the video.js player: just use https://redirecter.mydomain.com/video.php?id=1234ASD as your video source.
This tiny script would get the visitor's location, read the latest bandwidth usage from MySQL, compare it to the server's maximum, and 302-redirect the user to the closest, least-full server, say to something like https://us-la2.mydomain.com/video/123ASD.mp4
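The "least full and below its limit" selection could be as simple as this sketch (the thread suggests PHP; this is the same logic in Python, and all hostnames, limits and usage numbers are made up for illustration):

```python
# Hypothetical sketch of the redirector's server-selection step.
# `servers` maps hostname -> (current Mbit/s, capacity Mbit/s),
# i.e. the numbers the cron script wrote to MySQL.

def pick_server(servers, max_usage=0.9):
    """Return the least-full server below max_usage of its capacity,
    or None if every server is full."""
    candidates = [
        (used / cap, host)
        for host, (used, cap) in servers.items()
        if used / cap < max_usage
    ]
    if not candidates:
        return None
    return min(candidates)[1]  # lowest utilisation ratio wins

servers = {
    "us-la1.mydomain.com": (900, 1000),    # 90% full -> excluded
    "us-la2.mydomain.com": (200, 1000),    # 20% full
    "us-ny1.mydomain.com": (5000, 10000),  # 50% full
}
target = pick_server(servers)
# The real script would then send:
#   Location: https://<target>/video/123ASD.mp4  (HTTP 302)
print(target)
```

Filtering by the visitor's region first (so only nearby servers end up in `servers`) would give you the "closest" part of the behaviour described above.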
DNS load balancing is a bit different. Having a service that load balances with health checks becomes quite costly, as they tend to charge you per server; the more servers you have, the more you'll pay. You can always round-robin for free, but that won't help if you have 10 Gbit and 1 Gbit servers in the same mix.
With PerfOps FlexBalancer you could do geo load balancing with round-robin; the benefit is that you could have servers with different throughput in different regions. You're still forced to have the same throughput within each region, so, let's say, all US locations would have 10 Gbit and all European locations only 1 Gbit. But it's already an improvement over simple round-robin.
Personally I'd go with the 1st option: you have more control over everything, it would take about 3-4 hours to code a simple version of it, and your client would save a lot of money in the long run.
You can easily run a Hetzner storage server with all the videos, plus a bunch of nginx reverse proxy servers with a cheap CPU, low RAM, and a decently sized SSD for the cache. Set the cache to almost never expire, let nginx fill most of your SSD, and let it purge the least-accessed files when it needs space.
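A minimal nginx cache config along these lines might look like the following sketch (paths, sizes and hostnames are placeholders, not values from the thread):

```nginx
# Cache on the local SSD. nginx evicts least-recently-used entries once
# max_size is reached; "inactive" keeps untouched files for up to a year.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=videos:100m
                 max_size=400g inactive=365d use_temp_path=off;

server {
    listen 80;
    server_name us-la2.mydomain.com;   # placeholder edge hostname

    location /video/ {
        proxy_cache videos;
        proxy_cache_valid 200 365d;    # treat cached files as fresh for a year
        proxy_cache_lock on;           # only one upstream fetch per missing file
        proxy_pass http://storage.mydomain.com;  # the storage server
    }
}
```

`proxy_cache_lock` matters for live/popular content: without it, a cache miss with many simultaneous viewers would hammer the storage server with duplicate fetches.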
Each of those edge servers would run a small script that reports its bandwidth usage when the main server asks for it.
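On Linux, that reporting script can derive the outgoing rate from the interface counters in `/proc/net/dev`. A Python sketch (the interface name and sampling interval are assumptions; the thread's version could equally be PHP):

```python
# Hypothetical per-edge bandwidth reporter: sample the TX byte counter
# twice and convert the difference to Mbit/s.

import time

def tx_bytes(procnetdev_text, iface="eth0"):
    """Extract the transmitted-bytes counter for one interface
    from the contents of /proc/net/dev."""
    for line in procnetdev_text.splitlines():
        if line.strip().startswith(iface + ":"):
            fields = line.split(":", 1)[1].split()
            return int(fields[8])  # 9th value after the colon = TX bytes
    raise ValueError(f"interface {iface} not found")

def current_mbit_per_s(iface="eth0", interval=1.0):
    """Sample /proc/net/dev twice and return the outgoing rate in Mbit/s."""
    with open("/proc/net/dev") as f:
        before = tx_bytes(f.read(), iface)
    time.sleep(interval)
    with open("/proc/net/dev") as f:
        after = tx_bytes(f.read(), iface)
    return (after - before) * 8 / interval / 1_000_000
```

The main server could then fetch this number from each edge over a tiny HTTP endpoint and write it into the database.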
The redirector script could look for the closest servers and check whether they are up and whether they're busy or free (let's say < 90% bandwidth usage).
You could even have the redirector script pick the server that is most likely to already have the video cached (if there was a recent redirect to that server for the same video), in case two or more servers are free. This would reduce the traffic at your storage server, and the visitor would get a faster, snappier-loading video.
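That cache-affinity idea can be sketched in a few lines (Python again; the function and the in-memory `recent` map are illustrative assumptions, not the thread's actual code):

```python
# Hypothetical extension of the redirector: among servers that are already
# known to be up and free, prefer the one we last redirected to for the
# same video id, since its nginx cache likely holds the file.

recent = {}  # video_id -> hostname of the last redirect for that video

def pick_with_affinity(video_id, free_servers):
    """free_servers: hostnames already filtered by health and load,
    ordered by preference (e.g. closest first)."""
    preferred = recent.get(video_id)
    choice = preferred if preferred in free_servers else free_servers[0]
    recent[video_id] = choice
    return choice
```

In a real deployment `recent` would need a size bound or TTL, but the principle is the same: sticky routing per video, falling back to the normal choice when the sticky server is busy or down.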
Pretty sure Google/YouTube does something quite similar with redirector.googlevideo.com, except they have many more edge servers, many more storage servers, and maybe even some 2+-tier reverse proxy caching going on.
We can imagine a mix between 1 and 2: have a database with all servers' bandwidth consumption, and use GeoDNS and round-robin.
When you see that you are near max capacity in a region (e.g. US-West), add a server there. If you can manage your DNS through an API, it's fairly easy to add a server to the round-robin when you are close to capacity, and to remove one (or more) when all servers are using < 70% of their bandwidth capacity. Of course you'd have to adapt the numbers to your traffic patterns; if a huge spike can happen very quickly, it might be worth keeping more servers online even when there isn't that much traffic.

It would even be possible to have your script launch one or several new reverse proxy instances on the fly through cloud provider APIs when really needed (for example a big event that many people start watching). When everything is quiet you send your visitors to one or two servers; when you need more power you automatically launch reverse proxies, which get shut down after a couple of hours of low traffic. The bill stays low if there is not much traffic for days, and your setup can handle a virtually unlimited number of viewers, scaling up automatically when needed.
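The add/remove rule described above boils down to a small decision function. A Python sketch, using the thresholds from this post as placeholder values that you would tune to your own traffic:

```python
# Hypothetical scale-up/scale-down rule for one region: add a server when
# any server is near capacity, remove one only when ALL servers are below
# the low-water mark. The gap between the two thresholds (hysteresis)
# prevents the pool from flapping.

def scale_decision(usages, scale_up_at=0.9, scale_down_below=0.7,
                   min_servers=1):
    """usages: per-server utilisation ratios (0.0-1.0) for one region.
    Returns +1 (add a server), -1 (remove one) or 0 (do nothing)."""
    if max(usages) >= scale_up_at:
        return +1
    if len(usages) > min_servers and all(u < scale_down_below for u in usages):
        return -1
    return 0
```

The caller would react to `+1` by launching an instance and adding its record via the DNS API, and to `-1` by removing a record, waiting for the TTL to expire, and then shutting the instance down.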
Thanks a lot for all your answers. About Hetzner: yes, it resets every time you remove an instance and add it again, so you could definitely abuse that in some way.
But I think I will simply pay a fixed price for the machines and not do anything on demand, because for live streaming there isn't any good open-source solution except SRS (OpenSRS streaming), and sadly every time you add a node you need to hard-code it.
I am thinking of working with multiple edges (at first 4 edges and 2 origins, I think).
I will try to build the first solution, but I can see one issue with it: imagine 1k users suddenly joining. The script runs only every 5 seconds, so everyone will land on the same machine (or on 2-4 machines). Wouldn't that lag out a whole bunch of people, because there would be too many on the same machines?
On a more serious note, you also have to consider whether you'll have a backbone or not. While a backbone seems like the better solution, it can be a huge burden to maintain and can increase costs significantly.
Depending on how many servers you've got, MySQL doesn't like very write-heavy loads. Since you're just using the data as a cache that can easily be refilled (by querying all the servers again), you can probably keep it in memory somewhere rather than involving MySQL; there's no need to add the overhead of persisting it to disk.
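An in-memory replacement for that MySQL table could be as small as this sketch (Python; the class and its names are illustrative assumptions). Entries from servers that stopped reporting go stale and are ignored:

```python
# Hypothetical in-memory store for the bandwidth numbers: written by the
# polling loop, read by the redirector. A lock makes it safe to share
# between threads; a max_age drops servers that stopped reporting.

import threading
import time

class BandwidthCache:
    def __init__(self, max_age=15.0):
        self._lock = threading.Lock()
        self._data = {}          # host -> (mbit, timestamp of last report)
        self._max_age = max_age

    def report(self, host, mbit, now=None):
        now = time.time() if now is None else now
        with self._lock:
            self._data[host] = (mbit, now)

    def fresh(self, now=None):
        """Return {host: mbit} for servers that reported recently enough."""
        now = time.time() if now is None else now
        with self._lock:
            return {h: m for h, (m, t) in self._data.items()
                    if now - t <= self._max_age}
```

The staleness check doubles as a crude health check: a crashed edge simply disappears from `fresh()` after `max_age` seconds, so the redirector stops sending viewers to it.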