For many startups, video transcoding is a compute-intensive task.
Many rely on cloud services, while others opt to build their own cluster or farm for encoding video. If you have an office of 60 Macs, you can turn those workstations into encoding nodes that run during off-hours or idle times. In this scenario, it is more cost-effective to build an in-house farm than to rely on cloud services. The concept of a render farm is rather simple: a cluster of machines dedicated to the specific task of converting videos. Building a transcoding render farm requires a diverse set of disciplines: application programming, database schema design, and provisioning of compute services.
If you want to build the next Vimeo or YouTube, here is a high-level practical guide. The idea of a transcoding cluster is not limited to video; this guide is just as useful for image manipulation, audio processing, bulk automation, or any compute-intensive process behind a web service. If you want to build an army of Twitter bots, parts of this guide will be useful. In short, this guide provides a high-level roadmap for building your own farm. Since every use case is different, you will need to adapt the database and application to fit your needs.
First, some definitions. A Node is a worker machine in your cluster; a farm can have many of them. The main node is called the Traffic Controller, and its job is to delegate and monitor jobs. The traffic controller spools up a Queue, the list of jobs to be processed. Lastly, there is the Repo, short for repository, where the files are stored: a network share, a cloud store like an Amazon S3 bucket, or a local path.
The traffic controller is the brains of the operation, and it can be built on any platform you choose. In this example, a simple LAMP stack will suffice; some prefer to implement a Node.js or Python/Bash system. Whatever you choose, you will need a database. In this example, the MySQL schema contains a baseline minimum of two tables: an asset table and a queue table. The asset table stores information about the file, its location, and other attributes. Here is an example of an asset table that supports multiple file types. If you were building an Instagram bot, you would replace the asset table with a user table.
The queue table is how files are delegated. It contains a reference by ID to the asset table, an output path, start/end times of a task, a completion status, the PID, and the name of the originating node. There may be additional attributes: post-render actions, where to pull the conversion parameters passed to FFMPEG, or responses to failure events.
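A minimal sketch of the two baseline tables might look like the following. Column names here are illustrative assumptions, not the exact columns from the article's screenshots, and SQLite is used in place of MySQL purely so the sketch is self-contained:

```python
import sqlite3

# Illustrative asset and queue tables; column names are assumptions.
SCHEMA = """
CREATE TABLE asset (
    id         INTEGER PRIMARY KEY,
    file_name  TEXT NOT NULL,
    repo_path  TEXT NOT NULL,        -- location in the repo (share, S3, local)
    file_type  TEXT                  -- video, pdf, image, ...
);

CREATE TABLE queue (
    id         INTEGER PRIMARY KEY,
    asset_id   INTEGER NOT NULL REFERENCES asset(id),
    out_path   TEXT,
    config_id  INTEGER,              -- which transcode config row to use
    status     TEXT DEFAULT 'open',  -- open, active, complete, failed
    start_time TEXT,
    end_time   TEXT,
    pid        INTEGER,              -- process ID of the task on the node
    node_name  TEXT                  -- which node claimed the job
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

New queue rows default to an `open` status, which is what the nodes poll for.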
The important fields to note are the PID and the node name. The PID records the process ID of the running task, and node_name tells us which client is doing the encoding. Because the system knows the PID (process ID), in the event of a failure the traffic controller can issue a command to the remote node to kill the task.
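As a sketch, the kill command the traffic controller sends could be as simple as an `ssh` invocation, assuming passwordless SSH keys are in place (as described in the provisioning notes later). The function name and node naming are hypothetical:

```python
def build_kill_command(node_name: str, pid: int) -> list:
    """Build the ssh command the traffic controller would issue to
    terminate a runaway task on a remote node. Assumes passwordless
    SSH keys are already baked into the node builds."""
    if pid <= 0:
        raise ValueError("refusing to kill without a valid PID")
    return ["ssh", node_name, "kill", "-9", str(pid)]

# e.g. subprocess.run(build_kill_command("node07", 4412), check=True)
```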
The job of the traffic controller is very simple. Its primary purpose is to maintain the spool of jobs in the queue and serve it as a list. It acts as the manager, handing out the open jobs in the database when a node asks for them, and it records the actions each node reports. In a small cluster, the traffic controller may also be the main server that ingests and stores the files.
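The one subtlety in handing out jobs is making sure two nodes cannot claim the same row. A minimal sketch, again using SQLite and illustrative column names rather than the article's exact schema:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE queue (id INTEGER PRIMARY KEY, "
    "status TEXT DEFAULT 'open', node_name TEXT, start_time REAL)"
)

def claim_next_job(conn, node_name):
    """Mark the oldest open job as active for `node_name` and return
    its id, or None when the queue is empty ('nojobs')."""
    row = conn.execute(
        "SELECT id FROM queue WHERE status='open' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # The status check in the WHERE clause guards against another
    # node having claimed the same row in the meantime.
    cur = conn.execute(
        "UPDATE queue SET status='active', node_name=?, start_time=? "
        "WHERE id=? AND status='open'",
        (node_name, time.time(), row[0]),
    )
    conn.commit()
    return row[0] if cur.rowcount == 1 else None
```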
When a file is uploaded, it is recorded in the asset table and a new record is inserted into the queue table. The traffic controller may also handle post-processing. For example, an uploaded video may trigger three queued tasks: thumbnail generation, a series of preview frames, and the final encode in a web-friendly format. When the first task (the thumbnail) completes, the traffic controller may need to spawn the next tasks in the sequence.
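That chaining logic can be kept very small. The stage names below are taken from the three-task example above, but the pipeline structure itself is a hypothetical sketch:

```python
# Ordered pipeline of tasks for an uploaded video; when one stage
# completes, the traffic controller enqueues the next.
PIPELINE = ["thumbnail", "preview_sequence", "final_encode"]

def next_stage(completed_stage: str):
    """Return the stage to enqueue after `completed_stage`,
    or None when the pipeline is finished."""
    i = PIPELINE.index(completed_stage)
    return PIPELINE[i + 1] if i + 1 < len(PIPELINE) else None
```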
The following screenshot shows an example of a small transcoding schema with the asset table, a queue, metadata (for keywords), and a config table. The config table in this example holds transcoding settings for video, PDF, and image manipulation, so transcoding does not have to be limited to video.
Here is an example of a queue table (called job) in an Instagram bot application. The same principle applies: a list of tasks. The same concept can be used for mass-scale bot automation. For an Instagram bot, you would replace the transcode_config with a scope config table for things like auto-liking, following, un-following, and scrape tasks.
Now to the render nodes. Nodes are purpose-built task workers with minimal OS builds. Often lightweight Linux VMs (virtual machines) provisioned just for encoding, the nodes run an application layer that communicates with the traffic controller. The app is typically a runner script that fetches instructions and executes whatever commands are necessary. Provisioning requires building a common environment for all the nodes; the subject is out of scope for this article. In short, you want a lightweight client that can be duplicated dozens to hundreds of times. In the past, I have deployed small Debian builds running Python or small LAMP stacks with the necessary FFMPEG binaries, plus others such as GhostScript and ImageMagick for different types of encoding. The builds are OVF virtual machine guests; others have opted for Docker or Vagrant. Choose whatever works for you. The only post-installation step is setting the hostname. SSH keys, network mount points, and tokens are all baked into the builds for rapid deployment.
The template for these builds is rather straightforward. An init.d daemon startup process invokes a runner script, which runs in a loop and writes its actions to log files. Some opt for a recurring cron job that invokes the runner script at intervals, while others run it in a continuous loop. This template can be reused for other automation, such as an Instagram or Twitter bot; automation bots and nodes employ the same basic premise. The runner script uses simple webhooks, cURL, or an API layer to get its job list from the traffic controller.
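The core of such a runner script is a poll-sleep loop. In this sketch the transport (cURL, webhook, API call) and the job execution are injected as functions, so only the loop logic is shown; the parameter names are hypothetical:

```python
import time

def runner_loop(fetch_job, run_job, poll_seconds=5, max_polls=None):
    """Minimal sketch of a node's runner loop. `fetch_job` asks the
    traffic controller for an open job, returning None or "nojobs"
    when the queue is empty; `run_job` executes a claimed job.
    `max_polls` bounds the loop for testing; a real node runs forever."""
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        job = fetch_job()
        if job and job != "nojobs":
            run_job(job)
        else:
            time.sleep(poll_seconds)  # idle: wait before polling again
```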
Here is a Debian example. The path names should be self-explanatory. The init script runs a looping shell script that writes log files and, lastly, invokes a Python or PHP shellrunner script.
Multiple nodes poll the traffic controller for open jobs, and when a file is available, the first free node claims the job and initiates the render. Here is a visual diagram of the steps involved. When a file is uploaded into the system, it goes through the following steps: 1. The asset is uploaded to the repo and recorded in the database. 2. The file is written to the queue. 3. Concurrently, multiple nodes poll for open jobs, and a single free node renders a job it sees open. 4. Once complete, the node informs the traffic controller. 5. The last step usually involves writing another database record linking the rendered file to the master file.
In between these steps, problems can occur. Strange or malformed files may cause a stray render that never finishes, or a node could lock up completely. In the event of a failure, the traffic controller can kill the queued job, re-initiate it, or delegate it to another node. This is why the queue table's schema must cover open jobs, active status, start/end times, and the PID: with a known PID, a node can be instructed to kill a rogue task. Below is a visual representation of what happens after a file is loaded. The command-line shell shows the running log while the database client shows the file being written to the queue. In the shellrunner console, the status response is usually empty or "nojobs," but when a new record is inserted into the queue, an available node knows it needs to make a thumbnail preview.
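Detecting a stray render is a simple sweep over the queue. A sketch of that check, with an assumed two-hour timeout and illustrative field names:

```python
import time

STALE_AFTER = 2 * 60 * 60  # assumption: a render older than 2 hours is stuck

def find_stale_jobs(jobs, now=None):
    """Return queue rows that went active but never finished within
    the window. Each job is a dict with 'status', 'start_time'
    (epoch seconds), 'end_time', 'pid' and 'node_name'; the traffic
    controller would then tell each job's node to kill its PID."""
    now = now or time.time()
    return [
        j for j in jobs
        if j["status"] == "active"
        and j.get("end_time") is None
        and now - j["start_time"] > STALE_AFTER
    ]
```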
An Example of Transcoding using PHP
When a node gets an instruction to transcode video, it may run different types of encoding or transcoding. It is often good to have a look-up table with the binaries and the parameters.
The screenshot above shows some examples for video, an MP4 podcast converter, and exploding PDFs. For example, if you wanted to convert an original video file into an MP3 suitable for iTunes podcasts, record 7 has the necessary settings and the binary to run.
$ffmpeg_arg = " -i [inpath] -vn -acodec libmp3lame -ac 2 -ab 160k -ar 48000 "; // from database
When a node polls the traffic controller, it learns the source file, and the application has the logic to create the output paths. In PHP, it runs a background shell command and returns the PID, which is sent back to the traffic controller. Here is a PHP example: it pulls the transcode arguments from the database, replaces the placeholder paths, and executes the command, capturing the PID.
$transcode_cli = "/usr/bin/ffmpeg" . $ffmpeg_arg;
$transcode_cli = str_replace("[inpath]", $inpath, $transcode_cli);   // $inpath: source file (variable name assumed)
$transcode_cli = str_replace("[outpath]", $outpath, $transcode_cli); // $outpath: destination (variable name assumed)
$transcode_cli .= " > /dev/null 2>&1 & echo $!"; // background the job and echo its PID
$pid = exec($transcode_cli, $output);
In some scenarios, the node can capture the output of the running process and relay it back to the traffic controller, where an admin can view the verbose logging to watch the progress of a render.
Practical real-world usage
Here is an example of a markup application that lets users scrub across frames of individual files. As the user specifies a point in time, the transcoder spools a queue to generate 15 preview frames. On a single machine, the re-draw can take some time, and running background FFMPEG processes will bog down your primary server. But with an army of nodes at its disposal, the system creates multiple preview snapshots in the background.
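A node handling one of those preview tasks just needs to assemble an FFMPEG command for a burst of frames. The flags and output pattern below are a plausible sketch, not the application's exact settings:

```python
def preview_frames_cmd(src, out_dir, start_sec, frames=15):
    """Build an ffmpeg command that grabs `frames` consecutive frames
    starting at `start_sec`, as one node task for the scrub preview.
    -ss before -i seeks first; -frames:v limits the frame count."""
    return [
        "ffmpeg",
        "-ss", str(start_sec),
        "-i", src,
        "-frames:v", str(frames),
        f"{out_dir}/preview_%02d.jpg",  # numbered JPEG sequence
    ]

# e.g. subprocess.run(preview_frames_cmd("/repo/a.mov", "/tmp/p", 12))
```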
In closing, this guide illustrates that the notion of a render/cluster system is quite basic. The formula is simple in form and rather easy to set up if you know where to begin. How you deploy your own model is up to you.