How do Speech Containers work and how do I set them up?
Answer: When setting up a production cluster, there are several things to consider. Running a single language or multiple containers on the same machine is supported. If you are experiencing problems, the cause may be hardware-related, so we would first look at resources, that is, CPU and memory specifications.
Consider, for example, the ja-JP container and latest model. The acoustic model is the most demanding piece CPU-wise, while the language model demands the most memory. In our benchmarks, it takes about 0.6 CPU cores to process a single speech-to-text request when audio flows in at real time (such as from a microphone). If you are feeding audio faster than real time (such as from a file), that usage can double (1.2 cores). Meanwhile, the memory listed below is operating memory for decoding speech. It does not account for the full size of the language model, which resides in the file cache. For ja-JP that is an additional 2 GB; for en-US, it may be more (6-7 GB).
If you have a machine where memory is scarce, and you are trying to deploy multiple languages on it, it is possible that file cache is full, and the OS is forced to page models in and out. For a running transcription, that could be disastrous, and may lead to slowdowns and other performance implications.
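To make the paging risk concrete, here is a minimal sizing sketch in Python. The figures (about 2 GB operating memory per container, plus roughly 2 GB of language-model file cache for ja-JP and 6-7 GB for en-US) come from the discussion above; the helper name and the exact numbers are illustrative assumptions, not official sizing guidance.

```python
# Rough memory sizing for co-locating multiple Speech containers on one host.
# Values are illustrative assumptions taken from the discussion above.

OPERATING_MEM_GB = 2                        # per-container decoding memory
MODEL_CACHE_GB = {"ja-JP": 2, "en-US": 7}   # language model size held in file cache


def required_memory_gb(languages):
    """Estimate host memory needed so models stay resident in file cache."""
    return sum(OPERATING_MEM_GB + MODEL_CACHE_GB[lang] for lang in languages)


needed = required_memory_gb(["ja-JP", "en-US"])
host_ram_gb = 8
if needed > host_ram_gb:
    print(f"Need ~{needed} GB but host has {host_ram_gb} GB: expect model paging")
```

If the estimate exceeds the host's RAM, the OS will page models in and out of file cache, which is exactly the slowdown scenario described above.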
Furthermore, we pre-package executables for machines with the advanced vector extension (AVX2) instruction set. A machine with the AVX512 instruction set will require code generation for that target, and starting 10 containers for 10 languages may temporarily exhaust the CPU. A message like this one will appear in the docker logs:
2020-01-16 16:46:54.981118943 [W:onnxruntime:Default, tvm_utils.cc:276 LoadTVMPackedFuncFromCache] Cannot find Scan4_llvm__mcpu_skylake_avx512 in cache, using JIT...
Finally, you can set the number of decoders you want inside a single container using the DECODER_MAX_COUNT variable. So, basically, we should start with your SKU (CPU/memory), and we can suggest how to get the best out of it. A great starting point is the recommended host machine resource specifications.
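As a starting point for picking a DECODER_MAX_COUNT value, here is a hypothetical helper that applies the sizing rule discussed in this FAQ: roughly two CPU cores per decoder, with one core reserved for overhead. The function name and the reserve-one-core rule are assumptions for illustration, not an official formula.

```python
# Hypothetical helper: suggest a DECODER_MAX_COUNT value for one container,
# assuming each decoder needs roughly two CPU cores (per the guidance in this
# FAQ) and one core is left for OS/container overhead. Illustrative only.

def suggest_decoder_max_count(host_cores: int, cores_per_decoder: int = 2) -> int:
    usable = max(host_cores - 1, cores_per_decoder)  # reserve a core for overhead
    return max(1, usable // cores_per_decoder)


print(suggest_decoder_max_count(8))  # eight-core DS13_v2-class machine -> 3
```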
Can you help with capacity planning for on-prem Speech Containers?
Answer: For container capacity in batch processing mode, each decoder can process audio at 2-3x real time, using two CPU cores, for a single recognition. We do not recommend keeping more than two concurrent recognitions per container instance, but we do recommend running more container instances for reliability/availability reasons, behind a load balancer.
That said, each container instance can run more decoders. For example, we may be able to set up 7 decoders per container instance on an eight-core machine (at more than 2x each), yielding about 15x throughput. There is a parameter, DECODER_MAX_COUNT, to be aware of. In that extreme case, reliability and latency issues arise, while throughput increases significantly. For a microphone, processing is at 1x real time, and overall usage should be about one core for a single recognition.
For a scenario of processing 1,000 hours/day in batch processing mode, in an extreme case, 3 VMs could handle it within 24 hours, but this is not guaranteed. To handle spike days, failover, and updates, and to provide minimum backup/BCP, we recommend 4-5 machines per cluster instead of 3, and 2+ clusters.
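The arithmetic behind that 3-VM estimate can be sketched as follows, assuming 7 decoders per VM each running at about 2x real time (figures from the answer above; treat them as planning assumptions, not guarantees):

```python
# Worked capacity estimate for 1,000 audio-hours/day in batch mode, using
# assumptions from this FAQ: 7 decoders per VM, each at ~2x real time.

DECODERS_PER_VM = 7
SPEED_FACTOR = 2      # each decoder processes ~2x real time
HOURS_PER_DAY = 24


def audio_hours_per_day(vms: int) -> int:
    return vms * DECODERS_PER_VM * SPEED_FACTOR * HOURS_PER_DAY


print(audio_hours_per_day(3))  # 1008 -> 3 VMs just cover 1,000 hours, no headroom
print(audio_hours_per_day(5))  # 1680 -> spare capacity for spikes and failover
```

With 3 VMs the theoretical capacity barely exceeds 1,000 hours, which is why the recommendation above is 4-5 machines per cluster.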
For hardware, we use the standard Azure VM DS13_v2 as a reference (each core must be 2.6 GHz or faster, with the AVX2 instruction set enabled).
The design reference includes two clusters of 5 VMs to handle 1 K hours/day audio batch processing.
When mapping to physical machines, a general estimate is 1 vCPU = 1 physical CPU core. In reality, 1 vCPU is more powerful than a single core.
For on-prem, all of these additional factors come into play:
- The type of the physical CPU and how many cores it has
- How many CPUs run together on the same box/machine
- How VMs are set up
- How hyper-threading / multi-threading is used
- How memory is shared
- The OS, etc.
Normally an on-prem environment is not as well tuned as the Azure environment. Considering other overhead, a safe estimate is 10 physical CPU cores = 8 Azure vCPUs, though popular CPUs have only eight cores. With on-prem deployment, the cost will be higher than using Azure VMs. Also, consider the depreciation rate.
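That rule of thumb (10 physical cores ≈ 8 Azure vCPUs) can be written as a one-line conversion; the helper name is hypothetical and the ratio is the rough estimate stated above, not a measured constant.

```python
# Illustrative conversion using the rule of thumb above:
# 10 on-prem physical CPU cores ~= 8 Azure vCPUs.

def physical_cores_needed(azure_vcpus: int) -> float:
    return azure_vcpus * 10 / 8


print(physical_cores_needed(8))  # matching a DS13_v2 (8 vCPUs) -> 10.0 physical cores
```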
Service cost is the same as the online service.
What are the recommended resources, CPU, and RAM for 50 concurrent requests?
Answer: At real time, we can support eight concurrent requests with our latest en-US model, so we recommend using more docker containers beyond six concurrent requests. Scaling becomes less predictable beyond 16 cores, where the workload is non-uniform memory access (NUMA) node sensitive. The following table describes the minimum and recommended allocation of resources for each Speech container.
| Container | Minimum | Recommended |
|---|---|---|
| Speech-to-text | 2 core, 2-GB memory | 4 core, 4-GB memory |
- Each core must be at least 2.6 GHz or faster.
- For files, throttling will be in the Speech SDK, at 2x real time (the first 5 seconds of audio are not throttled).
- The decoder is capable of about 2-3x real time. At that rate, overall CPU usage will be close to two cores for a single recognition, which is why we do not recommend keeping more than two active connections per container instance. The extreme side would be to run about 10 decoders at 2x real time on an eight-core machine such as DS13_v2. For container version 1.3 and later, there is a parameter you could try setting, DECODER_MAX_COUNT.
- For a microphone, processing is at 1x real time. The overall usage should be about one core for a single recognition.
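The file-throttling bullet above implies a simple wall-clock estimate for streaming a file through the SDK: the first 5 seconds go out unthrottled, and the rest is sent at 2x real time. The sketch below models only the streaming time under those assumptions (decoding speed is not modeled, and the function name is illustrative):

```python
# Estimated wall-clock time to stream a file through the Speech SDK, assuming
# the first 5 seconds of audio are sent unthrottled and the remainder at 2x
# real time (per the bullet above). Illustrative only; decoding is not modeled.

def estimated_stream_seconds(audio_seconds: float,
                             unthrottled: float = 5.0,
                             speed: float = 2.0) -> float:
    if audio_seconds <= unthrottled:
        return 0.0  # the whole file is sent immediately
    return (audio_seconds - unthrottled) / speed


print(estimated_stream_seconds(60))  # one minute of audio -> ~27.5 s to stream
```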
Consider the total number of hours of audio you have. If the number is large, to improve reliability/availability, we suggest running more instances of containers, either on a single box or on multiple boxes, behind a load balancer. Orchestration could be done using Kubernetes (K8S) and Helm, or with Docker compose.
As an example, to handle 1,000 hours of audio per 24 hours, we recommend setting up 3-4 VMs, with 10 instances/decoders per VM.
For more information on on-premises speech services, please contact our support team at firstname.lastname@example.org.