Required Infrastructure Configuration

Parameter         Value            Notes
Accelerator Type  Model-dependent  Select a type based on model size (see below)

✅ Always choose the (any) variant for better availability (applies if you have selected the managed infrastructure option as your cluster choice).

What is Tensor Parallelism (TP)?

Tensor Parallelism allows a model’s computation to be split across multiple GPUs or devices, enabling:
  • Faster inference for large models
  • Support for models too large to fit on a single device
tensor_parallel_size controls how many devices the computation is split across:
  • 1 = No tensor parallelism (single GPU)
  • 2, 4, 8, etc. = Enable TP across multiple GPUs
You must have at least as many GPUs as the value set in tensor_parallel_size.
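As a rough sketch of this rule (the helper function and the GPUs-per-node parameter are illustrative, not part of the platform), the check looks like this:

```python
# Hypothetical helper (not a Shakti Studio API): verify that enough GPUs
# are provisioned to satisfy the requested tensor_parallel_size.
def validate_tp(tensor_parallel_size: int, accelerator_count: int,
                gpus_per_node: int = 1) -> bool:
    """Return True if the total GPU count can satisfy the TP setting."""
    total_gpus = accelerator_count * gpus_per_node
    return tensor_parallel_size >= 1 and total_gpus >= tensor_parallel_size

# A single GPU cannot run TP=2:
print(validate_tp(tensor_parallel_size=2, accelerator_count=1))  # False
# Two accelerators with one GPU each can:
print(validate_tp(tensor_parallel_size=2, accelerator_count=2))  # True
```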

Model Mode Types and Compatibility

Different model versions require specific modes to function correctly. When selecting models from Hugging Face, you’ll typically find two variants:
  • Base models (e.g., meta-llama/Meta-Llama-3-8B) - designed for completion tasks.
  • Instruct models (e.g., meta-llama/Meta-Llama-3-8B-Instruct) - optimized for conversational/chat interactions.
Configuration Rule: The pipeline mode must match the model type; otherwise compilation will fail. For instance, using a base completion model like meta-llama/Meta-Llama-3-8B with chat mode will cause errors; you must set the mode to completion instead.

Pipeline Mode Settings: Configure the mode in your pipeline based on your intended use case:
  • Set mode to embedding when compiling embedding models
  • Set mode to chat when compiling conversational/instruct models
  • Set mode to completion when compiling base/completion models
The key is ensuring alignment between your model choice and the corresponding pipeline mode configuration.
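As a rough illustration of this alignment (the helper and its name-based heuristic are hypothetical, not a Shakti Studio API), mode selection could be sketched as:

```python
# Hypothetical heuristic (not a Shakti Studio API): infer the pipeline
# mode from the Hugging Face model id being compiled.
def pick_mode(model_id: str) -> str:
    name = model_id.lower()
    if "embed" in name:                       # embedding models
        return "embedding"
    if "instruct" in name or "chat" in name:  # conversational/instruct models
        return "chat"
    return "completion"                       # base/completion models

print(pick_mode("meta-llama/Meta-Llama-3-8B-Instruct"))  # chat
print(pick_mode("meta-llama/Meta-Llama-3-8B"))           # completion
```

In practice, always confirm the variant on the model's Hugging Face page rather than relying on its name alone.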

Instance Type Selection Guide

Use the table below to guide instance and TP configuration based on your model size:
Model                Recommended Instance Type  Suggested TP  Notes
Gemma 2B             L40s (any)                 1             Lightweight model; fits on a single GPU
LLaMA 3B             L40s (any)                 1             Also fits on a single GPU with headroom
Gemma 7B / LLaMA 8B  H100 (any)                 2–4           Benefits from a multi-GPU setup
LLaMA 70B            H100 (any)                 2–4+          Requires high TP and multi-GPU infrastructure
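The table above can be mirrored as a small lookup (the dictionary keys and helper are illustrative; the TP values shown use the lower bound of each suggested range):

```python
# Illustrative lookup mirroring the instance selection guide above.
# Keys and the helper itself are hypothetical, not a platform API.
RECOMMENDATIONS = {
    "gemma-2b":  ("L40s (any)", 1),
    "llama-3b":  ("L40s (any)", 1),
    "gemma-7b":  ("H100 (any)", 2),
    "llama-8b":  ("H100 (any)", 2),
    "llama-70b": ("H100 (any)", 4),
}

def recommend(model_key: str):
    """Return (instance_type, suggested_tp) for a known model size."""
    return RECOMMENDATIONS[model_key]

print(recommend("llama-70b"))  # ('H100 (any)', 4)
```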

Common Issues & Fixes

🚫 Job stuck in queue or not scheduled (machine not available)
Cause: Accelerator type too specific or unavailable.
✅ Fix: Use the (any) variant of the accelerator type where available to improve scheduling flexibility.

🚫 Out of memory / crashes
Cause: Model too large for a single GPU.
✅ Fix: Increase tensor_parallel_size, switch to a higher-memory accelerator type, or increase the accelerator count.

🚫 TP value ignored or job fails to start
Cause: TP set higher than the number of available GPUs.
✅ Fix: Ensure the total number of GPUs provisioned (accelerator count × GPUs per node) is at least the configured tensor_parallel_size.

🐢 Slow inference
Cause: Underutilized hardware or no parallelism.
✅ Fix: Tune tensor_parallel_size and, where needed, use multi-GPU configurations to better utilize available accelerators.
🚫 Unsupported model type
Error message: The given model path is invalid
Cause: The model you are trying to compile is not currently supported by the Shakti Studio platform.
✅ Fix: Raise a support ticket from your Shakti Studio account at Shakti Studio Support, and we will check the feasibility and add support for the model.
🚫 Incorrect mode type in pipeline config
Model compilation fails even though the correct model and other parameters are selected.
Cause: The selected model mode in the pipeline config may be incorrect.
✅ Fix: Configure the mode in your pipeline based on your intended use case:
  • Set mode to embedding when compiling embedding models.
  • Set mode to chat when compiling conversational/instruct models.
  • Set mode to completion when compiling base/completion models.
The key is ensuring alignment between your model choice and the corresponding pipeline mode configuration.
🚫 Machine clean-up failed
Error message: Cleanup Failure: Exception occurred while cleaning up: Error cleaning up Azure resource group
✅ Fix: Raise a support ticket from your Shakti Studio account at Shakti Studio Support, and we will investigate the cause of the failure and help you resolve the issue.

FAQs

  1. Can I edit the model name later? No, renaming a model after it’s been added and compiled is not supported. The model name must be set during the initial setup.

  2. What options are available for model sources? You can choose HuggingFace, where the base model is downloaded directly, or use AWS S3, GCS, or DockerHub by providing the appropriate path (S3 URL, GCS URL, or DockerHub registry link) along with the required credentials so we can retrieve your custom model.

  3. Do I need authentication keys for external sources? Yes, for model providers like Shakti Cloud S3, GCS, and DockerHub, you must add your authentication keys on the Secrets page in the Shakti Studio platform and use those credentials during the compilation process. For detailed instructions on how to add and manage secrets, see the Secrets documentation.

  4. Why am I getting a “The given model path is invalid” error? How do I verify if my model path is valid? While the Shakti Studio platform supports most LLM models, this error can occur if a particular model type isn’t supported yet. If you encounter this issue, please raise a ticket from your Shakti Studio account via Shakti Studio Support, and we’ll work on enabling support for your model.

  5. How do I link my AWS/GCP/Azure account? You will have to add your cloud account details in the Integrations section. For details, see the Secrets documentation.

  6. Can I use multiple cloud accounts? Yes, the Shakti Studio platform supports adding and managing multiple cloud accounts.

  7. How do I choose the right accelerator for my model?

    The appropriate accelerator depends largely on the size of your model. Larger models require higher-spec GPUs (and often higher accelerator counts) for optimization and deployment, while smaller models can run efficiently on mid-range accelerators.

  8. What happens if I run out of quota for GPUs in my Shakti Studio account?

    If your Shakti Studio account lacks sufficient GPU quota, the optimization job may fail to start or could get stuck partway through, leading to a failed optimization process.

  9. What does accelerator count mean?

    Accelerator count refers to the number of GPU instances allocated for a job. For example, if you select H100 as the accelerator and set the count to 2, two H100 machines will be provisioned. This is especially important when the tensor parallelism (TP) value is greater than 1.

  10. How do I know which accelerator configuration to select?

    In Shakti Studio, you primarily choose the accelerator type (for example, NVIDIA A10G, A100, or H100) and the accelerator count for your job. Larger models and higher tensor parallelism values generally benefit from higher-end GPUs and more accelerators, while smaller models can run efficiently on mid-range GPUs with fewer accelerators. Review your model size and TP settings, then pick an accelerator type and count that provide enough GPU memory and throughput for your workload.

  11. What is quantization and why should I use it? Quantization is the process of reducing the precision of the numbers used to represent a language model’s parameters (e.g., from 32-bit floating point to 8-bit integers) to make the model smaller and faster, with minimal loss in accuracy. This is helpful for running large language models (LLMs) efficiently.

  12. Which quantization levels are supported?

    We support FP16, FP8, and AWQ quantization.

    Note that FP8 is not supported on Ampere architecture GPUs like A100 and A10G, as these devices do not natively support FP8 precision.

  13. Does quantization affect accuracy?

    Yes, quantization can result in a slight reduction in model accuracy.
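The memory arithmetic behind quantization, and the FP8-on-Ampere caveat from the FAQ above, can be sketched as follows (the constants and helper names are illustrative, not platform APIs):

```python
# Back-of-the-envelope sketch (illustrative, not platform code): estimate
# model weight memory at different precisions, and flag GPUs without
# native FP8 support.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}
AMPERE_GPUS = {"A100", "A10G"}  # Ampere devices lack native FP8 support

def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight memory in GB (ignores activations and KV cache)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

def fp8_supported(gpu: str) -> bool:
    return gpu not in AMPERE_GPUS

# An 8B-parameter model needs roughly 16 GB of weights at FP16, 8 GB at FP8:
print(weight_memory_gb(8e9, "fp16"))  # 16.0
print(fp8_supported("A100"))          # False
print(fp8_supported("H100"))          # True
```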