Enter Model Details

Model Name

Provide a descriptive name for your model to identify it within your workspace.

Source

Specify where the model will be fetched from. Choose from the following options:

Public Sources

  • HuggingFace Model Hub
    Provide the repository path in the format creator/model-slug.
    Example: meta-llama/Llama-3.2-3B-Instruct
    To find a model path on HuggingFace:
      1. Visit huggingface.co.
      2. Use the search bar to find the desired model (e.g., “whisper-large”).
      3. Click on the model you want in the search results (e.g., openai/whisper-large-v3-turbo).
      4. Copy the model path displayed at the top of the page (e.g., openai/whisper-large-v3-turbo).
    The model path on HuggingFace always follows the format creator/model-slug; a quick way to verify a path is sketched after this list.
  • Public URL
    Provide a direct, publicly accessible download link to the model.
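If you want to double-check a repository path before entering it, the sketch below uses the huggingface_hub Python library to confirm that the path exists. This is purely illustrative; the import flow itself does not require it, and gated repositories (such as meta-llama models) may additionally need an access token.

```python
# Illustrative sketch only: confirm that a creator/model-slug path exists on the
# HuggingFace Hub before entering it as the source. Gated repositories (e.g.
# meta-llama/*) may also require an HF access token.
from huggingface_hub import HfApi
from huggingface_hub.utils import RepositoryNotFoundError

def check_model_path(repo_id: str) -> None:
    """Print basic info for a repository path, or a hint if it does not exist."""
    try:
        info = HfApi().model_info(repo_id)
        print(f"Found {info.id} (pipeline tag: {info.pipeline_tag})")
    except RepositoryNotFoundError:
        print(f"{repo_id} not found -- check the creator/model-slug spelling")

check_model_path("openai/whisper-large-v3-turbo")
```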

Cloud Storage

Cloud storage sources require authentication credentials (configured as Secrets in your workspace). A quick way to verify a bucket path before importing is sketched after this list.
  • AWS S3
    Enter the S3 bucket path (e.g., s3://my-bucket/models/my-model)
  • GCP GCS
    Enter the Google Cloud Storage bucket path (e.g., gs://my-bucket/models/my-model)
  • Shakti Cloud S3
    Provide the Shakti Cloud S3 path where your model is stored (e.g., s3://my-bucket/models/my-model)
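
As a sanity check, a bucket path such as s3://my-bucket/models/my-model can be listed with boto3 before it is registered as a source. This is an illustrative sketch, assuming boto3 is installed and the same AWS credentials you store as workspace Secrets are available locally; a GCS path can be checked analogously with the google-cloud-storage client.

```python
# Illustrative sketch only: list a few objects under an s3://bucket/prefix path
# to confirm the model files are where you expect them. Assumes boto3 is
# installed and AWS credentials (matching your workspace Secrets) are available.
import boto3

def list_model_files(s3_uri: str, max_keys: int = 10) -> list[str]:
    """Return up to max_keys object keys under an s3://bucket/prefix path."""
    bucket, _, prefix = s3_uri.removeprefix("s3://").partition("/")
    response = boto3.client("s3").list_objects_v2(
        Bucket=bucket, Prefix=prefix, MaxKeys=max_keys
    )
    return [obj["Key"] for obj in response.get("Contents", [])]

print(list_model_files("s3://my-bucket/models/my-model"))
```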

Model Class

Select the appropriate model class based on your model architecture (e.g., LlamaForCausalLM for Llama-series models).
This field is automatically populated when importing models from HuggingFace.
Note: Only instruct-style models are supported in the model compilation step for LLMs. These are typically chat-optimized models and are often identified by the -Instruct suffix in their names (e.g., meta-llama/Llama-3.2-3B-Instruct). Base models such as meta-llama/Llama-3.2-3B (without the -Instruct suffix) are not supported.
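
If you are unsure which model class applies, it normally matches the architectures entry in the model's config.json. The sketch below, assuming the transformers library is installed, shows one way to inspect it for a HuggingFace model.

```python
# Illustrative sketch only: the Model Class field typically corresponds to the
# "architectures" entry in the model's config.json. Gated repositories may
# require an access token.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
print(config.architectures)  # e.g. ['LlamaForCausalLM']
```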

Optimizing Infrastructure

  • Configure the infrastructure used to optimize the model, including the compute resources it will run on and the optimization techniques to apply.

Configuration

  • Select the desired quantization format, FP16 or AWQ, based on your performance and resource requirements:
    • FP16 (Half-Precision): Offers higher precision and accuracy, but requires more GPU memory and compute power.
    • AWQ (Activation-aware Weight Quantization): Reduces model size and memory usage with minimal impact on accuracy, making it suitable for resource-constrained environments.
  • The optimization, model, and pipeline configurations are auto-filled based on the details provided earlier. You may modify them if required to suit your deployment needs.
  • Finalize the model’s configuration by setting any additional parameters or preferences required for deployment. A hypothetical example of such a configuration is sketched after this list.
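
To make the FP16/AWQ trade-off concrete, the sketch below shows a hypothetical compilation configuration. The field names are illustrative only and do not reflect the platform's actual schema.

```python
# Hypothetical example only: these field names illustrate the kind of values
# that are auto-filled, not the platform's actual configuration schema.
compile_config = {
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "model_class": "LlamaForCausalLM",
    # "fp16" keeps half-precision weights (higher accuracy, more GPU memory);
    # "awq" uses activation-aware weight quantization (smaller, minimal accuracy loss).
    "quantization": "awq",
    "max_batch_size": 8,          # illustrative pipeline setting
    "max_sequence_length": 4096,  # illustrative pipeline setting
}
print(compile_config)
```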