This guide walks you through the end-to-end process of adding, configuring, and compiling a model on the platform.

Enter Model Details

Navigate to Add Model to register a new model.
Add Model form with name, description, model source, path, and model class fields

Name (Required)

A unique identifier for the model within the platform.

Description (Optional)

Short note about the model’s purpose.

Model Source (Required)

Specifies where the model artifacts are loaded from.
  • Hugging Face (HF) – Load directly from the Hugging Face Hub
  • Shakti Studio S3 – Load from an S3 bucket
  • GCP GCS – Load from a Google Cloud Storage bucket
  • Public URL – Load from a publicly accessible URL
If using Hugging Face, you can enable Private Model for gated or private repositories.
If you use a private model from Hugging Face, add your Hugging Face access token as a secret on the platform first, then select the appropriate linked secret when configuring the model.

Model Path (Required)

The exact path to the model.
  • Hugging Face: meta-llama/Llama-3.1-8B-Instruct
  • AWS S3: s3://my-bucket/model/
  • GCP GCS: gs://my-bucket/model/
  • Public URL: https://<host>/model
The platform verifies the path automatically.
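Although the platform verifies the path on submit, the expected formats above can also be sanity-checked client-side. The sketch below is illustrative only and is not the platform's own validation logic:

```python
import re

# Illustrative client-side check of the path formats listed above;
# the platform still performs its own verification on submit.
PATTERNS = {
    # Hugging Face repo id: "org/name", e.g. meta-llama/Llama-3.1-8B-Instruct
    "huggingface": r"^[\w.-]+/[\w.-]+$",
    "s3": r"^s3://[a-z0-9.-]+/.+",
    "gcs": r"^gs://[a-z0-9.-]+/.+",
    "url": r"^https?://.+",
}

def looks_valid(source: str, path: str) -> bool:
    return re.match(PATTERNS[source], path) is not None

print(looks_valid("huggingface", "meta-llama/Llama-3.1-8B-Instruct"))  # True
print(looks_valid("s3", "s3://my-bucket/model/"))                      # True
print(looks_valid("gcs", "not-a-gcs-path"))                            # False
```

A check like this catches the common mistake of pasting a full Hub URL where a bare repo id is expected.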

Model Class (Required)

Defines the pipeline or architecture class used to load the model.
  • LlamaForCausalLM → LLMs
  • WhisperForConditionalGeneration → Speech models
  • FluxPipeline → Diffusion models
  • CustomPipeline → Custom or non-standard pipelines
This is usually auto-selected based on the model source and path.
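For Hugging Face models, this auto-selection can be pictured as reading the `architectures` field of the repo's `config.json`. The mapping below is a simplified stand-in for the platform's actual table, not its real implementation:

```python
import json

# Simplified sketch: map a config.json "architectures" entry to a model
# class category; unknown architectures fall back to CustomPipeline.
CLASS_CATEGORIES = {
    "LlamaForCausalLM": "LLM",
    "WhisperForConditionalGeneration": "Speech",
    "FluxPipeline": "Diffusion",
}

def model_class(config_json: str) -> str:
    arch = json.loads(config_json).get("architectures", ["CustomPipeline"])[0]
    return CLASS_CATEGORIES.get(arch, "CustomPipeline")

cfg = '{"architectures": ["LlamaForCausalLM"], "hidden_size": 4096}'
print(model_class(cfg))  # LLM
```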

Optimising Infrastructure

This section determines on which GPU the model will be deployed.
Optimising infrastructure section with Shakti Studio Cloud
Shakti Studio provides a fully managed environment where infrastructure is provisioned automatically. Accelerator options:
  • L40S – Recommended for most production speech workloads
  • H100 – Best for high-throughput or low-latency deployments
Pick an accelerator that matches your model size and latency requirements.
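The "match your model size and latency requirements" advice can be made concrete with a rough heuristic. The GPU memory figures below are the cards' published capacities, but the decision thresholds are illustrative assumptions, not platform defaults:

```python
# Rough accelerator heuristic: prefer H100 when latency is critical or the
# model exceeds L40S memory. Thresholds are illustrative assumptions.
GPU_MEMORY_GB = {"L40S": 48, "H100": 80}

def pick_accelerator(model_gb: float, low_latency: bool = False) -> str:
    if low_latency or model_gb > GPU_MEMORY_GB["L40S"]:
        return "H100"
    return "L40S"

print(pick_accelerator(16.0))                    # L40S
print(pick_accelerator(60.0))                    # H100
print(pick_accelerator(16.0, low_latency=True))  # H100
```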

Model-Specific Configuration

After Model Details and Optimising Infrastructure, the remaining configuration depends on the type of model you are adding. Different model types require additional or different settings:
  • LLMs (Chat, Completion, Embedding)
  • Diffusion Models
  • Speech (Whisper) Models

LLMs (Chat, Completion, Embedding)

By default, the platform automatically selects the most suitable compilation settings for LLMs based on the model architecture.

Backend Selection

Controls the inference backend used to serve the model. LLMs support multiple optimized backends.
  • Auto – Platform selects the optimal backend automatically
  • Latest – Recommended unless a specific version is required
For details, see the LLM Optimization Guide.
Backend selection options: Auto and Latest

Quantization

Precision Mode (Required) controls how model weights are stored and computed. Choose the mode that balances accuracy, memory use, and throughput for your workload. Available precision modes:
  • FLOAT16 – Half precision
  • FP8 – 8-bit floating point
  • INT8 – 8-bit integer
  • INT4 – 4-bit integer
  • BFLOAT16 – Brain floating point 16-bit
  • FLOAT32 – Full precision
  • AWQ – Activation-aware Weight Quantization
  • MXFP4 – Microscaling 4-bit floating point
Lower precision (e.g. INT4, FP8) reduces memory and can increase throughput; higher precision (e.g. FLOAT32, BFLOAT16) preserves accuracy. Choose based on your latency and quality requirements.
LLM configuration form showing quantization (Precision Mode), parallelism, and pipeline task options
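The memory trade-off between precision modes is easy to estimate from bits per parameter. The figures below cover weights only; KV cache and activations add overhead on top, so treat them as lower bounds:

```python
# Approximate weight-memory footprint per precision mode (weights only;
# KV cache and activation memory are extra).
BITS_PER_PARAM = {
    "FLOAT32": 32, "BFLOAT16": 16, "FLOAT16": 16,
    "FP8": 8, "INT8": 8, "INT4": 4, "AWQ": 4, "MXFP4": 4,
}

def weight_memory_gb(num_params: float, mode: str) -> float:
    return num_params * BITS_PER_PARAM[mode] / 8 / 1e9

for mode in ("FLOAT16", "FP8", "INT4"):
    print(f"8B model @ {mode}: ~{weight_memory_gb(8e9, mode):.0f} GB")
```

For an 8B-parameter model this gives roughly 16 GB at FLOAT16, 8 GB at FP8, and 4 GB at INT4, which is why lower precision often unlocks smaller (or fewer) GPUs.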

Parallelism

LLMs support:
  • Data Parallelism – Replicates the full model across multiple GPUs; each GPU handles separate requests. Best for throughput and concurrent traffic.
  • Pipeline Parallelism – Splits model layers across GPUs. Use when the model is too large to fit on a single GPU.
  • Expert Parallelism – For Mixture-of-Experts (MoE) models; distributes experts across GPUs. Improves scalability and efficiency for MoE architectures.
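The choice between these modes usually follows from whether the model fits on one GPU and whether it is an MoE. A minimal decision helper, with memory figures as assumptions rather than platform defaults:

```python
# Illustrative decision helper for the parallelism modes above.
def choose_parallelism(model_gb: float, gpu_gb: float, is_moe: bool = False) -> str:
    if is_moe:
        return "Expert Parallelism"     # distribute experts across GPUs
    if model_gb > gpu_gb:
        return "Pipeline Parallelism"   # model cannot fit on one GPU
    return "Data Parallelism"           # replicate for throughput

print(choose_parallelism(16, 80))               # Data Parallelism
print(choose_parallelism(140, 80))              # Pipeline Parallelism
print(choose_parallelism(90, 80, is_moe=True))  # Expert Parallelism
```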

Pipeline Task

Defines the task the model is used for.
  • Chat – Conversational models
  • Completion – Text generation
  • Embedding – Vector generation models
This determines request/response formatting and runtime behaviour.
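The effect on request formatting can be illustrated with the OpenAI-compatible shapes many serving stacks use; the platform's exact schema may differ, so treat these bodies as typical examples rather than its documented API:

```python
import json

# Typical request body per pipeline task, in the OpenAI-compatible style
# common across LLM serving stacks (illustrative, not the platform schema).
request_bodies = {
    "Chat": {"messages": [{"role": "user", "content": "Hello"}]},
    "Completion": {"prompt": "Once upon a time"},
    "Embedding": {"input": ["vectorise this sentence"]},
}

for task, body in request_bodies.items():
    print(task, json.dumps(body))
```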

Speculative Decoding (Optional)

Improves latency by drafting candidate tokens with a fast draft strategy and verifying them with the main model.
Recommended: On for chat and completion workloads.
Speculative decoding toggle and extra params

Extra Params

Advanced backend-specific configuration in JSON format. Leave empty unless you have custom tuning needs.
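If you do supply Extra Params, the value must be a well-formed JSON object. The keys below are hypothetical backend tuning knobs used purely to show the shape, not documented platform parameters:

```python
import json

# Example Extra Params payload. The keys are hypothetical backend tuning
# knobs for illustration only; leave the field empty in most cases.
extra_params = {
    "max_num_seqs": 64,
    "gpu_memory_utilization": 0.90,
}

print(json.dumps(extra_params, indent=2))
```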

LoRA Configuration (Optional)

LoRA (Low-Rank Adaptation) allows loading fine-tuned adapters on top of base models.
  • Enable LoRA – Toggle to enable or disable LoRA.
  • LoRA Config Method:
    • Via LoRA List – Use pre-registered LoRA adapters
    • Via LoRA Repo – Load directly from a repository
Use Add LoRA Path to attach multiple adapters if needed.
LoRA configuration with enable toggle and config method
When adding a LoRA path:
  • Source (Required) – Where the LoRA weights are stored (e.g. AWS (IAM Credentials))
  • Secret (Required) – Credentials used to access the LoRA source
  • Path (Required) – Path to the LoRA adapter location
LoRA path configuration with source, secret, and path
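Since all three fields are required for every adapter, it can help to validate a batch of entries before submitting them. The entry values below are illustrative; `my-aws-secret` is a hypothetical secret name:

```python
# Each LoRA path entry needs a source, a linked secret, and a path; this
# check mirrors the required fields (entry values are illustrative).
REQUIRED = ("source", "secret", "path")

def validate_lora_paths(entries):
    for entry in entries:
        missing = [k for k in REQUIRED if not entry.get(k)]
        if missing:
            raise ValueError(f"LoRA entry missing: {missing}")
    return True

adapters = [
    {"source": "AWS (IAM Credentials)",
     "secret": "my-aws-secret",           # hypothetical secret name
     "path": "s3://my-bucket/lora-a/"},
]
print(validate_lora_paths(adapters))  # True
```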

Diffusion Models

Parallelism

Diffusion models support:
  • Context Parallelism – Splits the input context or latent representation across multiple GPUs. Useful for high-resolution image generation and memory-intensive models.
  • Fully Shared Data Parallelism – Replicates the model across GPUs; each GPU handles separate requests. Useful for high-throughput production and concurrent image generation.

DiT Optimization

(Applies only to diffusion models.)
Attention Backend – Selects the attention implementation used during inference.
  • Flash – Optimized attention for better performance
  • Torch – Standard PyTorch attention
  • Auto – Automatically selects the best option
Recommended: Auto

Additional Optimisation Settings

  • Enable Attention Caching – Caches attention states to reduce repeated computation and improve speed.
  • Cache Threshold – Controls when caching is applied. Default: 0.25. Higher values improve inference speed but may reduce output quality.
  • Enable Compilation – Compiles the model graph for faster inference.
    • Fullgraph – Compiles the entire model for maximum performance
    • Dynamic – Supports variable input shapes
Recommended: Enable Fullgraph for stable production workloads.
Diffusion model optimisation settings including attention caching and compilation

ASR (Speech) Models

ASR model configuration with optional pipeline add-ons

Optional Pipeline Add-ons

Voice Activity Detection (VAD) Model
Detects speech segments and removes silence before transcription. VAD options:
  • Auto – Platform selects the best VAD
  • Silero – Lightweight and fast VAD model
  • Frame – Frame-based detection
Recommended: Auto or Silero
Diarization Model (Optional)
Separates and labels different speakers in the audio.
Enable this if you need speaker-wise transcripts.