LLM Inference API Guide

The UT HPC Center provides an OpenAI-compatible LLM API with access to chat, transcription, and embedding models.

Base URL: https://llm.hpc.ut.ee/v1

Getting Access

API Key Required

Contact support@hpc.ut.ee to request a personal API key. Keys are used for authentication and usage tracking, so keep yours confidential.
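
One way to keep the key out of source code is to read it from an environment variable. The sketch below is only an illustration: the helper name `auth_headers` is hypothetical, and the variable name `OPENAI_API_KEY` is borrowed from the convention used by OpenAI-compatible tools.

```python
import os

BASE_URL = "https://llm.hpc.ut.ee/v1"


def auth_headers(env_var: str = "OPENAI_API_KEY") -> dict[str, str]:
    """Build the Authorization header from an environment variable.

    The helper name and the default variable name are illustrative choices,
    not something the service requires.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set {env_var} to your personal API key first.")
    return {"Authorization": f"Bearer {api_key}"}
```

The returned dictionary can be passed as the `headers` argument of an HTTP client such as `requests`.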

Available Models

Chat & Reasoning

| Model name | Base model | Best for |
| --- | --- | --- |
| qwen3.5-122b | Qwen3.5 122B MoE (10B active, INT4) | Complex reasoning, math, coding, analysis (thinking enabled) |
| qwen3.5-122b-nonthinking | Qwen3.5 122B MoE (10B active, INT4) | Fast responses, Q&A, translation, summarization (thinking disabled) |
| gemma-4-31B-it | Gemma 4 31B IT (dense) | General-purpose tasks, instruction following, multilingual |

Transcription

| Model name | Base model | Best for |
| --- | --- | --- |
| whisper-large-v3 | OpenAI Whisper Large V3 (1.55B) | Audio transcription, multilingual speech-to-text |

Embeddings

| Model name | Base model | Dimensions | Best for |
| --- | --- | --- | --- |
| qwen3-embedding-8B | Qwen3-Embedding-8B (dense) | 4096 | Text retrieval, semantic search, RAG pipelines, multilingual classification |

Examples

Chat Completion

curl https://llm.hpc.ut.ee/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-122b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain the CAP theorem in two sentences."}
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 1024
  }'

With the OpenAI Python SDK (install it first if you haven't already):

pip install openai

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.hpc.ut.ee/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-122b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the CAP theorem in two sentences."},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=1024,
)

print(response.choices[0].message.content)

With the requests library:

import requests

response = requests.post(
    "https://llm.hpc.ut.ee/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "qwen3.5-122b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain the CAP theorem in two sentences."},
        ],
        "temperature": 0.7,
        "top_p": 0.8,
        "max_tokens": 1024,
    },
)

response.raise_for_status()  # surface HTTP errors (for example 401 or 429) before parsing
print(response.json()["choices"][0]["message"]["content"])

Streaming

Streaming returns tokens incrementally as they are generated, rather than waiting for the full response.

curl https://llm.hpc.ut.ee/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-122b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Write a short poem about distributed systems."}
    ],
    "stream": true,
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 1024
  }'

With the OpenAI Python SDK:

import httpx
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.hpc.ut.ee/v1",
    api_key="YOUR_API_KEY",
    timeout=httpx.Timeout(300.0, connect=10.0),  # generous read timeout: long streams can take minutes
)

stream = client.chat.completions.create(
    model="qwen3.5-122b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short poem about distributed systems."},
    ],
    stream=True,
    temperature=0.7,
    top_p=0.8,
    max_tokens=1024,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
Note: a generous read timeout is recommended for streaming, because long responses may take several minutes to complete.
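
When the stream is consumed without the SDK, the response arrives as server-sent events. The sketch below assumes the server follows the OpenAI streaming wire format, with one `data: {...}` JSON line per chunk and a final `data: [DONE]` line. The helper name `iter_stream_text` is hypothetical, and the sample lines are hand-written for illustration rather than captured from the service.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from the 'data:' lines of a streamed chat completion."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data event
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # assumed end-of-stream sentinel (OpenAI convention)
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


# Hand-written sample events, for illustration only.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # Hello, world
```

With `requests`, the lines could come from `requests.post(..., stream=True).iter_lines(decode_unicode=True)`.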

Image Understanding

Pass an image URL alongside your text prompt. Supported by the qwen3.5-122b model.

curl https://llm.hpc.ut.ee/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-122b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {
            "type": "image_url",
            "image_url": {
              "url": "https://hpc.ut.ee/assets/images/server_room_3.png"
            }
          }
        ]
      }
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 1024
  }'

With the OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.hpc.ut.ee/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-122b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://hpc.ut.ee/assets/images/server_room_3.png"
                    },
                },
            ],
        },
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=1024,
)

print(response.choices[0].message.content)
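
For an image that is not reachable by URL, the OpenAI API convention is to embed it as a base64 `data:` URL in the same `image_url` field. Whether this service accepts data URLs is an assumption here, not something the guide states. The helper names `image_to_data_url` and `image_message` are hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognised image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(prompt: str, path: str) -> dict:
    """Build a user message with a text part and an embedded local image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_to_data_url(path)}},
        ],
    }
```

The resulting dictionary can take the place of the user message in the `messages` list shown above.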

Transcription

Convert audio files to text using whisper-large-v3. Use the language parameter to improve accuracy for a known language.

curl https://llm.hpc.ut.ee/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "model=whisper-large-v3" \
  -F "language=et" \
  -F "file=@/path/to/clip.mp3"

With the OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.hpc.ut.ee/v1",
    api_key="YOUR_API_KEY",
)

with open("/path/to/clip.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        language="et",  # optional ISO-639-1 code; omit to auto-detect the language
    )

print(transcription.text)
Note: the language parameter is optional. Providing the language as an ISO-639-1 code improves speed and accuracy; omit it to enable automatic language detection.

Embeddings

Generate vector representations of text for use in semantic search, RAG pipelines, or classification.

curl https://llm.hpc.ut.ee/v1/embeddings \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-embedding-8B",
    "input": "The University of Tartu HPC cluster provides GPU compute resources."
  }'

With the OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.hpc.ut.ee/v1",
    api_key="YOUR_API_KEY",
)

response = client.embeddings.create(
    model="qwen3-embedding-8B",
    input="The University of Tartu HPC cluster provides GPU compute resources.",
)

vector = response.data[0].embedding  # 4096-dimensional float list
print(f"Embedding dimensions: {len(vector)}")
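
For semantic search, embedding vectors are commonly compared with cosine similarity. The sketch below uses plain Python and toy two-dimensional vectors in place of real 4096-dimensional embeddings. The function names are hypothetical, and cosine similarity is a common choice rather than something this guide prescribes.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same number of dimensions.")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec: Sequence[float],
                       doc_vecs: Sequence[Sequence[float]]) -> list[int]:
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, v) for v in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for real embeddings.
print(rank_by_similarity([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]))  # [1, 2, 0]
```

In practice, `query_vec` and each entry of `doc_vecs` would be a `response.data[i].embedding` list returned by the embeddings endpoint.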

Recommended Parameters

Tip

Use a lower temperature for factual, analytical, or structured tasks. Use a higher temperature for creative or generative ones.

Qwen3.5:

| Parameter | Recommended value | Description |
| --- | --- | --- |
| temperature | 0.7 | Controls randomness. Lower = more deterministic. |
| top_p | 0.8 | Nucleus sampling: sample from the top 80% probability mass. |
| presence_penalty | 1.5 | Penalises already-used tokens to reduce repetition. |
| max_tokens | 1024+ | Maximum tokens to generate. Up to 131,072. |

Gemma 4:

| Parameter | Recommended value | Description |
| --- | --- | --- |
| temperature | 1.0 | Default recommended by Google. |
| top_p | 0.95 | Nucleus sampling: sample from the top 95% probability mass. |
| top_k | 64 | Limits sampling to the top 64 tokens at each step. |
| max_tokens | 1024+ | Maximum tokens to generate. Up to 131,072. |
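
The recommended values can be kept in one place and merged into each request. The sketch below is built on an assumption: it pairs the first set of values (temperature 0.7, top_p 0.8, presence_penalty 1.5) with `qwen3.5-122b`, as in the examples above, and the second set ("Default recommended by Google") with `gemma-4-31B-it`. The names `RECOMMENDED_PARAMS` and `build_chat_payload` are hypothetical.

```python
# Assumed pairing of recommended values with model names (see lead-in).
RECOMMENDED_PARAMS = {
    "qwen3.5-122b": {"temperature": 0.7, "top_p": 0.8, "presence_penalty": 1.5},
    "gemma-4-31B-it": {"temperature": 1.0, "top_p": 0.95, "top_k": 64},
}


def build_chat_payload(model: str, messages: list, max_tokens: int = 1024,
                       **overrides) -> dict:
    """Build a /chat/completions JSON body using the recommended sampling values.

    Explicit keyword overrides win over the recommended defaults.
    """
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    payload.update(RECOMMENDED_PARAMS.get(model, {}))
    payload.update(overrides)
    return payload


payload = build_chat_payload(
    "gemma-4-31B-it",
    [{"role": "user", "content": "Summarise the CAP theorem."}],
    temperature=0.2,  # lower temperature for a factual task, as the tip suggests
)
print(payload["temperature"], payload["top_p"], payload["top_k"])  # 0.2 0.95 64
```

The dictionary can be sent as the `json` argument of `requests.post`. With the OpenAI Python SDK, fields that are not named arguments of `create()`, such as `top_k`, can be passed through `extra_body`.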

Rate Limits

HTTP 429 — Too Many Requests

Rate limits are enforced per API key. If you receive 429 responses, you have exceeded your quota. Contact support@hpc.ut.ee to review or increase your limits.
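
Occasional 429 responses can often be absorbed on the client side by retrying with exponential backoff before asking for a higher limit. The sketch below is a generic pattern, not a documented feature of this service. The helper name `call_with_backoff`, the attempt count, and the delays are all assumptions. `send` is any zero-argument callable that returns an object with a `status_code` attribute, as `requests.Response` has.

```python
import random
import time
from typing import Any, Callable


def call_with_backoff(send: Callable[[], Any], max_attempts: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call send() until it returns a non-429 response or attempts run out.

    Returns the last response, which may still be a 429 if every attempt
    was rate limited.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    response = None
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            # Double the wait on each retry and add jitter so clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    return response
```

A call might look like `call_with_backoff(lambda: requests.post(url, headers=headers, json=payload))`, where `url`, `headers`, and `payload` are defined as in the chat completion example.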

Compatibility

The API follows the OpenAI API specification for the endpoints shown in this guide, so OpenAI-compatible libraries and tools generally work out of the box: set the base URL to https://llm.hpc.ut.ee/v1 and provide your API key.

Compatible tools & frameworks

  • LangChain — use ChatOpenAI with base_url="https://llm.hpc.ut.ee/v1"
  • LlamaIndex — use the OpenAI LLM class with a custom api_base
  • Continue — configure as an OpenAI-compatible provider in your IDE
  • Any tool supporting the OPENAI_BASE_URL and OPENAI_API_KEY environment variables