UT HPC LLM Inference API Guide

The UT HPC Center provides an OpenAI-compatible API serving multiple models. The endpoint is available at:

https://llm.hpc.ut.ee/v1

A web chat interface is also available at https://chat.hpc.ut.ee/ for interactive use without writing code.

Getting access

Contact the HPC Center admin to obtain a personal API key. Your key is used for authentication and usage tracking.

Available models

qwen3.5-122b
  Base model: Qwen3.5-122B-A10B (122B MoE, 10B active, INT4)
  Best for: complex reasoning, math, coding, and analysis. Thinking enabled.

qwen3.5-122b-nonthinking
  Base model: Qwen3.5-122B-A10B (122B MoE, 10B active, INT4)
  Best for: faster responses, simple Q&A, translation, and summarization. Thinking disabled.

gemma-4-31B-it
  Base model: Gemma 4 31B IT (31B dense)
  Best for: general-purpose tasks, instruction following, and multilingual use.

whisper-large-v3
  Base model: OpenAI's Whisper Large V3 (1.55B parameters)
  Best for: audio transcription.
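
To see which models are currently being served, you can query the models endpoint (assuming this deployment exposes the standard OpenAI /v1/models route, which most OpenAI-compatible servers do):

curl https://llm.hpc.ut.ee/v1/models \
  -H "Authorization: Bearer YOUR_API_KEY"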

Examples

curl

curl https://llm.hpc.ut.ee/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-122b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain the CAP theorem in two sentences."}
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 1024
  }'
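
The response is JSON; to print only the reply text, pipe the output through jq -r '.choices[0].message.content' (requires jq to be installed).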

curl (transcription)

curl https://llm.hpc.ut.ee/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "model=whisper-large-v3" \
  -F "language=et" \
  -F "file=@/path/to/clip.mp3"

Python (OpenAI package)

Install the SDK:

pip install openai

Basic request:

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.hpc.ut.ee/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-122b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the CAP theorem in two sentences."},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=1024,
)

print(response.choices[0].message.content)
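
The response object also carries token counts, handy for keeping an eye on your own consumption:

print(response.usage)  # prompt_tokens, completion_tokens, total_tokens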

Streaming:

import httpx
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.hpc.ut.ee/v1",
    api_key="YOUR_API_KEY",
    timeout=httpx.Timeout(300.0, connect=10.0),
)

stream = client.chat.completions.create(
    model="qwen3.5-122b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short poem about distributed systems."},
    ],
    stream=True,
    temperature=0.7,
    top_p=0.8,
    max_tokens=1024,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

Python (requests)

import requests

response = requests.post(
    "https://llm.hpc.ut.ee/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "qwen3.5-122b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain the CAP theorem in two sentences."},
        ],
        "temperature": 0.7,
        "top_p": 0.8,
        "max_tokens": 1024,
    },
    timeout=300,  # generation can take a while; avoid hanging indefinitely
)
response.raise_for_status()

print(response.json()["choices"][0]["message"]["content"])
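
Transcription also works through the OpenAI SDK, mirroring the curl example above:

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.hpc.ut.ee/v1",
    api_key="YOUR_API_KEY",
)

# Open the audio file in binary mode and send it to the Whisper model.
with open("/path/to/clip.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio,
        language="et",
    )

print(transcript.text)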

Recommended parameters: Qwen3.5-122B

temperature: 0.7
  Controls randomness; lower values are more deterministic.
top_p: 0.8
  Nucleus sampling; considers tokens within the top 80% of probability mass.
max_tokens: 1024 or higher
  Maximum number of tokens to generate; increase for longer responses (maximum 131072).
presence_penalty: 1.5
  Reduces repetition by penalizing tokens that have already appeared.
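
For example, with the OpenAI SDK client from the Python examples above (presence_penalty is a standard Chat Completions parameter):

response = client.chat.completions.create(
    model="qwen3.5-122b",
    messages=[{"role": "user", "content": "Summarize the CAP theorem."}],
    temperature=0.7,
    top_p=0.8,
    max_tokens=1024,
    presence_penalty=1.5,  # recommended here to curb repetition
)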

Recommended parameters: Gemma 4 31B IT

temperature: 1.0
  Default recommended by Google.
top_p: 0.95
  Nucleus sampling; considers tokens within the top 95% of probability mass.
top_k: 64
  Limits sampling to the 64 most likely tokens.
max_tokens: 1024 or higher
  Maximum number of tokens to generate; increase for longer responses (maximum 131072).
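
Note that top_k is not part of the official OpenAI request schema, so the OpenAI Python SDK does not accept it as a keyword argument. It can be passed through via the SDK's extra_body parameter; whether this deployment honors it is an assumption (vLLM-style servers typically do):

response = client.chat.completions.create(
    model="gemma-4-31B-it",
    messages=[{"role": "user", "content": "Name three Estonian islands."}],
    temperature=1.0,
    top_p=0.95,
    max_tokens=1024,
    # top_k is non-standard; extra_body forwards it to the server,
    # which may or may not honor it.
    extra_body={"top_k": 64},
)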

Adjust these defaults to your use case: lower temperature for factual tasks, higher for creative ones.

Rate limits

Rate limits are configured per API key. If you receive HTTP 429 responses, you are being rate-limited. Contact the admin to adjust your limits.
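
A simple client-side mitigation is to retry with exponential backoff. A minimal sketch using requests (whether the server sends a Retry-After header is an assumption; the fallback backoff works either way):

import time
import requests

def post_with_retry(url, headers, payload, max_retries=5):
    """POST, retrying on HTTP 429 with exponential backoff."""
    for attempt in range(max_retries):
        response = requests.post(url, headers=headers, json=payload, timeout=300)
        if response.status_code != 429:
            return response
        # Honor the server's Retry-After hint if present, else back off exponentially.
        time.sleep(float(response.headers.get("Retry-After", 2 ** attempt)))
    return response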

Compatibility

The API follows the OpenAI Chat Completions API format. Any OpenAI-compatible client library or tool should work: set the base URL to https://llm.hpc.ut.ee/v1 and use your API key.
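
The official OpenAI SDKs (and many third-party tools) pick these up from the standard environment variables, so often no code changes are needed at all:

export OPENAI_BASE_URL="https://llm.hpc.ut.ee/v1"
export OPENAI_API_KEY="YOUR_API_KEY"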