UT HPC LLM Inference API Guide

The UT HPC Center provides an OpenAI-compatible API serving multiple models. The endpoint is available at:

https://llm.hpc.ut.ee/v1

A web chat interface is also available at https://chat.hpc.ut.ee/ for interactive use without writing code.

Getting access

Contact the HPC Center admin to obtain a personal API key. Your key is used for authentication and usage tracking.

Available models

qwen3.5-122b
  Base model: Qwen3.5-122B-A10B (122B MoE, 10B active, INT4)
  Best for: complex reasoning, math, coding, and analysis. Thinking enabled.

qwen3.5-122b-nonthinking
  Base model: Qwen3.5-122B-A10B (122B MoE, 10B active, INT4)
  Best for: faster responses, simple Q&A, translation, and summarization. Thinking disabled.

gemma-4-31B-it
  Base model: Gemma 4 31B IT (31B dense)
  Best for: general-purpose tasks, instruction following, and multilingual use.

whisper-large-v3
  Base model: OpenAI's Whisper Large V3 (1.55B parameters)
  Best for: audio transcription.
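
To see which models are currently being served, you can query the models endpoint (assuming this deployment exposes the standard OpenAI /v1/models route, which most OpenAI-compatible servers do):

curl https://llm.hpc.ut.ee/v1/models \
  -H "Authorization: Bearer YOUR_API_KEY"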

Examples

curl

curl https://llm.hpc.ut.ee/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-122b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain the CAP theorem in two sentences."}
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 1024
  }'
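
The response is JSON; to print only the reply text, pipe the output through jq -r '.choices[0].message.content' (requires jq to be installed).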

curl (transcription)

curl https://llm.hpc.ut.ee/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "model=whisper-large-v3" \
  -F "language=et" \
  -F "file=@/path/to/clip.mp3"

Python (OpenAI package)

Install the SDK:

pip install openai

Basic request:

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.hpc.ut.ee/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-122b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the CAP theorem in two sentences."},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=1024,
)

print(response.choices[0].message.content)
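
The response object also carries token counts, handy for keeping an eye on your own consumption:

print(response.usage)  # prompt_tokens, completion_tokens, total_tokens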

Streaming:

import httpx
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.hpc.ut.ee/v1",
    api_key="YOUR_API_KEY",
    timeout=httpx.Timeout(300.0, connect=10.0),
)

stream = client.chat.completions.create(
    model="qwen3.5-122b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short poem about distributed systems."},
    ],
    stream=True,
    temperature=0.7,
    top_p=0.8,
    max_tokens=1024,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

Python (requests)

import requests

response = requests.post(
    "https://llm.hpc.ut.ee/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "qwen3.5-122b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain the CAP theorem in two sentences."},
        ],
        "temperature": 0.7,
        "top_p": 0.8,
        "max_tokens": 1024,
    },
    timeout=300,  # generation can take a while; avoid hanging indefinitely
)
response.raise_for_status()

print(response.json()["choices"][0]["message"]["content"])
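
Transcription also works through the OpenAI SDK, mirroring the curl example above:

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.hpc.ut.ee/v1",
    api_key="YOUR_API_KEY",
)

# Open the audio file in binary mode and send it to the Whisper model.
with open("/path/to/clip.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio,
        language="et",
    )

print(transcript.text)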

Recommended parameters: Qwen3.5-122B

temperature: 0.7
  Controls randomness; lower values are more deterministic.
top_p: 0.8
  Nucleus sampling; considers tokens within the top 80% of probability mass.
max_tokens: 1024 or higher
  Maximum number of tokens to generate; increase for longer responses (maximum 131072).
presence_penalty: 1.5
  Reduces repetition by penalizing tokens that have already appeared.
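
For example, with the OpenAI SDK client from the Python examples above (presence_penalty is a standard Chat Completions parameter):

response = client.chat.completions.create(
    model="qwen3.5-122b",
    messages=[{"role": "user", "content": "Summarize the CAP theorem."}],
    temperature=0.7,
    top_p=0.8,
    max_tokens=1024,
    presence_penalty=1.5,  # recommended here to curb repetition
)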

Recommended parameters: Gemma 4 31B IT

temperature: 1.0
  Default recommended by Google.
top_p: 0.95
  Nucleus sampling; considers tokens within the top 95% of probability mass.
top_k: 64
  Limits sampling to the 64 most likely tokens.
max_tokens: 1024 or higher
  Maximum number of tokens to generate; increase for longer responses (maximum 131072).
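
Note that top_k is not part of the official OpenAI request schema, so the OpenAI Python SDK does not accept it as a keyword argument. It can be passed through via the SDK's extra_body parameter; whether this deployment honors it is an assumption (vLLM-style servers typically do):

response = client.chat.completions.create(
    model="gemma-4-31B-it",
    messages=[{"role": "user", "content": "Name three Estonian islands."}],
    temperature=1.0,
    top_p=0.95,
    max_tokens=1024,
    # top_k is non-standard; extra_body forwards it to the server,
    # which may or may not honor it.
    extra_body={"top_k": 64},
)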

Adjust these defaults to your use case: lower temperature for factual tasks, higher for creative ones.

Rate limits

Rate limits are configured per API key. If you receive HTTP 429 responses, you are being rate-limited. Contact the admin to adjust your limits.
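
A simple client-side mitigation is to retry with exponential backoff. A minimal sketch using requests (whether the server sends a Retry-After header is an assumption; the fallback backoff works either way):

import time
import requests

def post_with_retry(url, headers, payload, max_retries=5):
    """POST, retrying on HTTP 429 with exponential backoff."""
    for attempt in range(max_retries):
        response = requests.post(url, headers=headers, json=payload, timeout=300)
        if response.status_code != 429:
            return response
        # Honor the server's Retry-After hint if present, else back off exponentially.
        time.sleep(float(response.headers.get("Retry-After", 2 ** attempt)))
    return response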

Compatibility

The API follows the OpenAI Chat Completions API format. Any OpenAI-compatible client library or tool should work: set the base URL to https://llm.hpc.ut.ee/v1 and use your API key.
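
The official OpenAI SDKs (and many third-party tools) pick these up from the standard environment variables, so often no code changes are needed at all:

export OPENAI_BASE_URL="https://llm.hpc.ut.ee/v1"
export OPENAI_API_KEY="YOUR_API_KEY"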