# UT HPC LLM Inference API Guide
The UT HPC Center provides an OpenAI-compatible API serving multiple models. The endpoint is available at:

```
https://llm.hpc.ut.ee/v1
```

A web chat interface is also available at https://chat.hpc.ut.ee/ for interactive use without writing code.
## Getting access
Contact the HPC Center admin to obtain a personal API key. Your key is used for authentication and usage tracking.
## Available models

| Model name | Base model | Best for |
|---|---|---|
| `qwen3.5-122b` | Qwen3.5-122B-A10B (122B MoE, 10B active, INT4) | Complex reasoning, math, coding, analysis. Thinking enabled. |
| `qwen3.5-122b-nonthinking` | Qwen3.5-122B-A10B (122B MoE, 10B active, INT4) | Faster responses, simple Q&A, translation, summarization. Thinking disabled. |
| `gemma-4-31B-it` | Gemma 4 31B IT (31B dense) | General-purpose tasks, instruction following, multilingual |
| `whisper-large-v3` | OpenAI Whisper Large V3 (1.55B parameters) | Audio transcription |
## Examples

### curl
```shell
curl https://llm.hpc.ut.ee/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-122b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain the CAP theorem in two sentences."}
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 1024
  }'
```
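The response comes back as JSON, with the assistant reply nested under `choices[0].message.content`. One way to pull it out on the command line, shown here against a sample payload standing in for the real API response (the real response contains more fields):

```shell
# Sample Chat Completions payload (stand-in for the actual API response):
payload='{"choices":[{"message":{"content":"Hello from the model"}}]}'

# Pipe the JSON through Python to extract just the assistant reply.
# In practice, pipe the curl output instead of echoing a sample.
echo "$payload" | python3 -c "import sys, json; print(json.load(sys.stdin)['choices'][0]['message']['content'])"
```

If you have `jq` installed, `... | jq -r '.choices[0].message.content'` does the same thing more tersely.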
### curl (transcription)

```shell
curl https://llm.hpc.ut.ee/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "model=whisper-large-v3" \
  -F "language=et" \
  -F "file=@/path/to/clip.mp3"
```
### Python (OpenAI package)

Install the SDK:

```shell
pip install openai
```
Basic request:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.hpc.ut.ee/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-122b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the CAP theorem in two sentences."},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```
Streaming:

```python
import httpx
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.hpc.ut.ee/v1",
    api_key="YOUR_API_KEY",
    # Long generations can exceed the SDK's default timeout; allow 300 s total.
    timeout=httpx.Timeout(300.0, connect=10.0),
)

stream = client.chat.completions.create(
    model="qwen3.5-122b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short poem about distributed systems."},
    ],
    stream=True,
    temperature=0.7,
    top_p=0.8,
    max_tokens=1024,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```
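If you need the full text after streaming (e.g. to log or post-process it), collect the deltas instead of printing them. The helper below is a sketch: it iterates chunks the same way as the loop above, and is demonstrated here with hand-built stand-in objects rather than a live stream.

```python
from types import SimpleNamespace

def collect_stream(stream) -> str:
    """Join the incremental delta.content pieces of a streamed response."""
    parts = []
    for chunk in stream:
        # Some chunks carry no content (e.g. role-only or final chunks); skip them.
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts)

# Stand-in chunks mimicking the shape of the SDK's streamed objects:
fake = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="Hello, "))]),
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=None))]),
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="world"))]),
]
print(collect_stream(fake))  # Hello, world
```

With a real stream you would call `collect_stream(stream)` directly on the object returned by `client.chat.completions.create(..., stream=True)`.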
### Python (requests)

```python
import requests

response = requests.post(
    "https://llm.hpc.ut.ee/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "qwen3.5-122b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain the CAP theorem in two sentences."},
        ],
        "temperature": 0.7,
        "top_p": 0.8,
        "max_tokens": 1024,
    },
    timeout=300,  # requests has no default timeout; long generations need a generous one
)
print(response.json()["choices"][0]["message"]["content"])
```
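The one-liner above assumes a successful response. When the server returns an error (bad key, unknown model, rate limit), the body has a different shape and the chained lookups fail with an unhelpful `KeyError`. A small hypothetical helper, `extract_reply`, makes the failure modes explicit; it is a sketch, not part of any SDK:

```python
def extract_reply(payload: dict) -> str:
    """Pull the assistant message out of a Chat Completions JSON body,
    raising a clear error when the shape is unexpected (e.g. an error response)."""
    if "error" in payload:
        raise RuntimeError(f"API error: {payload['error']}")
    try:
        return payload["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError) as exc:
        raise RuntimeError(f"Unexpected response shape: {payload}") from exc
```

Use it as `print(extract_reply(response.json()))`; checking `response.status_code` (or calling `response.raise_for_status()`) before parsing is also a good habit.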
## Recommended parameters

### Qwen3.5-122B

| Parameter | Value | Description |
|---|---|---|
| `temperature` | 0.7 | Controls randomness. Lower values are more deterministic. |
| `top_p` | 0.8 | Nucleus sampling: considers tokens within the top 80% probability mass. |
| `max_tokens` | 1024+ | Maximum tokens to generate. Increase for longer responses (max 131072). |
| `presence_penalty` | 1.5 | Reduces repetition by penalizing already-used tokens. |
### Gemma 4 31B IT

| Parameter | Value | Description |
|---|---|---|
| `temperature` | 1.0 | Default recommended by Google. |
| `top_p` | 0.95 | Nucleus sampling: considers tokens within the top 95% probability mass. |
| `top_k` | 64 | Limits sampling to the top 64 tokens. |
| `max_tokens` | 1024+ | Maximum tokens to generate. Increase for longer responses (max 131072). |
Adjust based on your use case — lower temperature for factual tasks, higher for creative ones.
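One convenient pattern is to keep these per-model defaults in a dict and merge in per-call overrides. The names below come from the tables above; the `params_for` helper itself is a sketch, not part of this service:

```python
# Recommended sampling defaults from the tables above.
RECOMMENDED_PARAMS = {
    "qwen3.5-122b": {"temperature": 0.7, "top_p": 0.8, "presence_penalty": 1.5, "max_tokens": 1024},
    "qwen3.5-122b-nonthinking": {"temperature": 0.7, "top_p": 0.8, "presence_penalty": 1.5, "max_tokens": 1024},
    # Note: top_k is not a standard Chat Completions parameter; with the OpenAI
    # SDK it may need to be passed via extra_body rather than as a keyword.
    "gemma-4-31B-it": {"temperature": 1.0, "top_p": 0.95, "max_tokens": 1024},
}

def params_for(model: str, **overrides) -> dict:
    """Merge a model's recommended defaults with caller overrides (overrides win)."""
    return {**RECOMMENDED_PARAMS.get(model, {}), **overrides}
```

For example, `client.chat.completions.create(model="qwen3.5-122b", messages=msgs, **params_for("qwen3.5-122b", temperature=0.2))` would use the defaults but lower the temperature for a factual task.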
## Rate limits
Rate limits are configured per API key. If you receive HTTP 429 responses, you are being rate-limited. Contact the admin to adjust your limits.
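For transient 429s, retrying with exponential backoff is usually enough. A minimal sketch, assuming (as with the OpenAI SDK's `RateLimitError`) that the raised exception carries a `status_code` attribute; the demo uses a fake callable rather than a live request:

```python
import time

def with_retries(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `call()`, retrying with exponential backoff on rate-limit (429) errors.
    Any other error, or exhausting max_attempts, re-raises."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            if getattr(exc, "status_code", None) != 429 or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 1 s, 2 s, 4 s, ...

# Demo with a stand-in that is rate-limited twice, then succeeds:
class RateLimited(Exception):
    status_code = 429

attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RateLimited()
    return "ok"

print(with_retries(flaky, sleep=lambda s: None))  # ok
```

In real use, `call` would be a zero-argument lambda wrapping your `client.chat.completions.create(...)` call.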
## Compatibility
The API follows the OpenAI Chat Completions API format. Any OpenAI-compatible client library or tool should work — just set the base URL to https://llm.hpc.ut.ee/v1 and use your API key.
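For tools you don't configure in code, the OpenAI Python SDK (and some other compatible clients) read the base URL and key from environment variables, so the following shell configuration may be all that's needed:

```shell
# Point OpenAI-compatible clients at the UT HPC endpoint.
export OPENAI_BASE_URL="https://llm.hpc.ut.ee/v1"
export OPENAI_API_KEY="YOUR_API_KEY"
```

Check each tool's documentation, as not every client honors these variables.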