LLM Inference API Guide
The UT HPC Center provides an OpenAI-compatible LLM API with access to chat, transcription, and embedding models.
Base URL: https://llm.hpc.ut.ee/v1
Getting Access
API Key Required
Contact support@hpc.ut.ee to request a personal API key. Keys are used for authentication and usage tracking — keep yours confidential and do not share it.
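One way to keep the key out of source code is to read it from an environment variable at startup. The sketch below assumes the key is exported as OPENAI_API_KEY (any variable name would work); the helper name load_api_key is hypothetical and not part of any SDK.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it in your shell instead of hard-coding the key"
        )
    return key
```

The returned value can then be passed wherever the examples on this page use YOUR_API_KEY.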
Available Models
Chat & Reasoning
| Model name | Base model | Best for |
|---|---|---|
| qwen3.5-122b | Qwen3.5 122B MoE (10B active, INT4) | Complex reasoning, math, coding, analysis (thinking enabled) |
| qwen3.5-122b-nonthinking | Qwen3.5 122B MoE (10B active, INT4) | Fast responses, Q&A, translation, summarization (thinking disabled) |
| gemma-4-31B-it | Gemma 4 31B IT (dense) | General-purpose tasks, instruction following, multilingual |
Transcription
| Model name | Base model | Best for |
|---|---|---|
| whisper-large-v3 | OpenAI Whisper Large V3 (1.55B) | Audio transcription, multilingual speech-to-text |
Embeddings
| Model name | Base model | Dimensions | Best for |
|---|---|---|---|
| qwen3-embedding-8B | Qwen3-Embedding-8B (dense) | 4096 | Text retrieval, semantic search, RAG pipelines, multilingual classification |
Examples
Chat Completion
curl https://llm.hpc.ut.ee/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-122b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain the CAP theorem in two sentences."}
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 1024
  }'
Install the SDK if you haven't already:
pip install openai
# Chat completion using the OpenAI Python SDK
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.hpc.ut.ee/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-122b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the CAP theorem in two sentences."},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=1024,
)
print(response.choices[0].message.content)

# The same request using the requests library instead of the SDK
import requests

response = requests.post(
    "https://llm.hpc.ut.ee/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "qwen3.5-122b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain the CAP theorem in two sentences."},
        ],
        "temperature": 0.7,
        "top_p": 0.8,
        "max_tokens": 1024,
    },
)
response.raise_for_status()  # surface HTTP errors (e.g. 401, 429) before parsing
print(response.json()["choices"][0]["message"]["content"])
Streaming
Streaming returns tokens incrementally as they are generated, rather than waiting for the full response.
curl https://llm.hpc.ut.ee/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-122b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Write a short poem about distributed systems."}
    ],
    "stream": true,
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 1024
  }'
import httpx
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.hpc.ut.ee/v1",
    api_key="YOUR_API_KEY",
    # 300 s overall timeout, 10 s to establish the connection
    timeout=httpx.Timeout(300.0, connect=10.0),
)

stream = client.chat.completions.create(
    model="qwen3.5-122b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short poem about distributed systems."},
    ],
    stream=True,
    temperature=0.7,
    top_p=0.8,
    max_tokens=1024,
)

# Print each text fragment as soon as it arrives
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

A generous read timeout is recommended for streaming, because long responses may take several minutes to complete.
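When streaming without the SDK (for example from the curl call above), the response arrives as server-sent events. The following is a minimal sketch of extracting the text, assuming the server follows the usual OpenAI streaming layout: lines prefixed with "data:" that carry JSON chunks, ending with "data: [DONE]". The function name iter_stream_text is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # assumed end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue  # some chunks carry no choices (e.g. usage-only chunks)
        text = (choices[0].get("delta") or {}).get("content")
        if text:
            yield text
```

The function accepts any iterable of decoded lines, so it can be fed from a saved response body or from a streaming HTTP client's line iterator.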
Image Understanding
Pass an image URL alongside your text prompt. Supported by the qwen3.5-122b model.
curl https://llm.hpc.ut.ee/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-122b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {
            "type": "image_url",
            "image_url": {
              "url": "https://hpc.ut.ee/assets/images/server_room_3.png"
            }
          }
        ]
      }
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 1024
  }'
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.hpc.ut.ee/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-122b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://hpc.ut.ee/assets/images/server_room_3.png"
                    },
                },
            ],
        },
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=1024,
)
print(response.choices[0].message.content)
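For images that are not reachable at a public URL, OpenAI-compatible servers commonly accept a base64 data URL in the same image_url field. Whether this endpoint accepts data URLs is not stated here, so treat the sketch below as an assumption to verify; both helper names are hypothetical.

```python
import base64
import mimetypes


def image_bytes_to_data_url(data: bytes, filename: str) -> str:
    """Encode raw image bytes as a data URL; the MIME type is guessed from the filename."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot determine an image MIME type for {filename!r}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_image_message(prompt: str, image_url: str) -> dict:
    """Build a user message with the same shape as the image examples above."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }
```

A local file would be read in binary mode, passed through image_bytes_to_data_url, and the resulting message appended to the messages list of a chat completion request.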
Transcription
Convert audio files to text using whisper-large-v3. Use the language parameter to improve accuracy for a known language.
curl https://llm.hpc.ut.ee/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "model=whisper-large-v3" \
  -F "language=et" \
  -F "file=@/path/to/clip.mp3"
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.hpc.ut.ee/v1",
    api_key="YOUR_API_KEY",
)

with open("/path/to/clip.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        language="et",  # optional ISO-639-1 code; omit for auto-detection
    )
print(transcription.text)

The language parameter is optional. Providing the language as an ISO-639-1 code improves speed and accuracy; omit it to enable automatic language detection.
Embeddings
Generate vector representations of text for use in semantic search, RAG pipelines, or classification.
curl https://llm.hpc.ut.ee/v1/embeddings \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-embedding-8B",
    "input": "The University of Tartu HPC cluster provides GPU compute resources."
  }'
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.hpc.ut.ee/v1",
    api_key="YOUR_API_KEY",
)

response = client.embeddings.create(
    model="qwen3-embedding-8B",
    input="The University of Tartu HPC cluster provides GPU compute resources.",
)
vector = response.data[0].embedding  # 4096-dimensional float list
print(f"Embedding dimensions: {len(vector)}")
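For semantic search, embeddings are typically compared with cosine similarity. The sketch below shows one plain-Python way to score and rank vectors; the function names are hypothetical, and a real pipeline would more likely use NumPy or a vector database.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], documents: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, most similar first."""
    scores = [(i, cosine_similarity(query, doc)) for i, doc in enumerate(documents)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Here query and each entry of documents would be embedding lists such as the vector returned above.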
Recommended Parameters
Tip
Use a lower temperature for factual, analytical, or structured tasks. Use a higher temperature for creative or generative ones.
Qwen3.5 models (qwen3.5-122b, qwen3.5-122b-nonthinking):
| Parameter | Recommended value | Description |
|---|---|---|
| temperature | 0.7 | Controls randomness. Lower = more deterministic. |
| top_p | 0.8 | Nucleus sampling: sample from the top 80% probability mass. |
| presence_penalty | 1.5 | Penalises already-used tokens to reduce repetition. |
| max_tokens | 1024+ | Maximum tokens to generate. Up to 131,072. |
gemma-4-31B-it:
| Parameter | Recommended value | Description |
|---|---|---|
| temperature | 1.0 | Default recommended by Google. |
| top_p | 0.95 | Nucleus sampling: sample from the top 95% probability mass. |
| top_k | 64 | Limits sampling to the top 64 tokens at each step. |
| max_tokens | 1024+ | Maximum tokens to generate. Up to 131,072. |
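Note that top_k is not a named argument of the OpenAI Python SDK's create() call; the SDK forwards non-standard fields through its extra_body argument. The sketch below collects the values from the two tables above into request arguments. It assumes the first table applies to the Qwen3.5 models and the second to gemma-4-31B-it, and that the server honours top_k when sent this way; the names RECOMMENDED, NON_STANDARD, and sampling_kwargs are hypothetical.

```python
# Values copied from the two tables above. Mapping the first table to the
# qwen3.5 models is an assumption; the second is the Google-recommended set.
RECOMMENDED = {
    "qwen3.5-122b": {"temperature": 0.7, "top_p": 0.8, "presence_penalty": 1.5},
    "qwen3.5-122b-nonthinking": {"temperature": 0.7, "top_p": 0.8, "presence_penalty": 1.5},
    "gemma-4-31B-it": {"temperature": 1.0, "top_p": 0.95, "top_k": 64},
}

# Sampling fields that must travel in extra_body rather than as named arguments.
NON_STANDARD = {"top_k"}


def sampling_kwargs(model: str, max_tokens: int = 1024) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    params = dict(RECOMMENDED[model])
    extra = {k: params.pop(k) for k in list(params) if k in NON_STANDARD}
    kwargs = {"model": model, "max_tokens": max_tokens, **params}
    if extra:
        kwargs["extra_body"] = extra  # merged into the JSON request body by the SDK
    return kwargs
```

A call would then look like client.chat.completions.create(messages=messages, **sampling_kwargs("gemma-4-31B-it")).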
Rate Limits
HTTP 429 — Too Many Requests
Rate limits are enforced per API key. If you receive 429 responses, you have exceeded your quota. Contact support@hpc.ut.ee to review or increase your limits.
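Occasional 429 responses can often be absorbed on the client side by retrying with exponential backoff. The following is a minimal sketch of that pattern, not a documented requirement of this service: the RateLimited exception and call_with_backoff helper are hypothetical, and the delay values are illustrative.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical marker: raise this when the server answers with HTTP 429."""


def call_with_backoff(
    request: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(), retrying on RateLimited with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            # base_delay, 2x, 4x, ... plus up to 25% random jitter
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.25))
    raise ValueError("max_attempts must be at least 1")
```

With the OpenAI Python SDK, the same loop could catch openai.RateLimitError instead of the hypothetical RateLimited; the SDK also has a built-in max_retries client option that may be sufficient on its own.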
Compatibility
The API is fully compatible with the OpenAI API specification. Any OpenAI-compatible library or tool works out of the box: set the base URL to https://llm.hpc.ut.ee/v1 and provide your API key.
Compatible tools & frameworks
- LangChain: use ChatOpenAI with base_url="https://llm.hpc.ut.ee/v1"
- LlamaIndex: use the OpenAI LLM class with a custom api_base
- Continue: configure as an OpenAI-compatible provider in your IDE
- Any tool supporting the OPENAI_BASE_URL and OPENAI_API_KEY environment variables