LLM Inference API Guide

The UT HPC Center provides an OpenAI-compatible LLM API with access to chat, transcription, and embedding models.

Base URL: https://llm.hpc.ut.ee/v1

Getting Access

API Key Required

Contact support@hpc.ut.ee to request a personal API key. Keys are used for authentication and usage tracking — keep yours confidential and do not share it.
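One way to keep the key out of source code is to read it from an environment variable at startup. The sketch below assumes the key is stored in `OPENAI_API_KEY`; the helper name `auth_headers` is illustrative and not part of any SDK.

```python
import os


def auth_headers(env_var: str = "OPENAI_API_KEY") -> dict:
    """Build the Authorization header from a key stored in the environment."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        # Fail early with a clear message instead of sending an unauthenticated request.
        raise RuntimeError(f"Set {env_var} to your personal API key first.")
    return {"Authorization": f"Bearer {api_key}"}
```

The returned dictionary can be passed as `headers=` to an HTTP client, so the key never appears in version control.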

Available Models

Chat & Reasoning

Model name	Base model	Best for	Input price (per 1M tokens)	Output price (per 1M tokens)
`qwen3.5-122b`	Qwen3.5 122B MoE (10B active, INT4)	Complex reasoning, math, coding, analysis (thinking enabled)	€0.26	€2.08
`qwen3.5-122b-nonthinking`	Qwen3.5 122B MoE (10B active, INT4)	Fast responses, Q&A, translation, summarization (thinking disabled)	€0.26	€2.08
`qwen3.6-35b-a3b`	Qwen3.6 35B MoE (3B active)	Fast reasoning: coding, math, structured problem solving, and technical questions	€0.15	€1.00
`gemma-4-31B-it`	Gemma 4 31B IT (dense)	General-purpose tasks, instruction following, multilingual	€0.12	€0.37
`gemma-4-26b-a4b`	Gemma 4 26B MoE (4B active)	General-purpose tasks, instruction following, general productivity with a speed/quality trade-off	€0.06	€0.33
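To turn these per-token prices into a budget figure, multiply each token count by its price and divide by one million. A minimal sketch using the prices from the table above; the function name `estimate_cost` is illustrative, and the token counts are whatever your own workload produces.

```python
from decimal import Decimal

# Prices in EUR per 1M tokens, copied from the table above: (input, output).
PRICES = {
    "qwen3.5-122b": (Decimal("0.26"), Decimal("2.08")),
    "qwen3.5-122b-nonthinking": (Decimal("0.26"), Decimal("2.08")),
    "qwen3.6-35b-a3b": (Decimal("0.15"), Decimal("1.00")),
    "gemma-4-31B-it": (Decimal("0.12"), Decimal("0.37")),
    "gemma-4-26b-a4b": (Decimal("0.06"), Decimal("0.33")),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Return the estimated cost in EUR for one request or a whole batch."""
    price_in, price_out = PRICES[model]
    return (price_in * input_tokens + price_out * output_tokens) / Decimal(1_000_000)


# 200,000 input tokens and 50,000 output tokens on qwen3.5-122b.
print(estimate_cost("qwen3.5-122b", 200_000, 50_000))  # 0.156
```

Output tokens cost several times more than input tokens on every model listed, so capping `max_tokens` is usually the most effective cost control.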

Transcription

Model name	Base model	Best for	Pricing
`whisper-large-v3`	OpenAI Whisper Large V3 (1.55B)	Audio transcription, multilingual speech-to-text	Input: €0.0001 / second

Embeddings

Model name	Base model	Dimensions	Best for	Pricing
`qwen3-embedding-8B`	Qwen3-Embedding-8B (dense)	4096	Text retrieval, semantic search, RAG pipelines, multilingual classification	Input: €0.01 / 1M tokens

Examples

Chat Completion

The same request is shown with curl, with the OpenAI Python SDK, and with the Python requests library, in that order.

curl https://llm.hpc.ut.ee/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-122b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain the CAP theorem in two sentences."}
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 1024
  }'

Install the SDK if you haven't already:

pip install openai

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.hpc.ut.ee/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-122b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the CAP theorem in two sentences."},
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=1024,
)

print(response.choices[0].message.content)

import requests

response = requests.post(
    "https://llm.hpc.ut.ee/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "qwen3.5-122b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain the CAP theorem in two sentences."},
        ],
        "temperature": 0.7,
        "top_p": 0.8,
        "max_tokens": 1024,
    },
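    timeout=300,  # requests has no default timeout; thinking models can be slow
)
response.raise_for_status()  # fail loudly on 401/429/5xx before parsing the body

print(response.json()["choices"][0]["message"]["content"])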
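Besides the reply text, the response body carries metadata that is useful for logging and cost tracking. The sketch below assumes the server returns the standard OpenAI `usage` object and `finish_reason` field; the sample payload is hand-written for illustration and is not real server output.

```python
def summarize_completion(body: dict) -> dict:
    """Extract the reply text and token counts from a chat completion JSON body."""
    choice = body["choices"][0]
    usage = body.get("usage") or {}  # treat a missing usage object as unknown
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
    }


# Hand-written sample in the OpenAI response shape.
sample = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "Pick two of three."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 25, "completion_tokens": 6, "total_tokens": 31},
}

summary = summarize_completion(sample)
print(summary["text"], summary["completion_tokens"])
```

In the OpenAI response format, a `finish_reason` of `"length"` means the reply was cut off by `max_tokens`, which is worth checking when using thinking-enabled models with a low limit.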

Streaming

Streaming returns tokens incrementally as they are generated, rather than waiting for the full response.

The request is shown first with curl, then with the OpenAI Python SDK.

curl https://llm.hpc.ut.ee/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-122b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Write a short poem about distributed systems."}
    ],
    "stream": true,
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 1024
  }'

import httpx
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.hpc.ut.ee/v1",
    api_key="YOUR_API_KEY",
    timeout=httpx.Timeout(300.0, connect=10.0),  # 300 s for read/write/pool, 10 s to connect
)

stream = client.chat.completions.create(
    model="qwen3.5-122b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a short poem about distributed systems."},
    ],
    stream=True,
    temperature=0.7,
    top_p=0.8,
    max_tokens=1024,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

A generous read timeout is recommended for streaming — long responses may take several minutes to complete.
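When streaming without the SDK (for example from the curl output or a raw HTTP client), the body arrives as server-sent events. The sketch below assumes the server follows OpenAI's framing, where each event is a `data:` line holding a JSON chunk and the stream ends with `data: [DONE]`; the function name and the sample lines are illustrative.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Hand-written sample events, not real server output.
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]

full_text = "".join(iter_stream_text(sample_lines))
print(full_text)
```

Joining the fragments reconstructs the complete reply, which is handy when you want to display tokens live and still store the full text afterwards.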

Image Understanding

Pass an image URL alongside your text prompt. Supported by the qwen3.5-122b model.

The request is shown first with curl, then with the OpenAI Python SDK.

curl https://llm.hpc.ut.ee/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-122b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {
            "type": "image_url",
            "image_url": {
              "url": "https://hpc.ut.ee/assets/images/server_room_3.png"
            }
          }
        ]
      }
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "max_tokens": 1024
  }'

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.hpc.ut.ee/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="qwen3.5-122b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://hpc.ut.ee/assets/images/server_room_3.png"
                    },
                },
            ],
        },
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=1024,
)

print(response.choices[0].message.content)
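For images that are not reachable by URL, the OpenAI API also accepts a base64 `data:` URL in the same `image_url` field. Whether this endpoint accepts data URLs is an assumption, not something stated in this guide, so treat the following as a sketch to try rather than a documented feature. The helper names are illustrative.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognised image file: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(prompt: str, path: str) -> dict:
    """Build a user message that pairs a text prompt with a local image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_to_data_url(path)}},
        ],
    }
```

The resulting dictionary can replace the user message in the `messages` list of the request above. Base64 inflates the payload by about a third, so large images are better served from a URL.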

Transcription

Convert audio files to text using whisper-large-v3. Use the language parameter to improve accuracy for a known language.

The request is shown first with curl, then with the OpenAI Python SDK.

curl https://llm.hpc.ut.ee/v1/audio/transcriptions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "model=whisper-large-v3" \
  -F "language=et" \
  -F "file=@/path/to/clip.mp3"

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.hpc.ut.ee/v1",
    api_key="YOUR_API_KEY",
)

with open("/path/to/clip.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        language="et",  # optional ISO-639-1 code; omit for automatic detection
    )

print(transcription.text)

The language parameter is optional. Providing the language as an ISO-639-1 code improves speed and accuracy; omit it to enable automatic language detection.
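Because the language is optional, it can help to build the request arguments in one place and validate the code before uploading a large file. A small sketch; `transcription_kwargs` is a hypothetical helper, and the two-letter check reflects the ISO-639-1 format rather than any list of languages the server supports.

```python
from typing import Optional


def transcription_kwargs(language: Optional[str] = None) -> dict:
    """Build keyword arguments for client.audio.transcriptions.create()."""
    kwargs = {"model": "whisper-large-v3"}
    if language is None:
        return kwargs  # no language sent, so the server detects it automatically
    code = language.strip().lower()
    # ISO-639-1 codes are exactly two ASCII letters, e.g. "et" or "en".
    if len(code) != 2 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"Expected a two-letter ISO-639-1 code, got {language!r}")
    kwargs["language"] = code
    return kwargs
```

With the SDK client from the example above, the call becomes `client.audio.transcriptions.create(file=audio_file, **transcription_kwargs("et"))`.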

Embeddings

Generate vector representations of text for use in semantic search, RAG pipelines, or classification.

The request is shown first with curl, then with the OpenAI Python SDK.

curl https://llm.hpc.ut.ee/v1/embeddings \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-embedding-8B",
    "input": "The University of Tartu HPC cluster provides GPU compute resources."
  }'

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.hpc.ut.ee/v1",
    api_key="YOUR_API_KEY",
)

response = client.embeddings.create(
    model="qwen3-embedding-8B",
    input="The University of Tartu HPC cluster provides GPU compute resources.",
)

vector = response.data[0].embedding  # 4096-dimensional float list
print(f"Embedding dimensions: {len(vector)}")
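For semantic search, embeddings are typically compared with cosine similarity: embed the query and each document, then rank documents by score. The sketch below uses toy 3-dimensional vectors in place of real 4096-dimensional embeddings; the function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, best match first."""
    scores = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings.
query = [1.0, 0.0, 0.0]
documents = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ranking = rank_documents(query, documents)
print([index for index, _ in ranking])
```

In the OpenAI embeddings format, `input` may also be a list of strings, with one entry in `response.data` per input; assuming this endpoint behaves the same way, a whole batch of documents can be embedded in one request.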

Recommended Parameters

Tip

Use a lower temperature for factual, analytical, or structured tasks. Use a higher temperature for creative or generative ones.

Qwen3.5-122B

Parameter	Recommended value	Description
`temperature`	`0.7`	Controls randomness. Lower = more deterministic.
`top_p`	`0.8`	Nucleus sampling: sample from the top 80% probability mass.
`presence_penalty`	`1.5`	Penalises already-used tokens to reduce repetition.
`max_tokens`	`1024`+	Maximum tokens to generate. Up to 131,072.

Gemma 4 31B IT

Parameter	Recommended value	Description
`temperature`	`1.0`	Default recommended by Google.
`top_p`	`0.95`	Nucleus sampling: sample from the top 95% probability mass.
`top_k`	`64`	Limits sampling to the top 64 tokens at each step.
`max_tokens`	`1024`+	Maximum tokens to generate. Up to 131,072.
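The recommended values can be kept in one place and merged into each request. The sketch below assumes the two tables correspond to the `qwen3.5-122b` and `gemma-4-31B-it` model names; `SAMPLING_PRESETS` and `build_chat_payload` are illustrative names.

```python
# Recommended sampling settings from the tables above, keyed by API model name.
SAMPLING_PRESETS = {
    "qwen3.5-122b": {"temperature": 0.7, "top_p": 0.8, "presence_penalty": 1.5},
    "gemma-4-31B-it": {"temperature": 1.0, "top_p": 0.95, "top_k": 64},
}


def build_chat_payload(model: str, messages: list, max_tokens: int = 1024, **overrides) -> dict:
    """Build a /chat/completions JSON body with the model's recommended settings."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    payload.update(SAMPLING_PRESETS.get(model, {}))  # unknown models get no preset
    payload.update(overrides)  # explicit arguments win over the preset
    return payload


payload = build_chat_payload(
    "gemma-4-31B-it",
    [{"role": "user", "content": "Summarise the CAP theorem."}],
    temperature=0.2,  # lower temperature for a factual task
)
print(payload)
```

The payload can be sent as the `json=` body with requests. With the OpenAI Python SDK, `top_k` is not a named argument of `chat.completions.create`, so it would have to go through `extra_body={"top_k": 64}`.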

Rate Limits

HTTP 429 — Too Many Requests

Rate limits are enforced per API key. If you receive 429 responses, you have exceeded your quota. Contact support@hpc.ut.ee to review or increase your limits.
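For short bursts, a client can wait and retry before giving up. The following is a minimal sketch of exponential backoff with jitter, not a documented policy of this service: `call_with_backoff` is a hypothetical helper, and `send` is any zero-argument callable returning an object with a `status_code` attribute, such as a `requests.post` call wrapped in a lambda.

```python
import random
import time
from typing import Any, Callable


def call_with_backoff(
    send: Callable[[], Any],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call send() until it returns a non-429 response or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            # Wait base_delay * 1, 2, 4, ... plus up to base_delay of random jitter.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    return response  # still 429 after all attempts; the caller decides what to do
```

If requests keep failing with 429 after several retries, the quota itself is likely exhausted and backoff will not help. The OpenAI Python SDK also has a `max_retries` client option that retries rate-limited requests on its own.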

Compatibility

The API is fully compatible with the OpenAI API specification. Any OpenAI-compatible library or tool works out of the box — set the base URL to https://llm.hpc.ut.ee/v1 and provide your API key.

Compatible tools & frameworks

LangChain — use ChatOpenAI with base_url="https://llm.hpc.ut.ee/v1"
LlamaIndex — use the OpenAI LLM class with a custom api_base
Continue — configure as an OpenAI-compatible provider in your IDE
Any tool supporting the OPENAI_BASE_URL and OPENAI_API_KEY environment variables