S.O.T.A. SYSTEMS

Introduction

S.O.T.A. SYSTEMS API Documentation

Pre-Launch Preview - This documentation previews our API launching Q1 2026. Endpoints are not yet live.

Welcome to S.O.T.A. SYSTEMS — flat-rate, unlimited access to open-source AI models on NVIDIA DGX GB300 infrastructure.

What Makes Us Different

  • Flat-rate pricing: No per-token charges. Pay a monthly fee, use as much as you need.
  • Concurrent streams: Your tier determines how many simultaneous requests you can run. Requests queue automatically when streams are busy.
  • OpenAI-compatible API: Drop-in replacement. Change your base URL and you're done.
  • Self-hosted in Italy: GDPR-native, zero training on your data, maximum privacy.

Quick Start

# Install the OpenAI SDK first
pip install openai

import os
from openai import OpenAI
 
client = OpenAI(
    base_url="https://ai.sota.systems/v1",
    api_key=os.environ.get("SOTA_API_KEY")
)
 
response = client.chat.completions.create(
    model="llama-3.1-70b-instruct",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ]
)
 
print(response.choices[0].message.content)

Core Concepts

Concurrent Streams

Each tier includes a set number of concurrent streams — the number of requests your account can process simultaneously.

  • Solo: 3 concurrent streams
  • Team: 10 concurrent streams
  • Platform: Custom allocation

How it works:

  1. You send a request to the API
  2. If a stream is available, your request starts immediately
  3. If all streams are busy, your request enters a queue (max wait: 60 seconds)
  4. When a stream finishes, the next queued request starts automatically

Why streams? Streams give you predictable performance without artificial rate limits. You know exactly how many requests you can run in parallel, and you're never throttled based on token usage.
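If your application generates more parallel work than your tier allows, you can cap in-flight requests client-side so excess work waits in your process instead of consuming the 60-second server queue. Here is a minimal sketch using the async OpenAI SDK with a semaphore sized to the Solo tier's 3 streams (the client-side cap is a suggested pattern, not an API requirement):

import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://ai.sota.systems/v1",
    api_key=os.environ.get("SOTA_API_KEY"),
)

# Solo tier: 3 concurrent streams. Requests beyond this wait locally
# rather than spending their 60-second window in the server-side queue.
STREAMS = asyncio.Semaphore(3)

async def ask(prompt: str) -> str:
    async with STREAMS:  # hold one stream slot per in-flight request
        response = await client.chat.completions.create(
            model="llama-3.1-70b-instruct",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

async def main():
    prompts = [f"Summarize topic {i}" for i in range(10)]
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    print(answers[0])

asyncio.run(main())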

Authentication

All API requests require authentication via API keys.

API key format: sk-sota-[random-string]

import os
from openai import OpenAI
 
# Set via environment variable
os.environ["SOTA_API_KEY"] = "sk-sota-..."
 
# Or pass directly
client = OpenAI(
    base_url="https://ai.sota.systems/v1",
    api_key="sk-sota-..."
)

You can generate and manage API keys from your dashboard (available at launch).

API Endpoints

All endpoints are hosted at https://ai.sota.systems/v1

Chat Completions

POST /v1/chat/completions

Generate completions for chat-style interactions.

response = client.chat.completions.create(
    model="llama-3.1-70b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.7,
    max_tokens=2000,
    stream=False
)
 
print(response.choices[0].message.content)

Parameters:

  • model (string, required): Model ID (see Models section)
  • messages (array, required): Conversation history
  • temperature (number, optional): Randomness (0-2, default: 1)
  • max_tokens (number, optional): Maximum output length
  • stream (boolean, optional): Enable streaming responses
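Because the API is OpenAI-compatible, the same request can be issued from any HTTP client. A curl equivalent of the example above (the JSON body mirrors the parameters listed, following the OpenAI wire format):

curl https://ai.sota.systems/v1/chat/completions \
  -H "Authorization: Bearer sk-sota-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-70b-instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "max_tokens": 2000
  }'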

Streaming Completions

POST /v1/chat/completions with stream: true

stream = client.chat.completions.create(
    model="llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    stream=True
)
 
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
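On the wire, streaming should follow the OpenAI server-sent-events convention (an assumption based on the compatibility claim; the SDK parses this for you): each chunk arrives as a data: line of JSON, and the stream ends with data: [DONE]. With curl:

curl https://ai.sota.systems/v1/chat/completions \
  -H "Authorization: Bearer sk-sota-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-70b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about coding"}],
    "stream": true
  }'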

List Models

GET /v1/models

Retrieve all available models for your tier.

curl https://ai.sota.systems/v1/models \
  -H "Authorization: Bearer sk-sota-..."

Response:

{
  "object": "list",
  "data": [
    {
      "id": "llama-3.1-70b-instruct",
      "object": "model",
      "created": 1717545600,
      "owned_by": "meta"
    }
  ]
}
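The same listing through the SDK, with the client configured as in the Quick Start:

# List models available to your tier
models = client.models.list()
for model in models.data:
    print(model.id)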

Infrastructure

Hosting

  • Location: Self-hosted in Italy (Europe)
  • Hardware: NVIDIA DGX GB300s
  • Compliance: GDPR-native, no training on user data
  • Encryption: TLS 1.3 for all API traffic

Performance Targets

At launch, we're targeting:

  • 99.95% uptime (4.4 hours/year max downtime)
  • <50ms p95 latency (first token)

These are engineering goals, not guarantees. We'll publish live status pages and incident reports.

Models

All models use simplified naming: family-version-size-variant (for example, llama-3.1-70b-instruct)

Available Models (Launch)

| Model ID | Parameters | Context Window | Strengths |
| --- | --- | --- | --- |
| llama-3.1-70b-instruct | 70B | 128K | General purpose, coding, reasoning |
| llama-3.1-8b-instruct | 8B | 128K | Fast responses, simple tasks |
| mistral-large-2-123b | 123B | 128K | Complex reasoning, long context |
| qwen-2.5-72b-instruct | 72B | 128K | Multilingual, math, coding |

Need a specific model? Model requests will open with the dashboard at launch.

Error Handling

Common Status Codes

  • 200: Success
  • 400: Bad request (invalid parameters)
  • 401: Unauthorized (missing or invalid API key)
  • 408: Request timeout (queue wait exceeded 60s)
  • 429: Too many requests (all streams busy, queue full)
  • 500: Server error

Example Error Response

{
  "error": {
    "message": "All concurrent streams are busy. Request queued.",
    "type": "stream_capacity_exceeded",
    "code": 429,
    "retry_after": 5
  }
}
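A 429 means both your streams and the server-side queue are full, so the right response is to back off and retry. Here is a minimal retry sketch using the OpenAI SDK's exception types (reading a Retry-After header is an assumption based on the retry_after field above; the exponential fallback is a suggested default):

import os
import time

import openai
from openai import OpenAI

client = OpenAI(
    base_url="https://ai.sota.systems/v1",
    api_key=os.environ.get("SOTA_API_KEY"),
)

def create_with_retry(messages, model="llama-3.1-70b-instruct", max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except openai.RateLimitError as e:
            # 429: all streams busy and the queue is full. Honor Retry-After
            # if the server sends it (assumed); otherwise back off exponentially.
            wait = float(e.response.headers.get("retry-after", 2 ** attempt))
            time.sleep(wait)
        except openai.APIStatusError as e:
            if e.status_code != 408:
                raise  # 400/401/500: retrying will not help
            # 408: queue wait exceeded 60s; safe to retry with a fresh request
    raise RuntimeError("Request failed after retries")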
