S.O.T.A. SYSTEMS

Introduction

S.O.T.A. SYSTEMS API Documentation

Pre-Launch Preview - This documentation previews our API launching Q1 2026. Endpoints are not yet live.

Welcome to S.O.T.A. SYSTEMS — flat-rate, unlimited access to open-source AI models on NVIDIA DGX GB300 infrastructure.

What Makes Us Different

  • Flat-rate pricing: No per-token charges. Pay a monthly fee, use as much as you need.
  • Concurrent streams: Your tier determines how many simultaneous requests you can run. Requests queue automatically when streams are busy.
  • OpenAI-compatible API: Drop-in replacement. Change your base URL and you're done.
  • Self-hosted in Italy: GDPR-native, zero training on your data, maximum privacy.

Quick Start

# Install the OpenAI SDK first
pip install openai

import os
from openai import OpenAI
 
client = OpenAI(
    base_url="https://ai.sota.systems/v1",
    api_key=os.environ.get("SOTA_API_KEY")
)
 
response = client.chat.completions.create(
    model="llama-3.1-70b-instruct",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ]
)
 
print(response.choices[0].message.content)

Core Concepts

Concurrent Streams

Each tier includes a set number of concurrent streams — the number of requests your account can process simultaneously.

  • Solo: 3 concurrent streams
  • Team: 10 concurrent streams
  • Platform: Custom allocation

How it works:

  1. You send a request to the API
  2. If a stream is available, your request starts immediately
  3. If all streams are busy, your request enters a queue (max wait: 60 seconds)
  4. When a stream finishes, the next queued request starts automatically

Why streams? Streams give you predictable performance without artificial rate limits. You know exactly how many requests you can run in parallel, and you're never throttled based on token usage.
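If your application generates more parallel work than your tier allows, you can cap in-flight requests client-side so excess work waits in your process instead of consuming the 60-second server queue. Here is a minimal sketch using the async OpenAI SDK with a semaphore sized to the Solo tier's 3 streams (the client-side cap is a suggested pattern, not an API requirement):

import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://ai.sota.systems/v1",
    api_key=os.environ.get("SOTA_API_KEY"),
)

# Solo tier: 3 concurrent streams. Requests beyond this wait locally
# rather than spending their 60-second window in the server-side queue.
STREAMS = asyncio.Semaphore(3)

async def ask(prompt: str) -> str:
    async with STREAMS:  # hold one stream slot per in-flight request
        response = await client.chat.completions.create(
            model="llama-3.1-70b-instruct",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

async def main():
    prompts = [f"Summarize topic {i}" for i in range(10)]
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    print(answers[0])

asyncio.run(main())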

Authentication

All API requests require authentication via API keys.

API key format: sk-sota-[random-string]

import os
from openai import OpenAI
 
# Set via environment variable
os.environ["SOTA_API_KEY"] = "sk-sota-..."
 
# Or pass directly
client = OpenAI(
    base_url="https://ai.sota.systems/v1",
    api_key="sk-sota-..."
)

You can generate and manage API keys from your dashboard (available at launch).

API Endpoints

All endpoints are hosted at https://ai.sota.systems/v1

Chat Completions

POST /v1/chat/completions

Generate completions for chat-style interactions.

response = client.chat.completions.create(
    model="llama-3.1-70b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.7,
    max_tokens=2000,
    stream=False
)
 
print(response.choices[0].message.content)

Parameters:

  • model (string, required): Model ID (see Models section)
  • messages (array, required): Conversation history
  • temperature (number, optional): Randomness (0-2, default: 1)
  • max_tokens (number, optional): Maximum output length
  • stream (boolean, optional): Enable streaming responses
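Because the API is OpenAI-compatible, the same request can be issued from any HTTP client. A curl equivalent of the example above (the JSON body mirrors the parameters listed, following the OpenAI wire format):

curl https://ai.sota.systems/v1/chat/completions \
  -H "Authorization: Bearer sk-sota-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-70b-instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "max_tokens": 2000
  }'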

Streaming Completions

POST /v1/chat/completions with stream: true

stream = client.chat.completions.create(
    model="llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    stream=True
)
 
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
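On the wire, streaming should follow the OpenAI server-sent-events convention (an assumption based on the compatibility claim; the SDK parses this for you): each chunk arrives as a data: line of JSON, and the stream ends with data: [DONE]. With curl:

curl https://ai.sota.systems/v1/chat/completions \
  -H "Authorization: Bearer sk-sota-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-70b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about coding"}],
    "stream": true
  }'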

List Models

GET /v1/models

Retrieve all available models for your tier.

curl https://ai.sota.systems/v1/models \
  -H "Authorization: Bearer sk-sota-..."

Response:

{
  "object": "list",
  "data": [
    {
      "id": "llama-3.1-70b-instruct",
      "object": "model",
      "created": 1717545600,
      "owned_by": "meta"
    }
  ]
}
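The same listing through the SDK, with the client configured as in the Quick Start:

# List models available to your tier
models = client.models.list()
for model in models.data:
    print(model.id)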

Infrastructure

Hosting

  • Location: Self-hosted in Italy (Europe)
  • Hardware: NVIDIA DGX GB300s
  • Compliance: GDPR-native, no training on user data
  • Encryption: TLS 1.3 for all API traffic

Performance Targets

At launch, we're targeting:

  • 99.95% uptime (4.4 hours/year max downtime)
  • <50ms p95 latency (first token)

These are engineering goals, not guarantees. We'll publish live status pages and incident reports.

Models

All models use simplified naming: family-version-size-variant (for example, llama-3.1-70b-instruct)

Available Models (Launch)

| Model ID | Parameters | Context Window | Strengths |
| --- | --- | --- | --- |
| llama-3.1-70b-instruct | 70B | 128K | General purpose, coding, reasoning |
| llama-3.1-8b-instruct | 8B | 128K | Fast responses, simple tasks |
| mistral-large-2-123b | 123B | 128K | Complex reasoning, long context |
| qwen-2.5-72b-instruct | 72B | 128K | Multilingual, math, coding |

Need a specific model? Model requests will open with the dashboard at launch.

Error Handling

Common Status Codes

  • 200: Success
  • 400: Bad request (invalid parameters)
  • 401: Unauthorized (missing or invalid API key)
  • 408: Request timeout (queue wait exceeded 60s)
  • 429: Too many requests (all streams busy, queue full)
  • 500: Server error

Example Error Response

{
  "error": {
    "message": "All concurrent streams are busy. Request queued.",
    "type": "stream_capacity_exceeded",
    "code": 429,
    "retry_after": 5
  }
}
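A 429 means both your streams and the server-side queue are full, so the right response is to back off and retry. Here is a minimal retry sketch using the OpenAI SDK's exception types (reading a Retry-After header is an assumption based on the retry_after field above; the exponential fallback is a suggested default):

import os
import time

import openai
from openai import OpenAI

client = OpenAI(
    base_url="https://ai.sota.systems/v1",
    api_key=os.environ.get("SOTA_API_KEY"),
)

def create_with_retry(messages, model="llama-3.1-70b-instruct", max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except openai.RateLimitError as e:
            # 429: all streams busy and the queue is full. Honor Retry-After
            # if the server sends it (assumed); otherwise back off exponentially.
            wait = float(e.response.headers.get("retry-after", 2 ** attempt))
            time.sleep(wait)
        except openai.APIStatusError as e:
            if e.status_code != 408:
                raise  # 400/401/500: retrying will not help
            # 408: queue wait exceeded 60s; safe to retry with a fresh request
    raise RuntimeError("Request failed after retries")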
