Introduction
S.O.T.A. SYSTEMS API Documentation
Pre-Launch Preview - This documentation previews our API launching Q1 2026. Endpoints are not yet live.
Welcome to S.O.T.A. SYSTEMS — flat-rate unlimited access to open-source AI models on NVIDIA DGX GB300s infrastructure.
What Makes Us Different
- Flat-rate pricing: No per-token charges. Pay a monthly fee, use as much as you need.
- Concurrent streams: Your tier determines how many simultaneous requests you can run. Requests queue automatically when streams are busy.
- OpenAI-compatible API: Drop-in replacement. Change your base URL and you're done.
- Self-hosted in Italy: GDPR-native, zero training on your data, maximum privacy.
Quick Start
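Because the endpoints are not yet live, the sketch below only builds the request rather than sending it. Bearer-token auth is an assumption based on OpenAI compatibility, and the key and model ID are placeholders; once the API launches, send the request with `urllib.request.urlopen(req)`.

```python
import json
import urllib.request

BASE_URL = "https://ai.sota.systems/v1"
API_KEY = "sk-sota-your-key-here"  # placeholder; generate from the dashboard at launch

def build_chat_request(messages, model="llama-3.1-70b-instruct"):
    """Build a POST request for /v1/chat/completions (not sent here)."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request([{"role": "user", "content": "Hello!"}])
# Once live: response = json.load(urllib.request.urlopen(req))
```

Because the API is OpenAI-compatible, existing OpenAI client libraries should also work by pointing their base URL at `https://ai.sota.systems/v1`.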
Core Concepts
Concurrent Streams
Each tier includes a set number of concurrent streams — the number of requests your account can process simultaneously.
- Solo: 3 concurrent streams
- Team: 10 concurrent streams
- Platform: Custom allocation
How it works:
- You send a request to the API
- If a stream is available, your request starts immediately
- If all streams are busy, your request enters a queue (max wait: 60 seconds)
- When a stream finishes, the next queued request starts automatically
Why streams? Streams give you predictable performance without artificial rate limits. You know exactly how many requests you can run in parallel, and you're never throttled based on token usage.
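One way to work with stream limits client-side is to cap in-flight requests at your tier's allocation, so requests start immediately instead of waiting in the server-side queue. A minimal sketch, assuming the Solo tier's 3 streams (the worker function stands in for an actual API call):

```python
from concurrent.futures import ThreadPoolExecutor

MAX_STREAMS = 3  # Solo tier allocation

def run_batch(worker, jobs, max_streams=MAX_STREAMS):
    # At most `max_streams` jobs run at once; the rest wait locally,
    # mirroring the server-side queue described above.
    with ThreadPoolExecutor(max_workers=max_streams) as pool:
        return list(pool.map(worker, jobs))
```

Capping concurrency locally also avoids hitting the 60-second queue limit under sustained load, since excess work waits on your side with no timeout.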
Authentication
All API requests require authentication via API keys.
API key format: sk-sota-[random-string]
You can generate and manage API keys from your dashboard (available at launch).
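Given the OpenAI-compatible design, keys are presumably passed as a Bearer token in the `Authorization` header. This is an assumption until launch documentation confirms it:

```python
API_KEY = "sk-sota-your-key-here"  # placeholder key

# Headers to attach to every API request
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}
```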
API Endpoints
All endpoints are hosted at https://ai.sota.systems/v1
Chat Completions
POST /v1/chat/completions
Generate completions for chat-style interactions.
Parameters:
- model (string, required): Model ID (see Models section)
- messages (array, required): Conversation history
- temperature (number, optional): Randomness (0-2, default: 1)
- max_tokens (number, optional): Maximum output length
- stream (boolean, optional): Enable streaming responses
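Putting the parameters together, an illustrative request body (the model ID is taken from the Models section; values are examples only):

```json
{
  "model": "llama-3.1-70b-instruct",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this in one sentence."}
  ],
  "temperature": 0.7,
  "max_tokens": 256,
  "stream": false
}
```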
Streaming Completions
POST /v1/chat/completions with stream: true
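With `stream: true`, OpenAI-compatible APIs typically return server-sent events: one `data: {json}` line per chunk, terminated by `data: [DONE]`. A parsing sketch under that assumption (the canned chunks stand in for a live HTTP response):

```python
import json

def iter_deltas(lines):
    """Yield incremental text from an SSE chat-completion stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content", "")
        if delta:
            yield delta

# Example with canned chunks; a live stream would iterate the response body:
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = "".join(iter_deltas(sample))
```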
List Models
GET /v1/models
Retrieve all available models for your tier.
Response:
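An illustrative response body, assuming the OpenAI-compatible list format (exact fields may differ at launch):

```json
{
  "object": "list",
  "data": [
    {"id": "llama-3.1-70b-instruct", "object": "model"},
    {"id": "llama-3.1-8b-instruct", "object": "model"}
  ]
}
```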
Infrastructure
Hosting
- Location: Self-hosted in Italy (Europe)
- Hardware: NVIDIA DGX GB300s
- Compliance: GDPR-native, no training on user data
- Encryption: TLS 1.3 for all API traffic
Performance Targets
At launch, we're targeting:
- 99.95% uptime (4.4 hours/year max downtime)
- <50ms p95 latency (first token)
These are engineering goals, not guarantees. We'll publish live status pages and incident reports.
Models
All models use simplified naming: provider-model-size-variant
Available Models (Launch)
| Model ID | Parameters | Context Window | Strengths |
|---|---|---|---|
| llama-3.1-70b-instruct | 70B | 128K | General purpose, coding, reasoning |
| llama-3.1-8b-instruct | 8B | 128K | Fast responses, simple tasks |
| mistral-large-2-123b | 123B | 128K | Complex reasoning, long context |
| qwen-2.5-72b-instruct | 72B | 128K | Multilingual, math, coding |
Need a specific model? Request it here.
Error Handling
Common Status Codes
- 200: Success
- 400: Bad request (invalid parameters)
- 401: Unauthorized (missing or invalid API key)
- 408: Request timeout (queue wait exceeded 60s)
- 429: Too many requests (all streams busy, queue full)
- 500: Server error
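Since 408 and 429 signal queue pressure rather than a client bug, they are natural retry candidates. A backoff sketch; the retry policy here is an assumption, not an official recommendation:

```python
import time

RETRYABLE = {408, 429}  # queue wait exceeded / queue full

def call_with_retries(call, max_attempts=4, base_delay=1.0):
    """Run `call` (returning an HTTP status code and body), retrying
    queue-related failures with exponential backoff."""
    for attempt in range(max_attempts):
        status, body = call()
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            time.sleep(base_delay * (2 ** attempt))
    return status, body
```

Errors like 400 and 401 are returned immediately, since retrying an invalid request or bad key cannot succeed.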