One API. Every model.

One API key.
Every model.

Route requests across Anthropic, OpenAI, Meta, Google, xAI and more provider models — through a single universal endpoint. No refactoring required.


⚡

Lightning Fast

Groq-hosted models deliver sub-100ms latency and production-grade throughput at free-tier pricing.

🐻

Smart Routing

Grizzly 1.0G picks the cheapest model that can handle your prompt — pay only for the routed underlying rate.

🔑

One API Key

Swap models without changing code. One line to integrate with any SDK.

Quick Start

# One-line installer (writes the SDK + your API key to ~/.yieldingbear/)

curl -fsSL https://yieldingbear.com/install.sh | bash

# Then in any Node script (the SDK is vendored at ~/.yieldingbear/yieldingbear.mjs):
import { YieldingBear } from 'yieldingbear.mjs';

const client = await YieldingBear.fromKeyFile();   // picks up ~/.hermes/secrets/yieldingbear-token

const chat = await client.chat.completions.create({
  model: 'yieldingbear/grizzly-1.0g',   // smart router — or pin any specific model
  messages: [{ role: 'user', content: 'Hello!' }],
});

console.log(chat.choices[0].message.content);
console.log('cost: $' + client.cost(chat).toFixed(6));


Introduction

The Yielding Bear API provides unified access to 50+ large language models through a single universal endpoint. Just change your base URL — no code refactoring required.

Our intelligent routing picks the most efficient model for your reasoning needs — fast models for simple tasks, high-reasoning models for complex work — delivering meaningful cost reductions vs direct API calls. Every request is analyzed, classified, and routed to the optimal model.

The API supports both direct model selection (you pick the exact model) and automatic routing (Yielding Bear picks the best model for your task). Both modes are available through the same endpoint.

Base URL

https://yieldingbear.com/api/v1

Direct Model Selection

Specify the exact model you want. Useful when you need a specific capability.

model: "openai/gpt-4o"

Automatic Routing

Let Yielding Bear choose the optimal model. Best for cost optimization.

model: "yb/default"

Authentication

All API requests require an API key passed as a Bearer token in the Authorization header. Get your API key from the dashboard.

Authorization: Bearer yb_live_sk_<64 hex characters>

🔑

API Key Format

Keys start with the prefix yb_live_sk_ followed by 64 random hex characters (32 bytes of entropy). Example: yb_live_sk_aaf1ccf8b34ebda1c52d…. All keys are production — there is no separate test prefix. Get yours from the dashboard.
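A quick client-side check can catch a truncated or mistyped key before any request is sent. The sketch below is illustrative only: it assumes the 64 hex characters are lowercase, as in the example key, and looks_like_yb_key is a hypothetical helper name, not part of the SDK.

```python
import re

# Assumption: hex characters are lowercase, as in the documented example key.
KEY_PATTERN = re.compile(r"yb_live_sk_[0-9a-f]{64}")

def looks_like_yb_key(key: str) -> bool:
    """Return True if the string matches the documented key shape."""
    return KEY_PATTERN.fullmatch(key) is not None
```

A passing check only means the key is well-formed; whether it is valid is decided by the API.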

🔒

Key Security

Keys are hashed with SHA-256 before storage. The raw key is only shown once at creation — store it securely in environment variables or a secrets manager.
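To illustrate what hashing before storage means in practice, here is a minimal sketch using Python's standard library. It is not Yielding Bear's actual implementation; fingerprint and matches are hypothetical names chosen for this example.

```python
import hashlib
import hmac

def fingerprint(raw_key: str) -> str:
    """SHA-256 hex digest of the raw key; only this digest would be stored."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()

def matches(presented_key: str, stored_digest: str) -> bool:
    """Compare digests in constant time to avoid timing leaks."""
    return hmac.compare_digest(fingerprint(presented_key), stored_digest)
```

Because the digest cannot be reversed, a lost key cannot be recovered from storage, which is why the raw key is shown only once.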

⚠️

Never expose your key

Do not commit API keys to version control. Use environment variables (.env files) and rotate keys regularly from the dashboard.

# Set your API key (prefix is yb_live_sk_, then 64 hex chars)

export YB_API_KEY="yb_live_sk_aaf1ccf8b34ebda1c52d..."

# Verify your key works
curl https://yieldingbear.com/api/v1/usage \
  -H "Authorization: Bearer $YB_API_KEY"

Chat Completions

The Chat Completions endpoint is OpenAI-compatible. Send a POST request with messages and receive a structured response. It works with any LLM SDK that accepts a custom base URL, including the OpenAI SDKs.

Endpoint

POST /api/v1/chat/completions

Request Body

{
  "model": "groq/llama-3.1-8b-instant",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is the capital of France?" }
  ],
  "temperature": 0.7,
  "max_tokens": 256,
  "stream": false
}

cURL Example

curl -X POST https://yieldingbear.com/api/v1/chat/completions \
  -H "Authorization: Bearer $YB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "groq/llama-3.1-8b-instant",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Parameters

Parameter	Type	Default	Description
model	string	—	Model identifier (e.g. groq/llama-3.1-8b-instant)
messages	array	—	Array of message objects with role and content
temperature	float	0.7	Sampling temperature (0–2). Lower = more deterministic.
max_tokens	int	256	Maximum tokens to generate
stream	bool	false	Enable streaming responses (SSE)
top_p	float	1.0	Nucleus sampling threshold
frequency_penalty	float	0	Penalty for token frequency (-2 to 2)
presence_penalty	float	0	Penalty for token presence (-2 to 2)
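The parameter ranges in the table can be checked on the client before a request leaves your machine. The following is a sketch using only the standard library; build_chat_request is a hypothetical helper (not part of the SDK), and it builds the request without sending it.

```python
import json
import urllib.request

BASE_URL = "https://yieldingbear.com/api/v1"

def build_chat_request(api_key, model, messages, **params):
    """Build (but do not send) a POST to /chat/completions."""
    # Range checks mirror the parameter table above.
    if not 0 <= params.get("temperature", 0.7) <= 2:
        raise ValueError("temperature must be between 0 and 2")
    for name in ("frequency_penalty", "presence_penalty"):
        if not -2 <= params.get(name, 0) <= 2:
            raise ValueError(f"{name} must be between -2 and 2")
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send the request, pass the returned object to urllib.request.urlopen and decode the JSON body of the response.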

Response Format

{
  "id": "chatcmpl-3552055b-e963-4907-a077-d76f5a1019e7",
  "object": "chat.completion",
  "created": 1781641651,
  "model": "groq/llama-3.1-8b-instant",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "The capital of France is Paris."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 36,
    "completion_tokens": 24,
    "total_tokens": 60
  },
  "cost_usd": 0.000017,
  "balance_remaining_usd": 4.9911
}
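Reading the response usually comes down to a few fields. This sketch assumes the response has already been decoded into a Python dict with the shape shown above; summarize_completion is a hypothetical helper name.

```python
def summarize_completion(response: dict) -> dict:
    """Pull out the reply text, token count, and cost from a response dict."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        # Sum the two counters rather than trusting a single field.
        "tokens": usage.get("prompt_tokens", 0) + usage.get("completion_tokens", 0),
        "cost_usd": response.get("cost_usd"),
    }
```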

Response Format — Grizzly 1.0G routing field

When you call any yieldingbear/grizzly-1.0g* virtual model, the response gets an extra top-level routing object describing the decision, plus matching x-yb-routing-* headers. Use this to log/audit which underlying model actually served the request and to debug forced overrides.

{
  "id": "chatcmpl-3552055b-e963-4907-a077-d76f5a1019e7",
  "object": "chat.completion",
  "created": 1781641651,
  "model": "groq/llama-3-3-70b-versatile",   ← actual underlying model that served the request
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "It's nice to meet you. Is there something I can help you with, or would you like to chat?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 36,
    "completion_tokens": 24,
    "total_tokens": 60
  },
  "cost_usd": 0.000017,
  "balance_remaining_usd": 4.9911,
  "routing": {
    "category": "general",                                       ← which specialist was called
    "intent":   "text",                                          ← text or image
    "tier":     "low",                                           ← low = cheap, high = powerful
    "routed_to": "groq/llama-3-3-70b-versatile",                 ← resolved underlying model
    "reason":    "general:no-high-signals;default-low",          ← why this tier was picked
    "explicit":  false                                           ← true if user forced tier/force via header or body
  }
}

A grizzly-1.0g-coding call with a hard-refactor prompt that triggered the high tier:

{
  "id": "chatcmpl-R1p2pOLmD2huJgdFQLrm0BCd",
  "model": "deepinfra/qwen3-coder-480b",   ← swapped to the 480B free Coder
  ...
  "routing": {
    "category": "coding",
    "intent":   "text",
    "tier":     "high",
    "routed_to": "deepinfra/qwen3-coder-480b",
    "reason":    "cat=coding, kw:\"refactor\"",
    "explicit":  false
  }
}

Headers (every Grizzly request):

x-yb-routing-category — general | coding | creative | finance
x-yb-routing-intent — text | image
x-yb-routing-tier — low | high
x-yb-routing-target — the resolved underlying model id
x-yb-routing-reason — human-readable reason (truncated to 200 chars)
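For logging or auditing, the routing headers can be gathered into one record. The sketch below assumes the response headers are available as a plain dict of strings; routing_from_headers is a hypothetical helper name.

```python
PREFIX = "x-yb-routing-"

def routing_from_headers(headers: dict) -> dict:
    """Collect x-yb-routing-* headers into a dict keyed by the header suffix."""
    return {
        name.lower()[len(PREFIX):]: value
        for name, value in headers.items()
        if name.lower().startswith(PREFIX)   # header names are case-insensitive
    }
```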

Grizzly 1.0G Examples

Call any of the four Grizzly 1.0G smart routers instead of a specific model. The router classifies the prompt and picks the most efficient underlying model, and you pay nothing on top of the routed model's per-Mtok rate. The endpoint, auth, and response shape are the same as Chat Completions, plus an extra routing field that shows what was picked.

▸ grizzly-1.0g — General chat

Routes trivial chat to Groq Llama 70B (free), hard reasoning to Claude 3.5 Sonnet.

curl https://yieldingbear.com/api/v1/chat/completions \
  -H "Authorization: Bearer $YB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "yieldingbear/grizzly-1.0g",
    "messages": [
      { "role": "user", "content": "Write a haiku about coding" }
    ]
  }'

▸ grizzly-1.0g-coding — Code refactors & real engineering

Trivial snippets → Groq Llama 70B (free). Real refactors → deepinfra/qwen3-coder-480b (480B-param, free).

curl https://yieldingbear.com/api/v1/chat/completions \
  -H "Authorization: Bearer $YB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "yieldingbear/grizzly-1.0g-coding",
    "messages": [
      { "role": "user", "content": "Refactor this Python function to use async/await:\n\ndef fetch_all(urls):\n    return [requests.get(u).json() for u in urls]" }
    ]
  }'

▸ grizzly-1.0g-creative — Writing & brand copy

Routes text requests to Groq Llama 70B (free) for short copy, Claude 3.5 Sonnet for long-form.
Note: image generation is currently available via /api/v1/images, not this router. Calling creative with an image prompt returns a 404 today.

curl https://yieldingbear.com/api/v1/chat/completions \
  -H "Authorization: Bearer $YB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "yieldingbear/grizzly-1.0g-creative",
    "messages": [
      { "role": "user", "content": "Write a 3-sentence product description for a bear-themed energy drink" }
    ]
  }'

▸ grizzly-1.0g-finance — Trading, forecasting, live data

Simple Q&A → DeepSeek V3. Real forecasts → deepseek-reasoner. Live data with citations → perplexity/sonar-pro.

curl https://yieldingbear.com/api/v1/chat/completions \
  -H "Authorization: Bearer $YB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "yieldingbear/grizzly-1.0g-finance",
    "messages": [
      { "role": "user", "content": "What is the current price of NVDA and what is its 50-day moving average?" }
    ]
  }'

Tip: All four specialists support "stream": true for SSE streaming, tools/function calling, and JSON mode. Pricing is whatever the routed underlying model charges — there's no Grizzly markup. See /api/v1/pricing for the current blended rates.

Embeddings

Generate text embeddings through a single endpoint. Output is an OpenAI-shaped vector array you can drop into any vector DB (Pinecone, Supabase pgvector, Weaviate, etc.). Use the response data[].embedding field — no parsing needed.

Endpoint

POST /api/v1/embeddings

Request Body

{
  "model": "deepinfra/bge-base-en-v1.5",
  "input": "The food was delicious and the service was excellent."
}

# cURL example

curl -X POST https://yieldingbear.com/api/v1/embeddings \
  -H "Authorization: Bearer $YB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepinfra/bge-base-en-v1.5",
    "input": "The food was delicious and the service was excellent."
  }'

Response (truncated)

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [0.0107, 0.0557, 0.0270, ...768 floats total]
    }
  ],
  "model": "deepinfra/bge-base-en-v1.5",
  "usage": { "prompt_tokens": 4, "total_tokens": 4 },
  "cost_usd": 0,
  "balance_remaining_usd": 4.99
}
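Once you have vectors from data[].embedding, the most common next step is comparing them. The sketch below shows plain cosine similarity in pure Python; it is a generic technique rather than anything specific to Yielding Bear, and a vector DB would normally do this for you.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)
```

Values close to 1 indicate similar texts; values near 0 indicate unrelated ones.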

Available Embedding Models

deepinfra/bge-base-en-v1.5: 768 dimensions, free

More embedding models are on the roadmap. Call GET /api/v1/models to see the live catalog.

Models

List all available models. This endpoint is public and does not require authentication. Use it to discover models and check their current pricing.

GET /api/v1/models

# List all available models (no API key required)

curl https://yieldingbear.com/api/v1/models

Popular Models by Category

Fast / Low Latency: groq/llama-3.1-8b-instant, groq/llama-3.3-70b-versatile, groq/llama-4-scout

Reasoning / Coding: deepinfra/qwen3-coder-480b, anthropic/claude-3-7-sonnet, anthropic/claude-3-5-sonnet-latest, openai/o1-mini

Open-Weight (free): deepinfra/qwen2.5-72b-instruct, deepinfra/llama-3.1-70b-instruct, deepseek/deepseek-chat-v3

Vision / Multimodal: openai/gpt-4o, anthropic/claude-3-5-sonnet-latest, google/gemini-2.0-flash

Live Data / Web Search: perplexity/sonar, perplexity/sonar-pro, xai/grok-3

Run GET /api/v1/models for the full live catalog (currently 48 model IDs).
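Model IDs in this document follow a provider/model pattern, which makes it easy to group a catalog by provider. The sketch below works on a plain list of ID strings because the exact response shape of the models endpoint is not shown here; group_by_provider is a hypothetical helper name.

```python
from collections import defaultdict

def group_by_provider(model_ids):
    """Group 'provider/model' IDs into {provider: [model, ...]}."""
    grouped = defaultdict(list)
    for model_id in model_ids:
        # Split on the first slash only; the rest is the model name.
        provider, _, name = model_id.partition("/")
        grouped[provider].append(name)
    return dict(grouped)
```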

Rate Limiting

Rate limits protect the API from abuse and ensure fair access. Limits are applied per API key and vary by tier.

Free Tier

Requests/min: 60

Requests/day: Unlimited

Models: Groq, Cerebras, Together only

Paid Tier

Requests/min: 1,000+

Requests/day: Unlimited

Models: All providers, custom limits available

429 Response

When you exceed your tier's RPM, the API returns HTTP 429 with a JSON body. The standard Retry-After HTTP header is included (in seconds).

{
  "error": {
    "message": "Rate limit exceeded: 60 requests per minute",
    "type": "rate_limit_exceeded",
    "code": "rate_limit_exceeded"
  }
}

Best Practices

• Implement exponential backoff with jitter for 429 responses
• Cache responses where appropriate to reduce API calls
• Use streaming for large responses to improve perceived latency
• Contact support for higher rate limits on paid tiers
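The first practice above, exponential backoff with jitter, can be sketched as follows. The base delay, cap, and function name are assumptions chosen for illustration; the function prefers the Retry-After value (in seconds) when the server sends one.

```python
import random

def backoff_delay(attempt, retry_after=None, base=0.5, cap=30.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        return float(retry_after)            # the server's hint wins
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)           # "full jitter": random in [0, ceiling]
```

In a retry loop, call time.sleep(backoff_delay(attempt, response_headers.get("Retry-After"))) after each 429 and give up after a fixed number of attempts.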

Streaming

Enable streaming to receive tokens as they are generated. This reduces perceived latency for long responses. Use the stream: true parameter.

Streaming Request

{
  "model": "groq/llama-3.1-8b-instant",
  "messages": [{"role": "user", "content": "Write a haiku about coding"}],
  "stream": true
}

Streaming Response (SSE)

data: {"id":"chatcmpl-e96f1581-d9bb-46c9-8f9c-68b27e45a717","choices":[{"finish_reason":null,"index":0,"delta":{"function_call":null,"tool_calls":null,"content":"Lines","role":"assistant"}}],"created":1781641928,"model":"llama-3.1-8b-instant","object":"chat.completion.chunk","system_fingerprint":"fp_e2c608b1d6","usage":{}}

data: {"id":"chatcmpl-e96f1581-d9bb-46c9-8f9c-68b27e45a717","choices":[{"finish_reason":null,"index":0,"delta":{"function_call":null,"tool_calls":null,"content":" of"}}],"created":1781641928,"model":"llama-3.1-8b-instant","object":"chat.completion.chunk","system_fingerprint":"fp_e2c608b1d6","usage":{}}

data: {"id":"chatcmpl-e96f1581-d9bb-46c9-8f9c-68b27e45a717","choices":[{"finish_reason":"stop","index":0,"delta":{"content":null}}],"created":1781641928,"model":"llama-3.1-8b-instant","object":"chat.completion.chunk","system_fingerprint":null,"usage":{}}

data: [DONE]
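If you consume the raw SSE stream yourself instead of using an SDK, the chunks above can be reassembled like this. It is a minimal sketch that assumes each event arrives as a single "data: ..." line, as in the sample; collect_stream_text is a hypothetical helper name.

```python
import json

def collect_stream_text(lines):
    """Concatenate delta.content from 'data: ...' lines until [DONE]."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue                          # skip blank and non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break                             # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:                       # the final chunk carries content: null
                parts.append(content)
    return "".join(parts)
```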

Python Streaming Example

import os
import sys
sys.path.insert(0, os.path.expanduser("~/.yieldingbear"))
from yieldingbear import YieldingBear

client = YieldingBear.from_key_file()  # picks up ~/.hermes/secrets/yieldingbear-token

for chunk in client.chat.completions.stream(
    model="yieldingbear/grizzly-1.0g",   # or any specific model
    messages=[{"role": "user", "content": "Hello!"}],
):
    if chunk.delta:                      # chunk.delta is a str of the new text
        print(chunk.delta, end="", flush=True)

Usage & Costs

Every chat response includes cost_usd and balance_remaining_usd. For an aggregated view (totals, per-model breakdown, recent requests), call GET /api/v1/usage with your API key.

Endpoint

GET /api/v1/usage

# cURL example

curl https://yieldingbear.com/api/v1/usage \
  -H "Authorization: Bearer $YB_API_KEY"

Response (truncated)

{
  "account": { "email": "you@example.com", "credit_balance_usd": 4.99 },
  "totals": {
    "cost_usd": 0.000452,
    "prompt_tokens": 1359,
    "completion_tokens": 1020,
    "total_tokens": 2379,
    "calls": 50
  },
  "by_model": [
    { "model": "yieldingbear/grizzly-1.0g", "calls": 3, "tokens": 157, "cost_usd": 0.000031 },
    { "model": "anthropic/claude-3-5-sonnet-latest", "calls": 2, "tokens": 34, "cost_usd": 0.000306 }
  ],
  "usage": [
    {
      "id": "edfcb790-d59e-4f51-b543-f45bf39bb3e9",
      "model": "groq/llama-3.1-8b-instant",
      "provider": "groq",
      "input_tokens": 36,
      "output_tokens": 8,
      "total_tokens": 44,
      "cost_usd": 0,
      "latency_ms": 137,
      "created_at": "2026-06-16T20:27:16.396812+00:00"
    }
  ]
}
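The by_model breakdown makes it straightforward to compare what each model actually costs you. This sketch assumes the usage report has been decoded into a dict with the shape shown above; cost_per_1k_tokens is a hypothetical helper name.

```python
def cost_per_1k_tokens(usage_report: dict) -> dict:
    """Map each model in by_model to its observed USD cost per 1,000 tokens."""
    rates = {}
    for row in usage_report.get("by_model", []):
        if row["tokens"]:                     # avoid dividing by zero
            rates[row["model"]] = row["cost_usd"] / row["tokens"] * 1000
    return rates
```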

Error Handling

The API follows OpenAI's error format. All errors return a JSON body with a top-level error object. Always check for errors in your response handling.

{
  "error": {
    "message": "Invalid API key",
    "type": "invalid_request_error",
    "code": "authentication_error",
    "param": null,
    "status": 401
  }
}

Status	Code	Description
400	invalid_request	Malformed request body or missing required fields
401	authentication_error	Invalid or missing API key
402	insufficient_credits	Not enough credits for this request
403	model_not_allowed	Model requires a higher tier or is disabled
422	invalid_parameter	Parameter value is invalid (e.g., temperature > 2)
429	rate_limit_exceeded	Too many requests. Check X-RateLimit-Reset header
500	server_error	Internal server error. Retry with exponential backoff
502	upstream_error	LLM provider returned an error. Try a different model
503	service_unavailable	Service temporarily unavailable. Try again shortly
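The table suggests three broad reactions to an error: retry the same request, switch models, or stop and fix something. One way to encode that is sketched below; the grouping and the next_action name are this example's interpretation of the table, not an official policy.

```python
RETRY_SAME_REQUEST = {429, 500, 503}   # transient: back off, then retry
TRY_ANOTHER_MODEL = {502}              # upstream provider failed

def next_action(status: int) -> str:
    """Suggest how a client might react to an HTTP status code."""
    if status < 400:
        return "ok"
    if status in RETRY_SAME_REQUEST:
        return "retry"
    if status in TRY_ANOTHER_MODEL:
        return "switch_model"
    return "do_not_retry"              # 400, 401, 402, 403, 422 need a change first
```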

Billing & Credits

Free tier includes unlimited requests to Groq, Cerebras, and Together models. Paid tier gives access to premium models (OpenAI, Anthropic, Google, DeepSeek, Mistral) with a small markup over provider pricing.

Free

Price: $0 / month (best for starters)

Models: Groq, Cerebras, Together

Access: Unlimited requests

Paid

Price: Pay as you go (most popular)

Models: All providers (50+ models)

Access: Credits-based, min $10 deposit

Enterprise

Price: Custom pricing (for teams)

Models: Custom routing, dedicated support

Access: Volume discounts, SLA

Sample Pricing (Paid Tier)

Blended rate = what the routed underlying model costs on average. For exact per-tier rates, see /api/v1/pricing.

Model (Grizzly smart router, blended)	Input / 1M	Output / 1M
grizzly-1.0g (general)	$1.02	$4.88
grizzly-1.0g-coding	$0.12	$0.38
grizzly-1.0g-creative	$1.02	$4.88
grizzly-1.0g-finance	$0.95	$4.10
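The blended rates can be turned into a rough budget estimate. The sketch below copies the rates from the table above; estimate_cost_usd is a hypothetical helper, and the result is only an approximation because the actual charge depends on which underlying model each request is routed to.

```python
# Blended USD rates per 1M tokens (input, output), copied from the table above.
BLENDED_RATES = {
    "grizzly-1.0g": (1.02, 4.88),
    "grizzly-1.0g-coding": (0.12, 0.38),
    "grizzly-1.0g-creative": (1.02, 4.88),
    "grizzly-1.0g-finance": (0.95, 4.10),
}

def estimate_cost_usd(router, prompt_tokens, completion_tokens):
    """Approximate cost of one call at the blended rate for a router."""
    rate_in, rate_out = BLENDED_RATES[router]
    return (prompt_tokens * rate_in + completion_tokens * rate_out) / 1_000_000
```

For example, 2,000 prompt tokens and 500 completion tokens on grizzly-1.0g come to about $0.0045 at the blended rate.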

SDKs & Libraries

Start with the native Yielding Bear SDK — zero dependencies, vendored by the installer. Every other LLM SDK works too: just point its base_url at https://yieldingbear.com/api/v1.

Yielding Bear SDK

Zero dependencies. Vendored to ~/.yieldingbear/lib/ by the installer.

Python

Node

Frequently Asked Questions

How does Yielding Bear save money?

Our unified API sits in front of 50+ models and routes each request to the most efficient model for the task. For simple chat it routes to Groq Llama 3.3 70B (free). For complex reasoning it routes to Claude 3.5 Sonnet only when needed. Most teams see meaningful cost reductions vs direct API costs.

Do I need multiple API keys?

No. One API key from Yielding Bear connects you to all providers. We handle provider credentials, rate limits, retries, and failover automatically.

Which SDKs are supported?

Any LLM SDK that can point to an OpenAI-style base URL is supported out of the box, including the OpenAI, Anthropic, Google GenAI, Mistral, and Cohere SDKs. Just change the base URL to https://yieldingbear.com/api/v1 and add your Yielding Bear API key; everything else works the same.

What happens if a provider goes down?

Yielding Bear automatically retries on an alternative provider. Configure fallback chains in the dashboard or let our routing engine handle it automatically.

Can I fine-tune models through Yielding Bear?

Fine-tuning is not currently exposed through the public API. Contact us for enterprise fine-tuning partnerships.

How do I monitor usage?

Call GET /api/v1/usage with your API key to get totals + per-model breakdowns in JSON, or use the dashboard for charts and credit top-ups.

Can I use Yielding Bear in production today?

Yes. The API is production-ready with 99.9% uptime SLA on paid plans. Free tier is best for development and testing.

YIELDING BEAR FLAGSHIP

Ready to start routing LLMs?

Get a free API key. No credit card required. Start routing every LLM call automatically.

Get Free API Key →
