One API. Every model.

One API key.
Every model.

Route requests across Groq, Cerebras, OpenAI, Anthropic, Google, DeepSeek, Mistral, and other providers (100+ models in total) through a single OpenAI-compatible endpoint. Free tier included. No refactoring required.

Lightning Fast

Groq for sub-100ms latency. Production-grade throughput at free-tier pricing.

🛡️

Guardrails Built-In

Content filtering, PII redaction, and prompt injection prevention on every request.

🔑

One API Key

Swap models without changing code. OpenAI-compatible — drop-in replacement.

Quick Start

# Install the OpenAI SDK
npm install openai

// Make your first request (Node.js, ES modules)
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: process.env.YB_API_KEY,
  baseURL: 'https://api.yieldingbear.com/v1',
});

const chat = await client.chat.completions.create({
  model: 'groq/llama-3.1-8b-instant',  // or 'yb/default' to let YB route automatically
  messages: [{ role: 'user', content: 'Hello!' }],
});

console.log(chat.choices[0].message.content);

Supported Providers

groq
cerebras
openai
anthropic
google
deepseek
mistral
meta
YB FLAGSHIP

Ready to cut your LLM costs?

Get a free API key. No credit card required. Start making smarter, cheaper LLM calls in minutes.

Introduction

The Yielding Bear API provides unified access to 100+ large language models through a single OpenAI-compatible endpoint. Just change your base URL — no code refactoring required.

Our intelligent routing analyzes and classifies every request, then sends it to the cheapest, fastest model that can handle the task. This delivers 60-80% cost savings compared with calling providers directly.

The API supports both direct model selection (you pick the exact model) and automatic routing (YB picks the best model for your task). Both modes are available through the same endpoint.

Base URL

https://api.yieldingbear.com/v1

Direct Model Selection

Specify the exact model you want. Useful when you need a specific capability.

model: "openai/gpt-4o"

Automatic Routing

Let YB choose the optimal model. Best for cost optimization.

model: "yb/default"
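
As a rough sketch of how the two modes differ on the wire, the helper below builds a Chat Completions request body for either one. The function name `build_chat_request` is hypothetical and used only for illustration; the two `model` values are the ones shown on this page.

```python
from typing import Optional

AUTO_ROUTING_MODEL = "yb/default"  # automatic routing identifier from this page


def build_chat_request(prompt: str, model: Optional[str] = None) -> dict:
    """Build a chat completions body (hypothetical helper).

    Passing a model such as "openai/gpt-4o" pins the request to that model.
    Omitting it falls back to automatic routing via "yb/default".
    """
    return {
        "model": model or AUTO_ROUTING_MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }


direct = build_chat_request("Summarize this ticket.", model="openai/gpt-4o")
routed = build_chat_request("Summarize this ticket.")
print(direct["model"], routed["model"])
```

Because only the `model` string changes, switching between modes should not require touching the rest of the request.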

Authentication

All API requests require an API key passed as a Bearer token in the Authorization header. Get your API key from the dashboard.

Authorization: Bearer yb_live_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
🔑

API Key Format

Keys start with yb_live_ (production) or yb_test_ (test). 32 random hex characters follow.
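
A quick local format check can catch truncated or mistyped keys before a request is sent. The sketch below assumes the 32 hex characters are lowercase, which this page does not state, and `key_environment` is a hypothetical helper name.

```python
import re
from typing import Optional

# Assumption: hex characters are lowercase. Widen the class to [0-9a-fA-F]
# if real keys turn out to use uppercase letters.
KEY_PATTERN = re.compile(r"yb_(live|test)_[0-9a-f]{32}")


def key_environment(api_key: str) -> Optional[str]:
    """Return "live" or "test" for a well-formed key, otherwise None."""
    match = KEY_PATTERN.fullmatch(api_key)
    return match.group(1) if match else None


print(key_environment("yb_test_" + "0123456789abcdef" * 2))
```

This only validates the shape of the key. Whether the key is actually active can only be confirmed by the API.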

🔒

Key Security

Keys are hashed with SHA-256 before storage. The raw key is only shown once at creation — store it securely in environment variables or a secrets manager.
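
To illustrate why a hashed key cannot be shown again, here is a minimal sketch of the general hash-then-compare technique using Python's standard library. It is not Yielding Bear's server code; the function names are hypothetical.

```python
import hashlib
import hmac


def hash_key(raw_key: str) -> str:
    """Return the hex SHA-256 digest of a raw API key."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def key_matches(presented_key: str, stored_digest: str) -> bool:
    """Hash the presented key and compare digests in constant time."""
    return hmac.compare_digest(hash_key(presented_key), stored_digest)


# Only the digest is kept, so the raw key cannot be recovered from storage.
stored = hash_key("yb_test_" + "0" * 32)
print(len(stored), key_matches("yb_test_" + "0" * 32, stored))
```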

⚠️

Never expose your key

Do not commit API keys to version control. Use environment variables (.env files) and rotate keys regularly from the dashboard.

# Set your API key
export YB_API_KEY="yb_live_your_key_here"

# Verify your key works
curl https://api.yieldingbear.com/v1/models \
  -H "Authorization: Bearer $YB_API_KEY"

Chat Completions

The Chat Completions endpoint is fully OpenAI-compatible. Send a POST request with messages and receive a structured response. Works with all OpenAI SDKs.

Endpoint

POST /v1/chat/completions

Request Body

{
  "model": "groq/llama-3.1-8b-instant",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is the capital of France?" }
  ],
  "temperature": 0.7,
  "max_tokens": 256,
  "stream": false
}

cURL Example

curl -X POST https://api.yieldingbear.com/v1/chat/completions \
  -H "Authorization: Bearer $YB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "groq/llama-3.1-8b-instant",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Parameters

Parameter          Type    Default  Description
model              string  -        Model identifier (e.g. groq/llama-3.1-8b-instant)
messages           array   -        Array of message objects with role and content
temperature        float   0.7      Sampling temperature (0–2). Lower = more deterministic.
max_tokens         int     256      Maximum tokens to generate
stream             bool    false    Enable streaming responses (SSE)
top_p              float   1.0      Nucleus sampling threshold
frequency_penalty  float   0        Penalty for token frequency (-2 to 2)
presence_penalty   float   0        Penalty for token presence (-2 to 2)

Response Format

{
  "id": "chatcmpl_xxxxxxxxxxxx",
  "object": "chat.completion",
  "created": 1717000000,
  "model": "groq/llama-3.1-8b-instant",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "The capital of France is Paris."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 12,
    "total_tokens": 27
  }
}
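
For callers not using an SDK, the fields of interest can be pulled from the response body directly. The sketch below parses the sample response shown above; `summarize_completion` is a hypothetical helper name.

```python
import json

RAW_RESPONSE = """
{
  "id": "chatcmpl_xxxxxxxxxxxx",
  "object": "chat.completion",
  "created": 1717000000,
  "model": "groq/llama-3.1-8b-instant",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "The capital of France is Paris."},
    "finish_reason": "stop"
  }],
  "usage": {"prompt_tokens": 15, "completion_tokens": 12, "total_tokens": 27}
}
"""


def summarize_completion(body: dict) -> dict:
    """Extract the reply text, completion status, and token count."""
    choice = body["choices"][0]
    usage = body.get("usage", {})
    return {
        "text": choice["message"]["content"],
        "finished": choice["finish_reason"] == "stop",
        "total_tokens": usage.get("total_tokens"),
    }


summary = summarize_completion(json.loads(RAW_RESPONSE))
print(summary["text"])
```

Checking `finish_reason` is worthwhile because a value other than "stop" may indicate the reply was cut short, for example by `max_tokens`.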

Embeddings

Generate text embeddings using any supported model. Fully compatible with OpenAI's embeddings API. Embeddings are useful for semantic search, similarity matching, and RAG (Retrieval Augmented Generation) pipelines.

Endpoint

POST /v1/embeddings

Request

{
  "model": "Cohere/embed-english-v3.0",
  "input": "The food was delicious and the service was excellent."
}
# cURL example
curl -X POST https://api.yieldingbear.com/v1/embeddings \
  -H "Authorization: Bearer $YB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Cohere/embed-english-v3.0",
    "input": "The food was delicious and the service was excellent."
  }'

Recommended Embedding Models

Model                          Dimensions  Price
Cohere/embed-english-v3.0      1024        $0.10/1M tokens
Cohere/embed-multilingual-v3   1024        $0.10/1M tokens
OpenAI/text-embedding-3-small  1536        $0.02/1M tokens
Google/text-embedding-004      768         $0.01/1M tokens
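
Once embeddings are returned, semantic search usually comes down to ranking documents by cosine similarity to the query vector. The sketch below shows that step in plain Python. The toy 3-dimensional vectors stand in for real embeddings, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar."""
    scores = [cosine_similarity(query_vec, doc) for doc in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors for illustration only; real embeddings have hundreds of dimensions.
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
print(rank_documents(query, docs))
```

Vectors from different embedding models are not comparable, so the query and the documents should be embedded with the same model.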

Models

List all available models. This endpoint is public and does not require authentication. Use it to discover models and check their current pricing.

GET /v1/models
# List all available models (no API key required)
curl https://api.yieldingbear.com/v1/models

Popular Models by Category

Fast / Low Latency: Groq Llama 3.1 8B, Cerebras Llama 3.1 70B, Groq Mixtral 8x7B
Reasoning / Coding: Claude 3.7 Sonnet, GPT-4o, DeepSeek R1
Cost Efficient: Llama 3.1 8B ($0.04/1M), DeepSeek V3 ($0.07/1M), Gemini 2.0 Flash ($0.10/1M)
Vision / Multimodal: GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash

Rate Limiting

Rate limits protect the API from abuse and ensure fair access. Limits are applied per API key and vary by tier.

Free Tier

Requests/min: 60
Requests/day: Unlimited
Models: Groq, Cerebras, Together only

Paid Tier

Requests/min: 1,000+
Requests/day: Unlimited
Models: All providers, custom limits available

Rate limit headers are included in every response:

X-RateLimit-Limit: 60
X-RateLimit-Remaining: 59
X-RateLimit-Reset: 1717000060
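
A client can read these headers to decide how long to pause. The sketch below assumes `X-RateLimit-Reset` is a Unix timestamp in seconds, which the sample value suggests but this page does not state. The helper names are hypothetical.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Seconds until the rate limit window resets, never negative.

    Assumes X-RateLimit-Reset is a Unix timestamp in seconds.
    """
    current = time.time() if now is None else now
    return max(0.0, float(headers["X-RateLimit-Reset"]) - current)


def is_exhausted(headers: Mapping[str, str]) -> bool:
    """True when no requests remain in the current window."""
    return int(headers["X-RateLimit-Remaining"]) <= 0


sample_headers = {
    "X-RateLimit-Limit": "60",
    "X-RateLimit-Remaining": "59",
    "X-RateLimit-Reset": "1717000060",
}
print(is_exhausted(sample_headers), seconds_until_reset(sample_headers, now=1717000000))
```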

Best Practices

  • Implement exponential backoff with jitter for 429 responses
  • Cache responses where appropriate to reduce API calls
  • Use streaming for large responses to improve perceived latency
  • Contact support for higher rate limits on paid tiers
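
The first practice above can be sketched as follows. This is one common variant ("full jitter"), not a prescribed algorithm. The base delay, cap, and attempt count are arbitrary illustrative values, and `send` is a hypothetical zero-argument callable returning a `(status_code, body)` pair.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)


def call_with_retries(send, max_attempts: int = 5, sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            break
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))
    return status, body


# Demo with a fake sender: two 429s, then success. No real sleeping happens
# because the recorded list's append method replaces time.sleep.
fake_responses = iter([(429, None), (429, None), (200, {"ok": True})])
waits = []
print(call_with_retries(lambda: next(fake_responses), sleep=waits.append), len(waits))
```

The random jitter spreads retries out so that many clients hitting the limit at once do not all retry at the same moment.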

Streaming

Enable streaming to receive tokens as they are generated. This reduces perceived latency for long responses. Use the stream: true parameter.

Streaming Request

{
  "model": "groq/llama-3.1-8b-instant",
  "messages": [{"role": "user", "content": "Write a haiku about coding"}],
  "stream": true
}

Streaming Response (SSE)

data: {"id":"1","choices":[{"delta":{"content":"While"},"index":0}]}

data: {"id":"1","choices":[{"delta":{"content":" my"},"index":0}]}

data: {"id":"1","choices":[{"delta":{"content":" fingers"},"index":0}]}

data: {"id":"1","choices":[{"finish_reason":"stop","index":0,"delta":{}}]}

data: [DONE]
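
Without an SDK, the stream can be consumed by reading each `data:` line, decoding its JSON, and stopping at `[DONE]`. The sketch below runs over the sample lines shown above; `collect_stream_text` is a hypothetical helper name.

```python
import json


def collect_stream_text(lines):
    """Concatenate delta content from 'data:' lines until [DONE]."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


sample_stream = [
    'data: {"id":"1","choices":[{"delta":{"content":"While"},"index":0}]}',
    "",
    'data: {"id":"1","choices":[{"delta":{"content":" my"},"index":0}]}',
    "",
    'data: {"id":"1","choices":[{"delta":{"content":" fingers"},"index":0}]}',
    "",
    'data: {"id":"1","choices":[{"finish_reason":"stop","index":0,"delta":{}}]}',
    "",
    "data: [DONE]",
]
print(collect_stream_text(sample_stream))  # prints: While my fingers
```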

Python Streaming Example

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["YB_API_KEY"],
    base_url="https://api.yieldingbear.com/v1",
)

stream = client.chat.completions.create(
    model="groq/llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Batch Requests

Process multiple prompts in a single API call. Batch requests are more efficient than making individual calls and can reduce overall latency for bulk operations.

Endpoint

POST /v1/batch
curl -X POST https://api.yieldingbear.com/v1/batch \
  -H "Authorization: Bearer $YB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "requests": [
      {"model": "groq/llama-3.1-8b-instant", "messages": [{"role": "user", "content": "Hi"}]},
      {"model": "groq/llama-3.1-8b-instant", "messages": [{"role": "user", "content": "Bye"}]}
    ]
  }'
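
When the prompts come from a list, the `requests` array can be generated rather than written by hand. The sketch below builds the same body as the cURL example. The helper names are hypothetical, and the chunk size of 2 is arbitrary, since this page does not state a maximum batch size or the response format.

```python
import json


def build_batch_body(prompts, model="groq/llama-3.1-8b-instant"):
    """Build a /v1/batch body with one single-turn request per prompt."""
    return {
        "requests": [
            {"model": model, "messages": [{"role": "user", "content": prompt}]}
            for prompt in prompts
        ]
    }


def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


texts = ["Hi", "Bye", "Thanks"]
# Chunk size is an illustrative assumption, not a documented limit.
bodies = [build_batch_body(group) for group in chunked(texts, 2)]
print(json.dumps(bodies[0], indent=2))
```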

Use Cases

Batch classification

Label thousands of texts at once

Bulk translation

Translate documents in one call

Parallel summarization

Summarize a batch of articles

Data enrichment

Add AI-generated fields to records

Guardrails

Built-in safety and content filtering on every request. Guardrails run at the API layer — they add zero latency and cost nothing extra.

🚫

Prompt Injection Prevention

Detects and blocks attempts to override system instructions through user input.

🔒

PII Redaction

Automatically detects and redacts personally identifiable information (SSN, credit cards, etc).

🛡️

Harmful Content Filter

Blocks requests and responses containing violence, self-harm, or illegal content.

📊

Toxicity Scoring

Returns a toxicity score for every response. Use it to filter or flag content.

Guardrail Response Example

{
  "error": {
    "message": "Request blocked by guardrails: prompt injection detected",
    "type": "content_policy_violation",
    "code": "prompt_injection_blocked"
  }
}

Error Handling

The API follows OpenAI's error format. All errors return a JSON body with a top-level error object. Always check for errors in your response handling.

{
  "error": {
    "message": "Invalid API key",
    "type": "invalid_request_error",
    "code": "authentication_error",
    "param": null,
    "status": 401
  }
}
Status  Code                  Description
400     invalid_request       Malformed request body or missing required fields
401     authentication_error  Invalid or missing API key
402     insufficient_credits  Not enough credits for this request
403     model_not_allowed     Model requires a higher tier or is disabled
422     invalid_parameter     Parameter value is invalid (e.g., temperature > 2)
429     rate_limit_exceeded   Too many requests. Check X-RateLimit-Reset header
500     server_error          Internal server error. Retry with exponential backoff
502     upstream_error        LLM provider returned an error. Try a different model
503     service_unavailable   Service temporarily unavailable. Try again shortly
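
One way to act on these codes is a small lookup from HTTP status to a suggested next step. The sketch below follows the descriptions in the table. The action labels and the `describe_error` helper are hypothetical names chosen for illustration.

```python
# Suggested next step per HTTP status, paraphrasing the table above.
ACTIONS = {
    400: "fix_request",
    401: "check_api_key",
    402: "add_credits",
    403: "change_model_or_tier",
    422: "fix_request",
    429: "wait_for_reset",
    500: "retry_with_backoff",
    502: "try_another_model",
    503: "retry_shortly",
}


def describe_error(status: int, body: dict) -> dict:
    """Combine the error object's fields with a suggested action."""
    error = body.get("error") or {}
    return {
        "code": error.get("code"),
        "message": error.get("message", "unknown error"),
        "action": ACTIONS.get(status, "report_to_support"),
    }


sample_error = {
    "error": {
        "message": "Invalid API key",
        "type": "invalid_request_error",
        "code": "authentication_error",
        "param": None,
        "status": 401,
    }
}
print(describe_error(401, sample_error))
```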

Billing & Credits

The free tier includes unlimited requests (subject to the 60 requests/min rate limit) to Groq, Cerebras, and Together models. The paid tier adds access to premium models (OpenAI, Anthropic, Google, DeepSeek, Mistral) at a 20% markup over provider pricing.

Free

BEST FOR STARTERS
$0 / month

Models: Groq, Cerebras, Together

Access: Unlimited requests

Paid

MOST POPULAR
Pay as you go

Models: All providers (100+ models)

Access: Credits-based, min $10 deposit

Enterprise

FOR TEAMS
Custom pricing

Models: Custom routing, dedicated support

Access: Volume discounts, SLA

Sample Pricing (Paid Tier)

Model                        Input     Output
Groq Llama 3.1 8B            $0.04/1M  $0.04/1M
OpenAI GPT-4o-mini           $0.18/1M  $0.72/1M
Anthropic Claude 3.5 Sonnet  $0.96/1M  $4.80/1M
Google Gemini 2.0 Flash      $0.12/1M  $0.48/1M
DeepSeek V3                  $0.08/1M  $0.32/1M
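
The per-request cost follows from the token counts in the `usage` object and the per-million prices above. The sketch below uses `Decimal` to avoid floating-point rounding. The dictionary keys are the display names from this page, not API model identifiers, and `estimate_cost` is a hypothetical helper.

```python
from decimal import Decimal

# Sample paid-tier prices from this page, in USD per 1M tokens: (input, output).
PRICES = {
    "OpenAI GPT-4o-mini": (Decimal("0.18"), Decimal("0.72")),
    "DeepSeek V3": (Decimal("0.08"), Decimal("0.32")),
}
MILLION = Decimal(1_000_000)


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> Decimal:
    """Cost in USD = (input price * prompt + output price * completion) / 1M."""
    price_in, price_out = PRICES[model]
    return (price_in * prompt_tokens + price_out * completion_tokens) / MILLION


# 15 prompt tokens and 12 completion tokens, as in the sample usage object.
print(estimate_cost("OpenAI GPT-4o-mini", 15, 12))
```

Output tokens are priced higher than input tokens for most models listed, so `max_tokens` is often the more effective lever on cost.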

SDKs & Libraries

Use any OpenAI-compatible SDK. Just change the base URL and add your API key. Official SDKs and community libraries are all supported.

Python: OpenAI SDK, LangChain, LlamaIndex, DSPy
JavaScript: OpenAI SDK (Node), Vercel AI SDK, LangChain.js
Go: go-openai, LangChain Go
Rust: rust-openai, reqwest + serde

# Python with LangChain
import os

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="groq/llama-3.1-8b-instant",
    api_key=os.environ["YB_API_KEY"],
    base_url="https://api.yieldingbear.com/v1",
)

response = llm.invoke("Hello!")

Frequently Asked Questions

How does Yielding Bear save money?

Our unified API sits in front of 16+ LLM providers and routes each request to the cheapest model that can handle the task. Simple tasks such as summarization go to a small model like Llama 3.1 8B at $0.04/1M tokens, while complex reasoning goes to a larger model such as Claude 3.7 Sonnet only when needed. Most teams save 60-80% vs direct API costs.

Do I need multiple API keys?

No. One API key from Yielding Bear connects you to all providers. We handle provider credentials, rate limits, retries, and failover automatically.

Is the API OpenAI-compatible?

Yes. The API is a drop-in replacement for the OpenAI API. Change your base URL to https://api.yieldingbear.com/v1 and add your YB API key — everything else works the same.

What happens if a provider goes down?

YB automatically retries on an alternative provider. Configure fallback chains in the dashboard or let our routing engine handle it automatically.

Can I fine-tune models through YB?

Yes, we support fine-tuning through Together.ai and other providers. Contact us for custom fine-tuning on enterprise plans.

What's included in guardrails?

Prompt injection detection, PII redaction, harmful content filtering, and toxicity scoring — all included on every request at no extra cost.

How do I monitor usage?

Use the dashboard to track API usage, costs by model, and error rates. API access logs are available for 30 days on all plans.

Can I use YB in production today?

Yes. The API is production-ready, with a 99.9% uptime SLA on paid plans. The free tier is best suited to development and testing.