One API key.
Every model.
Route requests across Groq, Cerebras, OpenAI, Anthropic, Google, DeepSeek, Mistral, and 100+ more models — through a single OpenAI-compatible endpoint. Free tier included. No refactoring required.
Lightning Fast
Groq for sub-100ms latency. Production-grade throughput at free-tier pricing.
Guardrails Built-In
Content filtering, PII redaction, and prompt injection prevention on every request.
One API Key
Swap models without changing code. OpenAI-compatible — drop-in replacement.
Quick Start
npm install openai
# Make your first request
import OpenAI from 'openai';
const client = new OpenAI({
  apiKey: process.env.YB_API_KEY,
  baseURL: 'https://api.yieldingbear.com/v1',
});
const chat = await client.chat.completions.create({
  model: 'groq/llama-3.1-8b-instant', // or any model - routing is automatic
  messages: [{ role: 'user', content: 'Hello!' }],
});
console.log(chat.choices[0].message.content);
Supported Providers
Ready to cut your LLM costs?
Get a free API key. No credit card required. Start making smarter, cheaper LLM calls in minutes.
Introduction
The Yielding Bear API provides unified access to 100+ large language models through a single OpenAI-compatible endpoint. Just change your base URL — no code refactoring required.
Intelligent routing analyzes and classifies every request, then sends it to the cheapest, fastest model that can handle the task. Compared with direct API calls, this delivers cost savings of 60-80%.
The API supports both direct model selection (you pick the exact model) and automatic routing (YB picks the best model for your task). Both modes are available through the same endpoint.
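As a minimal sketch of the two modes, the only difference in the request body is the `model` value. The helper name `build_chat_body` is hypothetical; the identifiers `openai/gpt-4o` and `yb/default` are the ones this documentation uses for direct selection and automatic routing.

```python
from typing import Optional

AUTO_ROUTING_MODEL = "yb/default"  # automatic routing identifier used in these docs


def build_chat_body(prompt: str, model: Optional[str] = None) -> dict:
    """Return a chat request body; fall back to automatic routing when no model is given."""
    return {
        "model": model or AUTO_ROUTING_MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }


direct = build_chat_body("Hello!", model="openai/gpt-4o")  # you pick the model
routed = build_chat_body("Hello!")                         # YB picks the model
```

Both bodies go to the same endpoint, so switching modes should not require any other change in the calling code.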
Base URL
https://api.yieldingbear.com/v1
Direct Model Selection
Specify the exact model you want. Useful when you need a specific capability.
model: "openai/gpt-4o"
Automatic Routing
Let YB choose the optimal model. Best for cost optimization.
model: "yb/default"
Authentication
All API requests require an API key passed as a Bearer token in the Authorization header. Get your API key from the dashboard.
Authorization: Bearer yb_live_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
API Key Format
Keys start with yb_live_ (production) or yb_test_ (test). 32 random hex characters follow.
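A client can catch copy-paste mistakes early by checking the key's shape before sending a request. This is a sketch only: it assumes the 32 hex characters are lowercase, and the function name `key_environment` is hypothetical. A key that passes this check may still be revoked or invalid on the server.

```python
import re

# Assumption: hex characters are lowercase; widen the class to [0-9a-fA-F] if needed.
KEY_PATTERN = re.compile(r"yb_(live|test)_[0-9a-f]{32}")


def key_environment(key: str):
    """Return "live" or "test" for a well-formed key, or None if the shape is wrong."""
    match = KEY_PATTERN.fullmatch(key)
    return match.group(1) if match else None
```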
Key Security
Keys are hashed with SHA-256 before storage. The raw key is only shown once at creation — store it securely in environment variables or a secrets manager.
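To illustrate why the raw key cannot be shown again, here is a generic sketch of digest-based key storage using Python's standard library. It is not Yielding Bear's actual implementation; the function names are hypothetical and details such as salting or key prefixes are not described in this documentation.

```python
import hashlib
import hmac


def key_digest(raw_key: str) -> str:
    """Return the SHA-256 hex digest of a key; only this value would be stored."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def matches_stored_digest(presented_key: str, stored_digest: str) -> bool:
    """Compare digests in constant time to avoid leaking information through timing."""
    return hmac.compare_digest(key_digest(presented_key), stored_digest)
```

Because SHA-256 is one-way, a service that stores only the digest can verify a presented key but cannot recover the original.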
Never expose your key
Do not commit API keys to version control. Use environment variables (.env files) and rotate keys regularly from the dashboard.
export YB_API_KEY="yb_live_your_key_here"
# Verify your key works
curl https://api.yieldingbear.com/v1/models -H "Authorization: Bearer $YB_API_KEY"
Chat Completions
The Chat Completions endpoint is fully OpenAI-compatible. Send a POST request with messages and receive a structured response. Works with all OpenAI SDKs.
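If you prefer not to use an SDK, the request can be assembled with the Python standard library. The sketch below only prepares the request object and does not send it; `build_chat_request` is a hypothetical helper and the key shown is a placeholder.

```python
import json
import urllib.request

BASE_URL = "https://api.yieldingbear.com/v1"


def build_chat_request(api_key: str, body: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST to the chat completions endpoint."""
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # Bearer token authentication
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(
    "yb_test_placeholder",  # placeholder, not a real key
    {
        "model": "groq/llama-3.1-8b-instant",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
# To send it, pass the object to urllib.request.urlopen(request).
```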
Endpoint
POST /v1/chat/completions
Request Body
{
"model": "groq/llama-3.1-8b-instant",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "What is the capital of France?" }
],
"temperature": 0.7,
"max_tokens": 256,
"stream": false
}
cURL Example
curl -X POST https://api.yieldingbear.com/v1/chat/completions -H "Authorization: Bearer $YB_API_KEY" -H "Content-Type: application/json" -d '{
"model": "groq/llama-3.1-8b-instant",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | string | — | Model identifier (e.g. groq/llama-3.1-8b-instant) |
| messages | array | — | Array of message objects with role and content |
| temperature | float | 0.7 | Sampling temperature (0–2). Lower = more deterministic. |
| max_tokens | int | 256 | Maximum tokens to generate |
| stream | bool | false | Enable streaming responses (SSE) |
| top_p | float | 1.0 | Nucleus sampling threshold |
| frequency_penalty | float | 0 | Penalty for token frequency (-2 to 2) |
| presence_penalty | float | 0 | Penalty for token presence (-2 to 2) |
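Checking values against the documented ranges on the client side can avoid a round trip that ends in a 422 response. The sketch below covers only the ranges stated in the table above; the function name `validate_sampling_params` is hypothetical, and the server may enforce further rules not listed here.

```python
# Ranges taken from the parameter table: (minimum, maximum), both inclusive.
DOCUMENTED_RANGES = {
    "temperature": (0.0, 2.0),
    "frequency_penalty": (-2.0, 2.0),
    "presence_penalty": (-2.0, 2.0),
}


def validate_sampling_params(params: dict) -> list:
    """Return a list of problems; an empty list means every checked value is in range."""
    problems = []
    for name, (low, high) in DOCUMENTED_RANGES.items():
        if name in params and not low <= params[name] <= high:
            problems.append(f"{name} must be between {low} and {high}")
    return problems
```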
Response Format
{
"id": "chatcmpl_xxxxxxxxxxxx",
"object": "chat.completion",
"created": 1717000000,
"model": "groq/llama-3.1-8b-instant",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": "The capital of France is Paris."
},
"finish_reason": "stop"
}],
"usage": {
"prompt_tokens": 15,
"completion_tokens": 12,
"total_tokens": 27
}
}
Embeddings
Generate text embeddings using any supported model. Fully compatible with OpenAI's embeddings API. Embeddings are useful for semantic search, similarity matching, and RAG (Retrieval Augmented Generation) pipelines.
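Semantic search with embeddings usually comes down to comparing vectors. The sketch below ranks documents by cosine similarity to a query vector; the three-dimensional vectors are toy values for illustration (real embedding vectors are much longer), and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, documents: dict) -> list:
    """Return document names ordered from most to least similar to the query vector."""
    return sorted(
        documents,
        key=lambda name: cosine_similarity(query, documents[name]),
        reverse=True,
    )


# Toy vectors standing in for embeddings returned by the API.
docs = {"menu": [0.9, 0.1, 0.0], "invoice": [0.0, 0.2, 0.9]}
ranking = rank_by_similarity([1.0, 0.0, 0.0], docs)
```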
Endpoint
POST /v1/embeddings
Request
{
"model": "Cohere/embed-english-v3.0",
"input": "The food was delicious and the service was excellent."
}
curl -X POST https://api.yieldingbear.com/v1/embeddings -H "Authorization: Bearer $YB_API_KEY" -H "Content-Type: application/json" -d '{
"model": "Cohere/embed-english-v3.0",
"input": "The food was delicious and the service was excellent."
}'
Recommended Embedding Models
- Cohere/embed-english-v3.0
- Cohere/embed-multilingual-v3
- OpenAI/text-embedding-3-small
- Google/text-embedding-004
Models
List all available models. This endpoint is public and does not require authentication. Use it to discover models and check their current pricing.
GET /v1/models
curl https://api.yieldingbear.com/v1/models
Popular Models by Category
Rate Limiting
Rate limits protect the API from abuse and ensure fair access. Limits are applied per API key and vary by tier.
Free Tier
Paid Tier
Rate limit headers are included in every response:
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 59
X-RateLimit-Reset: 1717000060
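These headers can drive client-side throttling. The sketch below assumes that `X-RateLimit-Reset` is a Unix timestamp in seconds, which the example value suggests but this documentation does not state explicitly. The function names are hypothetical, and the plain-dict lookup is case-sensitive, unlike real HTTP header handling.

```python
def seconds_until_reset(headers: dict, now: float) -> float:
    """Seconds to wait until the limit window resets (assumes a Unix-seconds timestamp)."""
    reset_at = float(headers["X-RateLimit-Reset"])
    return max(0.0, reset_at - now)


def has_quota(headers: dict) -> bool:
    """True while at least one request remains in the current window."""
    return int(headers["X-RateLimit-Remaining"]) > 0


example_headers = {
    "X-RateLimit-Limit": "60",
    "X-RateLimit-Remaining": "59",
    "X-RateLimit-Reset": "1717000060",
}
wait = seconds_until_reset(example_headers, now=1717000000.0)
```

In real code, `now` would typically come from `time.time()`.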
Best Practices
- Implement exponential backoff with jitter for 429 responses
- Cache responses where appropriate to reduce API calls
- Use streaming for large responses to improve perceived latency
- Contact support for higher rate limits on paid tiers
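The first practice above can be sketched as a "full jitter" delay: the wait is drawn uniformly between zero and an exponentially growing ceiling. The `base` and `cap` values here are illustrative assumptions, not limits published by Yielding Bear, and `backoff_delay` is a hypothetical name.

```python
import random


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0, rng=random) -> float:
    """Return a random delay in [0, min(cap, base * 2**attempt)] seconds.

    attempt is 0 for the first retry; rng can be a seeded random.Random for tests.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Delays a client might sleep for before retries 1 through 5 after a 429 response.
delays = [backoff_delay(attempt) for attempt in range(5)]
```

Randomizing the delay keeps many clients that were throttled at the same moment from retrying in lockstep.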
Streaming
Enable streaming to receive tokens as they are generated. This reduces perceived latency for long responses. Use the stream: true parameter.
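When not using an SDK, the streamed text has to be reassembled by hand. The sketch below assumes the OpenAI-style server-sent events format, in which each event is a `data:` line carrying a JSON chunk and the stream ends with `data: [DONE]`. The function name `collect_stream_text` is hypothetical.

```python
import json


def collect_stream_text(lines) -> str:
    """Concatenate the delta content from an iterable of SSE lines."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


sample = [
    'data: {"id":"1","choices":[{"delta":{"content":"While"},"index":0}]}',
    'data: {"id":"1","choices":[{"delta":{"content":" my"},"index":0}]}',
    'data: {"id":"1","choices":[{"finish_reason":"stop","index":0,"delta":{}}]}',
    "data: [DONE]",
]
text = collect_stream_text(sample)
```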
Streaming Request
{
"model": "groq/llama-3.1-8b-instant",
"messages": [{"role": "user", "content": "Write a haiku about coding"}],
"stream": true
}
Streaming Response (SSE)
data: {"id":"1","choices":[{"delta":{"content":"While"},"index":0}]}
data: {"id":"1","choices":[{"delta":{"content":" my"},"index":0}]}
data: {"id":"1","choices":[{"delta":{"content":" fingers"},"index":0}]}
data: {"id":"1","choices":[{"finish_reason":"stop","index":0,"delta":{}}]}
data: [DONE]
Python Streaming Example
import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["YB_API_KEY"],
    base_url="https://api.yieldingbear.com/v1",
)
stream = client.chat.completions.create(
    model="groq/llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
# Print each token as it arrives; skip chunks that carry no content.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Batch Requests
Process multiple prompts in a single API call. Batch requests are more efficient than making individual calls and can reduce overall latency for bulk operations.
Endpoint
POST /v1/batch
curl -X POST https://api.yieldingbear.com/v1/batch -H "Authorization: Bearer $YB_API_KEY" -H "Content-Type: application/json" -d '{
"requests": [
{"model": "groq/llama-3.1-8b-instant", "messages": [{"role": "user", "content": "Hi"}]},
{"model": "groq/llama-3.1-8b-instant", "messages": [{"role": "user", "content": "Bye"}]}
]
}'
Use Cases
Batch classification
Label 1000s of texts at once
Bulk translation
Translate documents in one call
Parallel summarization
Summarize a batch of articles
Data enrichment
Add AI-generated fields to records
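For a use case such as batch classification, the request body can be generated from a list of inputs. This sketch assumes the `requests` array shape shown in the cURL example for `POST /v1/batch`; the helper name `build_batch_body` and the prompt wording are hypothetical.

```python
def build_batch_body(texts, model: str = "groq/llama-3.1-8b-instant") -> dict:
    """Wrap each text in its own chat request inside a single batch body."""
    return {
        "requests": [
            {
                "model": model,
                "messages": [
                    # Hypothetical classification prompt; adapt it to your labels.
                    {"role": "user", "content": f"Label the sentiment of this text: {text}"}
                ],
            }
            for text in texts
        ]
    }


body = build_batch_body(["Great service!", "The order arrived late."])
```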
Guardrails
Built-in safety and content filtering on every request. Guardrails run at the API layer — they add zero latency and cost nothing extra.
Prompt Injection Prevention
Detects and blocks attempts to override system instructions through user input.
PII Redaction
Automatically detects and redacts personally identifiable information (SSN, credit cards, etc).
Harmful Content Filter
Blocks requests and responses containing violence, self-harm, or illegal content.
Toxicity Scoring
Returns a toxicity score for every response. Use it to filter or flag content.
Guardrail Response Example
{
"error": {
"message": "Request blocked by guardrails: prompt injection detected",
"type": "content_policy_violation",
"code": "prompt_injection_blocked"
}
}
Error Handling
The API follows OpenAI's error format. All errors return a JSON body with a top-level error object. Always check for errors in your response handling.
{
"error": {
"message": "Invalid API key",
"type": "invalid_request_error",
"code": "authentication_error",
"param": null,
"status": 401
}
}
| Status | Code | Description |
|---|---|---|
| 400 | invalid_request | Malformed request body or missing required fields |
| 401 | authentication_error | Invalid or missing API key |
| 402 | insufficient_credits | Not enough credits for this request |
| 403 | model_not_allowed | Model requires a higher tier or is disabled |
| 422 | invalid_parameter | Parameter value is invalid (e.g., temperature > 2) |
| 429 | rate_limit_exceeded | Too many requests. Check X-RateLimit-Reset header |
| 500 | server_error | Internal server error. Retry with exponential backoff |
| 502 | upstream_error | LLM provider returned an error. Try a different model |
| 503 | service_unavailable | Service temporarily unavailable. Try again shortly |
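The descriptions in this table suggest three kinds of reaction: retry the same request, switch to another model, or fix the request before resending. A minimal sketch of that decision follows; the function name and the action labels are hypothetical, and the grouping is one reading of the table rather than official client behavior.

```python
RETRY_STATUSES = {429, 500, 503}   # rate limited or temporary server trouble
SWITCH_MODEL_STATUSES = {502}      # upstream provider error


def next_action(status: int) -> str:
    """Map an HTTP status code to a suggested client reaction."""
    if status < 400:
        return "ok"
    if status in RETRY_STATUSES:
        return "retry_with_backoff"
    if status in SWITCH_MODEL_STATUSES:
        return "try_different_model"
    return "fix_request"  # 400, 401, 402, 403, 422: retrying unchanged will not help
```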
Billing & Credits
Free tier includes unlimited requests to Groq, Cerebras, and Together models. Paid tier gives access to premium models (OpenAI, Anthropic, Google, DeepSeek, Mistral) with a 20% markup over provider pricing.
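As a worked example of the markup, the sketch below estimates a paid-tier charge from a token count. It assumes the 20% is applied uniformly to the provider's per-token price. The price of $0.50 per million tokens is a made-up figure for illustration, not a quoted rate, and `paid_tier_cost` is a hypothetical name.

```python
MARKUP = 0.20  # paid-tier markup over provider pricing, as stated above


def paid_tier_cost(tokens: int, provider_price_per_million: float) -> float:
    """Estimated charge in dollars: provider cost plus the markup."""
    provider_cost = tokens / 1_000_000 * provider_price_per_million
    return provider_cost * (1 + MARKUP)


# 2,000,000 tokens at a hypothetical $0.50 per 1M tokens:
# provider cost is $1.00, so the estimated charge is about $1.20.
estimate = paid_tier_cost(2_000_000, 0.50)
```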
Free
BEST FOR STARTERS
Models: Groq, Cerebras, Together
Access: Unlimited requests
Paid
MOST POPULAR
Models: All providers (100+ models)
Access: Credits-based, min $10 deposit
Enterprise
FOR TEAMS
Models: Custom routing, dedicated support
Access: Volume discounts, SLA
Sample Pricing (Paid Tier)
SDKs & Libraries
Use any OpenAI-compatible SDK. Just change the base URL and add your API key. Official SDKs and community libraries are all supported.
Python
JavaScript
Go
Rust
import os
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model="groq/llama-3.1-8b-instant",
    api_key=os.environ["YB_API_KEY"],
    base_url="https://api.yieldingbear.com/v1",
)
response = llm.invoke("Hello!")
Frequently Asked Questions
How does Yielding Bear save money?
Our unified API sits in front of 16+ LLM providers and routes each request to the cheapest model that can handle the task. For simple tasks like summarization, it routes to Llama 3B at $0.04/1M. For complex reasoning, it routes to Claude 70B only when needed. Most teams save 60-80% vs direct API costs.
Do I need multiple API keys?
No. One API key from Yielding Bear connects you to all providers. We handle provider credentials, rate limits, retries, and failover automatically.
Is the API OpenAI-compatible?
Yes. The API is a drop-in replacement for the OpenAI API. Change your base URL to https://api.yieldingbear.com/v1 and add your YB API key — everything else works the same.
What happens if a provider goes down?
YB automatically retries on an alternative provider. Configure fallback chains in the dashboard or let our routing engine handle it automatically.
Can I fine-tune models through YB?
Yes, we support fine-tuning through Together.ai and other providers. Contact us for custom fine-tuning on enterprise plans.
What's included in guardrails?
Prompt injection detection, PII redaction, harmful content filtering, and toxicity scoring — all included on every request at no extra cost.
How do I monitor usage?
Use the dashboard to track API usage, costs by model, and error rates. API access logs are available for 30 days on all plans.
Can I use YB in production today?
Yes. The API is production-ready with 99.9% uptime SLA on paid plans. Free tier is best for development and testing.