LLMetrics is a comprehensive benchmarking tool designed to evaluate and compare the performance of Large Language Model (LLM) inference APIs across various providers. It measures key metrics such as Time-to-First-Token (TTFT), Time-Between-Tokens (TBT), and overall End-to-End (E2E) latency in a standardized testing environment.
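To make these metrics concrete, the sketch below times a token stream in Python; the `stream` iterator stands in for a provider's streaming response and is an illustrative assumption, not LLMetrics' actual measurement code.

```python
import time

def measure_stream(stream):
    """Compute TTFT, mean TBT, and E2E latency (in seconds) for a token iterator.

    `stream` is any iterable that yields tokens as they arrive; it is a
    stand-in for a provider's streaming response, not the real LLMetrics API.
    """
    start = time.perf_counter()
    arrival_times = [time.perf_counter() for _ in stream]
    if not arrival_times:
        raise ValueError("stream produced no tokens")

    ttft = arrival_times[0] - start                      # Time-to-First-Token
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    tbt = sum(gaps) / len(gaps) if gaps else 0.0         # mean Time-Between-Tokens
    e2e = arrival_times[-1] - start                      # End-to-End latency
    return ttft, tbt, e2e
```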
- Standardized Testing: Uses fixed prompts, input tokens, and output tokens for consistent performance evaluation
- Provider Comparison: Benchmarks multiple LLM service providers, including cloud APIs (e.g., OpenAI, Anthropic, Cloudflare) and local servers (e.g., vLLM)
- Configurable Experiments: Define experiments via a JSON configuration file that specifies providers, models, number of requests, token sizes, and streaming mode
- Data Persistence: Stores benchmarking results in DynamoDB for scalable, historical tracking
- Visualization: Generates latency and CDF plots, which are integrated into an interactive dashboard for actionable insights
- Automated Workflows: Scheduled experiments (e.g., weekly runs on an AWS VM via GitHub Actions) ensure continuous performance monitoring
# Using HTTPS
git clone https://github.com/your-username/LLMetrics.git
# Using SSH
git clone [email protected]:your-username/LLMetrics.git
cd LLMetrics
pip install -r requirements.txt
Create a .env file in the repository root with your API keys and credentials:
AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
CLOUDFLARE_ACCOUNT_ID="your-cloudflare-account-id"
CLOUDFLARE_AI_TOKEN="your-cloudflare-ai-token"
TOGETHER_AI_API="your-together-ai-api-key"
OPEN_AI_API="your-openai-api-key"
ANTHROPIC_API="your-anthropic-api-key"
PERPLEXITY_AI_API="your-perplexity-ai-api-key"
HYPERBOLIC_API="your-hyperbolic-api-key"
GROQ_API_KEY="your-groq-api-key"
GEMINI_API_KEY="your-gemini-api-key"
AZURE_LLAMA_8B_API="your-azure-llama-8b-api-key"
AZURE_LLAMA_3.1_70B_API="your-azure-llama-70b-api-key"
MISTRAL_LARGE_API="your-mistral-large-api-key"
AWS_BEDROCK_ACCESS_KEY_ID="your-aws-bedrock-access-key-id"
AWS_BEDROCK_SECRET_ACCESS_KEY="your-aws-bedrock-secret-key"
AWS_BEDROCK_REGION="your-aws-bedrock-region"
DYNAMODB_ENDPOINT_URL="your-dynamodb-endpoint-url"
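These variables are read from the environment at runtime; the snippet below shows one way to load them in Python with python-dotenv, which is a common pattern but an assumption about this project's dependencies.

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is available

# Populate os.environ from the .env file in the repository root.
load_dotenv()

openai_key = os.environ["OPEN_AI_API"]
anthropic_key = os.environ["ANTHROPIC_API"]
dynamodb_endpoint = os.getenv("DYNAMODB_ENDPOINT_URL")  # may be unset for the default endpoint
```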
Create a config.json file to define your benchmarking experiment:
{ "providers":
[ "TogetherAI", "Cloudflare", "OpenAI", "Anthropic", "vLLM" \],
"models":
[ "common-model" ],
"num_requests": 100,
"input_tokens": 10,
"streaming": true,
"max_output": 100,
"verbose": true
}
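For orientation, a minimal sketch of consuming such a configuration is shown below; the field names match the example above, while the loop body is a placeholder rather than LLMetrics' actual benchmarking logic.

```python
import json

with open("config.json") as f:
    config = json.load(f)

for provider in config["providers"]:
    for model in config["models"]:
        # Placeholder for the real benchmark call: send config["num_requests"]
        # requests with config["input_tokens"] prompt tokens, capped at
        # config["max_output"] output tokens, streaming if config["streaming"].
        print(f"Benchmarking {model} on {provider} "
              f"({config['num_requests']} requests, streaming={config['streaming']})")
```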
python main.py -c config.json
or, when benchmarking a local vLLM server, pass its host IP:
python main.py -c config.json --vllm_ip <host ip>
LLMetrics saves the generated plots (latency graphs and CDF plots) to the benchmark_graph output directory.
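An empirical CDF of per-request latencies can be drawn with standard tooling; the sketch below uses matplotlib and NumPy, and the output file name under benchmark_graph is only an assumption.

```python
import os

import matplotlib.pyplot as plt
import numpy as np

def plot_latency_cdf(latencies_ms, out_dir="benchmark_graph", name="latency_cdf.png"):
    """Plot an empirical CDF of request latencies (milliseconds) and save it as a PNG."""
    samples = np.sort(np.asarray(latencies_ms, dtype=float))
    cdf = np.arange(1, len(samples) + 1) / len(samples)

    plt.figure()
    plt.plot(samples, cdf)
    plt.xlabel("Latency (ms)")
    plt.ylabel("Fraction of requests")
    plt.title("Latency CDF")

    os.makedirs(out_dir, exist_ok=True)
    plt.savefig(os.path.join(out_dir, name))  # file name is hypothetical
    plt.close()
```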
LLMetrics integrates with a CI/CD pipeline to run weekly experiments on an AWS VM:
- GitHub Actions: Automates weekly benchmarking runs and CI tests
- AWS VM: Provides a stable network environment for consistent benchmarking
- DynamoDB: Stores benchmark data securely for historical analysis (see the sketch after this list)
- Dashboard: An interactive dashboard deployed on GitHub Pages visualizes the results
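As an illustration of the persistence step, the sketch below writes one benchmark record to DynamoDB with boto3; the table name, key schema, and attribute names are assumptions, not the project's actual schema.

```python
import time
from decimal import Decimal

import boto3

# Table name and item layout are hypothetical; adjust them to the deployed schema.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("llmetrics-benchmarks")

table.put_item(
    Item={
        "provider": "OpenAI",           # partition key (assumed)
        "timestamp": int(time.time()),  # sort key (assumed)
        "model": "common-model",
        "ttft_ms": Decimal("412.5"),    # DynamoDB numbers must be Decimal, not float
        "e2e_ms": Decimal("2180.0"),
    }
)
```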