At Galadrim, I led a comprehensive study on the performance of various market LLMs (OpenAI, Anthropic, Mistral, Llama).
The goal was to define cost/performance benchmarks to guide our RAG architecture choices.
I developed an automated testing protocol measuring latency (TTFT, TPS) and response quality on specific tasks (summarization, extraction, classification).