Evaluating large language models (LLMs) has emerged as a pivotal challenge in advancing the reliability and utility of artificial intelligence across both academic and industrial settings. As the capabilities of these models grow, so does the need for rigorous, reproducible, and multi-faceted evaluation methodologies. In this tutorial, we take a comprehensive look at one of the field's most critical frontiers: systematically assessing the strengths and limitations of LLMs across various dimensions of performance. Using Google's cutting-edge Generative AI models as benchmarks and the LangChain library as our orchestration tool, we present a robust and modular evaluation pipeline tailored for implementation in Google Colab. The framework integrates criterion-based scoring, covering correctness, relevance, coherence, and conciseness, with pairwise model comparisons and rich visual analytics to deliver nuanced, actionable insights. Grounded in expert-validated question sets and objective ground-truth answers, this approach balances quantitative rigor with practical adaptability, offering researchers and developers a ready-to-use, extensible toolkit for high-fidelity LLM evaluation.
!pip install langchain langchain-google-genai ragas pandas matplotlib
We install the key Python libraries for building and running AI-powered workflows: LangChain for orchestrating LLM interactions (with the langchain-google-genai extension for Google's generative AI), Ragas for evaluating retrieval-augmented generation, and pandas plus matplotlib for data manipulation and visualization.
import os
import pandas as pd
import matplotlib.pyplot as plt
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.evaluation import load_evaluator
from langchain.schema import HumanMessage
We import core Python utilities, including os for environment management, pandas for handling DataFrames, and matplotlib.pyplot for plotting, alongside LangChain's Google Generative AI client, prompt templating, chain construction, evaluator loader, and the HumanMessage schema, to build and assess conversational LLM pipelines.
os.environ["GOOGLE_API_KEY"] = "Use Your API Key"
Here, we configure the environment by storing your Google API key in the GOOGLE_API_KEY variable, allowing the LangChain Google Generative AI client to authenticate requests.
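If you would rather not hard-code the key in the notebook, a minimal alternative sketch (assuming an interactive Colab or Jupyter session) is to prompt for it at runtime:

from getpass import getpass
import os

# Ask for the key interactively so it never appears in the saved notebook.
if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass("Enter your Google API key: ")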
def create_evaluation_dataset():
    """Create a simple dataset for evaluation."""
    questions = [
        "Explain the concept of quantum computing in simple terms.",
        "How does a neural network learn?",
        "What are the main differences between SQL and NoSQL databases?",
        "Explain how blockchain technology works.",
        "What is the difference between supervised and unsupervised learning?"
    ]
    ground_truth = [
        "Quantum computing uses quantum bits or qubits that can exist in multiple states simultaneously, unlike classical bits. This allows quantum computers to process certain types of information much faster than classical computers for specific problems.",
        "Neural networks learn through a process called backpropagation where they adjust the weights between neurons based on the error between predicted and actual outputs, gradually minimizing this error through many iterations of training data.",
        "SQL databases are relational with structured schemas, fixed tables, and use SQL for queries. NoSQL databases are non-relational, schema-flexible, and designed for specific data models like document, key-value, wide-column, or graph formats.",
        "Blockchain is a distributed ledger technology where data is stored in blocks that are linked cryptographically. Each block contains transaction data and a timestamp, creating an immutable chain. Consensus mechanisms verify transactions without central authority.",
        "Supervised learning uses labeled data where the algorithm learns to predict outputs based on input-output pairs. Unsupervised learning works with unlabeled data to find patterns or structures without predefined outputs."
    ]
    return pd.DataFrame({"question": questions, "ground_truth": ground_truth})
We construct a small evaluation DataFrame by pairing five example questions on AI and database concepts with their corresponding ground-truth answers, making it easy to benchmark an LLM's responses against predefined correct outputs.
def setup_models():
    """Set up different Google Generative AI models for comparison."""
    models = {
        "gemini-2.0-flash-lite": ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0),
        "gemini-2.0-flash": ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0)
    }
    return models
This function instantiates two zero-temperature ChatGoogleGenerativeAI clients, one using the lightweight "gemini-2.0-flash-lite" model and the other the full "gemini-2.0-flash" model, so you can easily compare their outputs side by side.
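Before running the full pipeline, it can help to confirm that authentication and model access work. The following is a minimal sanity-check sketch (the exact response text will of course vary):

# Quick smoke test: invoke one model with a trivial prompt.
models = setup_models()
sample = models["gemini-2.0-flash"].invoke([HumanMessage(content="Say hello in one short sentence.")])
print(sample.content)  # a short greeting confirms the API key and model access are working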
def generate_responses(models, dataset):
    """Generate responses from each model for the questions in the dataset."""
    responses = {}
    for model_name, model in models.items():
        model_responses = []
        for question in dataset["question"]:
            try:
                response = model.invoke([HumanMessage(content=question)])
                model_responses.append(response.content)
            except Exception as e:
                print(f"Error with model {model_name} on question: {question}")
                print(f"Error: {e}")
                model_responses.append("Error generating response")
        responses[model_name] = model_responses
    return responses
This function loops through each configured model and each question in the dataset, invokes the model to generate a response, catches any errors (logging them and inserting a placeholder), and returns a dictionary mapping each model's name to its list of generated answers.
def evaluate_responses(models, dataset, responses):
    """Evaluate model responses using different evaluation criteria."""
    evaluator_model = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0)
    reference_criteria = ["correctness"]
    reference_free_criteria = [
        "relevance",
        "coherence",
        "conciseness"
    ]
    results = {model_name: {criterion: [] for criterion in reference_criteria + reference_free_criteria}
               for model_name in models.keys()}
    for criterion in reference_criteria:
        evaluator = load_evaluator("labeled_criteria", criteria=criterion, llm=evaluator_model)
        for model_name in models.keys():
            for i, question in enumerate(dataset["question"]):
                ground_truth = dataset["ground_truth"][i]
                response = responses[model_name][i]
                if response != "Error generating response":
                    eval_result = evaluator.evaluate_strings(
                        prediction=response,
                        reference=ground_truth,
                        input=question
                    )
                    normalized_score = float(eval_result.get('score', 0)) * 2
                    results[model_name][criterion].append(normalized_score)
                else:
                    results[model_name][criterion].append(0)
    for criterion in reference_free_criteria:
        evaluator = load_evaluator("criteria", criteria=criterion, llm=evaluator_model)
        for model_name in models.keys():
            for i, question in enumerate(dataset["question"]):
                response = responses[model_name][i]
                if response != "Error generating response":
                    eval_result = evaluator.evaluate_strings(
                        prediction=response,
                        input=question
                    )
                    normalized_score = float(eval_result.get('score', 0)) * 2
                    results[model_name][criterion].append(normalized_score)
                else:
                    results[model_name][criterion].append(0)
    return results
This function uses a "gemini-2.0-flash-lite" evaluator to score each model's answers on both reference-based correctness and reference-free metrics (relevance, coherence, conciseness), normalizes those scores, and returns a nested dictionary mapping each model and criterion to its list of evaluation results.
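To see what a single criterion evaluation looks like before running the whole loop, you can call an evaluator directly. The sketch below assumes LangChain's criteria evaluators return a dictionary with 'reasoning', 'value', and 'score' fields; the exact reasoning text will vary from run to run:

# Standalone check of one reference-free criterion on one prediction.
evaluator = load_evaluator("criteria", criteria="conciseness",
                           llm=ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0))
result = evaluator.evaluate_strings(
    prediction="Blockchain is a cryptographically linked, append-only ledger shared across many nodes.",
    input="Explain how blockchain technology works."
)
print(result)  # typically a dict with 'reasoning', 'value' (Y/N), and 'score' (1 or 0)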
def calculate_average_scores(evaluation_results):
    """Calculate average scores for each model and criterion."""
    avg_scores = {}
    for model_name, criteria in evaluation_results.items():
        avg_scores[model_name] = {}
        for criterion, scores in criteria.items():
            if scores:
                avg_scores[model_name][criterion] = sum(scores) / len(scores)
            else:
                avg_scores[model_name][criterion] = 0
        all_scores = [score for criterion_scores in criteria.values() for score in criterion_scores if score is not None]
        if all_scores:
            avg_scores[model_name]["overall"] = sum(all_scores) / len(all_scores)
        else:
            avg_scores[model_name]["overall"] = 0
    return avg_scores
This function processes the nested evaluation results to compute the mean score for each criterion across all questions for every model. It also calculates an overall average by pooling all individual metric scores. The returned dictionary maps each model to its per-criterion averages and an aggregated "overall" performance score.
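As a quick spot check, the nested dictionary of averages also converts cleanly into a pandas table (a small illustrative sketch, assuming evaluation_results has already been computed):

# Criteria as rows, models as columns.
avg_scores = calculate_average_scores(evaluation_results)
print(pd.DataFrame(avg_scores).round(2))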
def visualize_results(avg_scores):
    """Visualize evaluation results with bar charts."""
    models = list(avg_scores.keys())
    criteria = list(avg_scores[models[0]].keys())
    plt.figure(figsize=(14, 8))
    bar_width = 0.8 / len(models)
    positions = range(len(criteria))
    for i, model in enumerate(models):
        model_scores = [avg_scores[model][criterion] for criterion in criteria]
        plt.bar([p + i * bar_width for p in positions], model_scores,
                width=bar_width, label=model)
    plt.xlabel('Evaluation Criteria', fontsize=12)
    plt.ylabel('Average Score', fontsize=12)
    plt.title('LLM Model Comparison by Evaluation Criteria', fontsize=14)
    plt.xticks([p + bar_width * (len(models) - 1) / 2 for p in positions], criteria)
    plt.legend()
    plt.grid(axis="y", linestyle="--", alpha=0.7)
    plt.tight_layout()
    plt.show()
    plt.figure(figsize=(10, 8))
    categories = [c for c in criteria if c != 'overall']
    N = len(categories)
    angles = [n / float(N) * 2 * 3.14159 for n in range(N)]
    angles += angles[:1]
    plt.polar(angles, [0] * (N + 1))
    plt.xticks(angles[:-1], categories)
    for model in models:
        values = [avg_scores[model][c] for c in categories]
        values += values[:1]
        plt.polar(angles, values, label=model)
    plt.legend(loc="upper right")
    plt.title('LLM Model Comparison - Radar Chart', fontsize=14)
    plt.tight_layout()
    plt.show()
This function creates side-by-side bar charts to compare each model's average scores across all evaluation criteria, then renders a radar chart to visualize their performance profiles, enabling quick identification of relative strengths and weaknesses.
def main():
    print("Creating evaluation dataset...")
    dataset = create_evaluation_dataset()
    print("Setting up models...")
    models = setup_models()
    print("Generating responses...")
    responses = generate_responses(models, dataset)
    print("Evaluating responses...")
    evaluation_results = evaluate_responses(models, dataset, responses)
    print("Calculating average scores...")
    avg_scores = calculate_average_scores(evaluation_results)
    print("Average scores:")
    for model, scores in avg_scores.items():
        print(f"\n{model}:")
        for criterion, score in scores.items():
            print(f"  {criterion}: {score:.2f}")
    print("\nVisualizing results...")
    visualize_results(avg_scores)
    print("Saving results to CSV...")
    results_df = pd.DataFrame(columns=["Model", "Criterion", "Score"])
    for model, criteria in avg_scores.items():
        for criterion, score in criteria.items():
            results_df = pd.concat([results_df, pd.DataFrame([{"Model": model, "Criterion": criterion, "Score": score}])],
                                   ignore_index=True)
    results_df.to_csv("llm_evaluation_results.csv", index=False)
    print("Results saved to llm_evaluation_results.csv")
    detailed_df = pd.DataFrame(columns=["Question", "Ground Truth"] + list(models.keys()))
    for i, question in enumerate(dataset["question"]):
        row = {
            "Question": question,
            "Ground Truth": dataset["ground_truth"][i]
        }
        for model_name in models.keys():
            row[model_name] = responses[model_name][i]
        detailed_df = pd.concat([detailed_df, pd.DataFrame([row])], ignore_index=True)
    detailed_df.to_csv("llm_response_comparison.csv", index=False)
    print("Detailed responses saved to llm_response_comparison.csv")
The main function orchestrates the entire evaluation workflow end to end: it builds the dataset, initializes the models, generates and scores responses, computes and displays average metrics, visualizes performance with charts, and finally exports both summary and detailed results as CSV files.
def pairwise_model_comparison(models, dataset, responses):
    """Compare two models side by side using an LLM as judge."""
    evaluator_model = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0)
    pairwise_template = """
    Question: {question}
    Response A: {response_a}
    Response B: {response_b}
    Which response better answers the user's question? Consider factors like accuracy,
    helpfulness, clarity, and completeness.
    First, analyze each response point by point. Then conclude with your choice of either:
    A is better, B is better, or They are equally good/bad.
    Your analysis:
    """
    pairwise_prompt = PromptTemplate(
        input_variables=["question", "response_a", "response_b"],
        template=pairwise_template
    )
    pairwise_chain = LLMChain(llm=evaluator_model, prompt=pairwise_prompt)
    model_names = list(models.keys())
    pairwise_results = {f"{model_a} vs {model_b}": [] for model_a in model_names for model_b in model_names if model_a != model_b}
    for i, question in enumerate(dataset["question"]):
        for j, model_a in enumerate(model_names):
            for model_b in model_names[j+1:]:
                response_a = responses[model_a][i]
                response_b = responses[model_b][i]
                if response_a != "Error generating response" and response_b != "Error generating response":
                    comparison_result = pairwise_chain.run(
                        question=question,
                        response_a=response_a,
                        response_b=response_b
                    )
                    key_ab = f"{model_a} vs {model_b}"
                    pairwise_results[key_ab].append({
                        "question": question,
                        "result": comparison_result
                    })
    return pairwise_results
This function runs head-to-head comparisons for each unique model pair by prompting a "gemini-2.0-flash-lite" judge to analyze and rank their responses on accuracy, clarity, and completeness, collecting per-question verdicts into a structured dictionary for side-by-side analysis.
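Because the judge's verdict comes back as free-form text, the simplest (and admittedly rough) way to summarize the head-to-head results is to search each analysis for the concluding phrase. The helper below is a minimal sketch under that assumption, not part of the pipeline above:

def summarize_pairwise(pairwise_results):
    """Roughly tally verdicts by scanning the judge's free-text analysis."""
    tallies = {}
    for comparison, results in pairwise_results.items():
        counts = {"A is better": 0, "B is better": 0, "tie/unclear": 0}
        for item in results:
            text = item["result"].lower()
            if "a is better" in text:
                counts["A is better"] += 1
            elif "b is better" in text:
                counts["B is better"] += 1
            else:
                counts["tie/unclear"] += 1
        tallies[comparison] = counts
    return tallies

Calling summarize_pairwise(pairwise_results) after the comparison step would then give a compact win/loss count per model pair.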
def enhanced_main():
    """Enhanced main function with additional evaluations."""
    print("Creating evaluation dataset...")
    dataset = create_evaluation_dataset()
    print("Setting up models...")
    models = setup_models()
    print("Generating responses...")
    responses = generate_responses(models, dataset)
    print("Evaluating responses...")
    evaluation_results = evaluate_responses(models, dataset, responses)
    print("Calculating average scores...")
    avg_scores = calculate_average_scores(evaluation_results)
    print("Average scores:")
    for model, scores in avg_scores.items():
        print(f"\n{model}:")
        for criterion, score in scores.items():
            print(f"  {criterion}: {score:.2f}")
    print("\nVisualizing results...")
    visualize_results(avg_scores)
    print("\nPerforming pairwise model comparison...")
    pairwise_results = pairwise_model_comparison(models, dataset, responses)
    print("\nPairwise comparison results:")
    for comparison, results in pairwise_results.items():
        print(f"\n{comparison}:")
        for i, result in enumerate(results[:2]):
            print(f"  Question {i+1}: {result['question']}")
            print(f"  Analysis: {result['result'][:100]}...")
    print("\nSaving all results...")
    results_df = pd.DataFrame(columns=["Model", "Criterion", "Score"])
    for model, criteria in avg_scores.items():
        for criterion, score in criteria.items():
            results_df = pd.concat([results_df, pd.DataFrame([{"Model": model, "Criterion": criterion, "Score": score}])],
                                   ignore_index=True)
    results_df.to_csv("llm_evaluation_results.csv", index=False)
    detailed_df = pd.DataFrame(columns=["Question", "Ground Truth"] + list(models.keys()))
    for i, question in enumerate(dataset["question"]):
        row = {
            "Question": question,
            "Ground Truth": dataset["ground_truth"][i]
        }
        for model_name in models.keys():
            row[model_name] = responses[model_name][i]
        detailed_df = pd.concat([detailed_df, pd.DataFrame([row])], ignore_index=True)
    detailed_df.to_csv("llm_response_comparison.csv", index=False)
    pairwise_df = pd.DataFrame(columns=["Comparison", "Question", "Analysis"])
    for comparison, results in pairwise_results.items():
        for result in results:
            pairwise_df = pd.concat([pairwise_df, pd.DataFrame([{
                "Comparison": comparison,
                "Question": result["question"],
                "Analysis": result["result"]
            }])], ignore_index=True)
    pairwise_df.to_csv("llm_pairwise_comparison.csv", index=False)
    print("All results saved to CSV files.")
The enhanced_main function extends the core evaluation pipeline by adding automated pairwise model comparisons, printing concise progress updates at each stage, and exporting three CSV files (summary scores, detailed responses, and pairwise analysis), leaving you with a complete, side-by-side evaluation workspace.
if __name__ == "__main__":
    enhanced_main()
Finally, this guard ensures that when the script is executed directly (rather than imported), it calls enhanced_main() to run the full evaluation and comparison pipeline end to end.
In conclusion, this tutorial has introduced a versatile and principled framework for evaluating and comparing the performance of LLMs, leveraging Google's Generative AI capabilities alongside the LangChain library for orchestration. Unlike simplistic accuracy-based metrics, the methodology presented here embraces the multidimensional nature of language understanding, combining granular criterion-based evaluation, structured model-to-model comparison, and intuitive visualizations. By capturing key attributes, including correctness, relevance, coherence, and conciseness, the evaluation pipeline enables practitioners to identify subtle yet significant performance differences that directly impact downstream applications. The outputs, including CSV-based reporting, radar plots, and bar charts, not only support transparent benchmarking but also guide data-driven decision-making in model selection and deployment.
The complete code for this tutorial is available in the accompanying Colab Notebook.
