Swap in another LLM
There are several ways to run the same task with a different model. First, create a new chat object with that different model. Here’s the code for checking out Google Gemini 3 Flash Preview:
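The sketch below uses a placeholder model string; check chat_google_gemini()'s documentation for the current preview model name.
library(ellmer)
my_chat_gemini <- chat_google_gemini(model = "gemini-3-flash-preview")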
Then you can run the task in one of three ways.
1. Clone an existing task and add the chat as its solver with $set_solver():
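A sketch, assuming the earlier task object is named my_task and that $set_solver() accepts a solver built with generate(); both the name and that argument are my assumptions:
my_task_gemini <- my_task$clone()
my_task_gemini$set_solver(generate(my_chat_gemini))
my_task_gemini$eval()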
2. Clone an existing task and add the new chat as a solver when you run it:
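For example, assuming the task's $eval() method accepts a solver_chat argument (check the Task documentation):
my_task_gemini <- my_task$clone()
my_task_gemini$eval(solver_chat = my_chat_gemini)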
3. Create a new task from scratch, which allows you to include a new name:
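A sketch, with my_dataset standing in for the dataset used earlier in the article and a placeholder Opus model string for the judge chat:
my_task_gemini <- Task$new(
  dataset = my_dataset,
  solver = generate(my_chat_gemini),
  scorer = model_graded_qa(scorer_chat = chat_anthropic(model = "claude-opus-4-1")),
  name = "gemini-flash"
)
my_task_gemini$eval()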
Make sure you've set your API keys for each provider you want to test, unless you're using a platform that doesn't need them, such as local LLMs with ollama.
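For example, ellmer reads provider keys from environment variables, which you can set in .Renviron or for the current session (the variable names below are the ones I believe ellmer checks; see each chat function's help page):
Sys.setenv(GOOGLE_API_KEY = "your-gemini-key")
Sys.setenv(ANTHROPIC_API_KEY = "your-anthropic-key")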
View multiple task runs
Once you’ve run multiple tasks with different models, you can use the vitals_bind() function to combine the results:
# Assumes the earlier task object is named my_task, as in the sketches above
both_tasks <- vitals_bind(my_task, my_task_gemini)
Example of combined task results running each LLM with three epochs.
This returns an R data frame with columns for task, id, epoch, score, and metadata. The metadata column contains a data frame in each row with columns for input, target, result, solver_chat, scorer_chat, scorer_metadata, and scorer.
To flatten the input, target, and result columns and make them easier to scan and analyze, I un-nested the metadata column with:
library(tidyr)
both_tasks_wide <- both_tasks |>
  unnest_longer(metadata) |>
  unnest_wider(metadata)
I was then able to run a quick script to cycle through each bar-chart result code and see what it produced:
library(dplyr)
# Some results are surrounded by markdown and that markdown code needs to be removed or the R code won't run
# (the helper's body and the last two lines of the loop are my reconstruction; the original code wasn't fully preserved)
extract_code <- function(txt) {
  gsub("```\\{?[rR]\\}?|```", "", txt)
}
barchart_results <- both_tasks_wide |>
  filter(id == "barchart")
# Loop through each result
for (i in seq_len(nrow(barchart_results))) {
  code_to_run <- extract_code(barchart_results$result[i])
  print(eval(parse(text = code_to_run)))
}

Test local LLMs
This is one of my favorite use cases for vitals. Currently, models that fit into my PC’s 12GB of GPU RAM are rather limited. But I’m hopeful that small models will soon be useful for more tasks I’d like to do locally with sensitive data. Vitals makes it easy for me to test new LLMs on some of my specific use cases.
vitals (via ellmer) supports ollama, a popular way of running LLMs locally. To use ollama, download and install the ollama application, then run it either through the desktop app or from a terminal window. The syntax is ollama pull to download an LLM, or ollama run to both download the model and start a chat if you'd like to make sure it works on your system. For example: ollama pull ministral-3:14b.
The rollama R package lets you download a local LLM for ollama within R, as long as ollama is running. The syntax is rollama::pull_model("model-name"). For example, rollama::pull_model("ministral-3:14b"). You can test whether R can see ollama running on your system with rollama::ping_ollama().
I also pulled Google’s gemma3-12b and Microsoft’s phi4, then created tasks for each of them with the same dataset I used before. Note that as of this writing, you need the dev version of vitals to handle LLM names that include colons (the next CRAN version after 0.2.0 should handle that, though):
# Create chat objects
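# (sketch: the ollama model tags, the dataset name, and the judge model string are assumptions)
ministral_chat <- chat_ollama(model = "ministral-3:14b")
gemma_chat <- chat_ollama(model = "gemma3:12b")
phi_chat <- chat_ollama(model = "phi4")
# Create a task for each local model, reusing the earlier dataset and an Opus judge;
# the gemma and phi tasks follow the same pattern
ministral_task <- Task$new(
  dataset = my_dataset,
  solver = generate(ministral_chat),
  scorer = model_graded_qa(scorer_chat = chat_anthropic(model = "claude-opus-4-1")),
  name = "ministral"
)
ministral_task$eval()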
All three local LLMs nailed the sentiment analysis, and all did poorly on the bar chart. Some code produced bar charts but not with axes flipped and sorted in descending order; other code didn't work at all.

Results of one run of my dataset with five local LLMs.
R code for the results table above:
library(dplyr)
library(gt)
library(scales)
# Prepare the data
# Assumes the task results were combined with vitals_bind() into a data frame named all_tasks
# (the original object name wasn't preserved)
plot_data <- all_tasks |>
  rename(LLM = task, task = id) |>
  group_by(LLM, task) |>
  summarize(
    pct_correct = mean(score == "C") * 100,
    .groups = "drop"
  )
# Cell-coloring function for the gt table (reconstructed; the palette is my choice)
color_fn <- col_numeric(palette = c("#F8696B", "#FFEB84", "#63BE7B"), domain = c(0, 100))
plot_data |>
  tidyr::pivot_wider(names_from = task, values_from = pct_correct) |>
  gt() |>
  tab_header(title = "Percent Correct") |>
  cols_label(`sentiment-analysis` = html("sentiment-<br>analysis")) |>
  data_color(
    columns = -LLM,
    fn = color_fn
  )

It cost me 39 cents for Opus to judge these local LLM runs, not a bad bargain.
Extract structured data from text
Vitals has a special function for extracting structured data from plain text: generate_structured(). It requires both a chat object and a defined data type you want the LLM to return. As of this writing, you need the development version of vitals to use the generate_structured() function.
First, here’s my new dataset to extract topic, speaker name and affiliation, date, and start time from a plain-text description. The more complex version asks the LLM to convert the time zone to Eastern Time from Central European Time:
# The first, simpler prompt below is my reconstruction (its opening instruction wasn't preserved);
# it asks for the same fields as the second prompt but without the time-zone conversion
extract_dataset <- data.frame(
input = c(
"Extract the workshop topic, speaker name, speaker affiliation, date in 'yyyy-mm-dd' format, and start time from the text below. (TZ is the time zone). Assume the date year makes the most sense given that today's date is February 7, 2026. Return ONLY those entities in the format {topic}, {speaker name}, {date}, {start_time}. R Package Development in Positron\r\nThursday, January 15th, 18:00 - 20:00 CET (Rome, Berlin, Paris timezone) \r\nStephen D. Turner is an associate professor of data science at the University of Virginia School of Data Science. Prior to re-joining UVA he was a data scientist in national security and defense consulting, and later at a biotech company (Colossal, the de-extinction company) where he built and deployed scores of R packages.",
"Extract the workshop topic, speaker name, speaker affiliation, date in 'yyyy-mm-dd' format, and start time in Eastern Time zone in 'hh:mm ET' format from the text below. (TZ is the time zone). Assume the date year makes the most sense given that today's date is February 7, 2026. Return ONLY those entities in the format {topic}, {speaker name}, {date}, {start_time}. Convert the given time to Eastern Time if required. R Package Development in Positron\r\nThursday, January 15th, 18:00 - 20:00 CET (Rome, Berlin, Paris timezone) \r\nStephen D. Turner is an associate professor of data science at the University of Virginia School of Data Science. Prior to re-joining UVA he was a data scientist in national security and defense consulting, and later at a biotech company (Colossal, the de-extinction company) where he built and deployed scores of R packages. "
),
target = c(
"R Package Development in Positron, Stephen D. Turner, University of Virginia (or University of Virginia School of Data Science), 2026-01-15, 18:00. OR R Package Development in Positron, Stephen D. Turner, University of Virginia (or University of Virginia School of Data Science), 2026-01-15, 18:00 CET.",
"R Package Development in Positron, Stephen D. Turner, University of Virginia (or University of Virginia School of Data Science), 2026-01-15, 12:00 ET."
)
)
Below is an example of how to define a data structure using ellmer's type_object() function. Each of the arguments gives the name of a data field and its type (string, integer, and so on). I'm specifying I want to extract a workshop_topic, speaker_name, current_speaker_affiliation, date (as a string), and start_time (also as a string):
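The sketch below is my version; the field descriptions are my own wording:
my_object <- type_object(
  workshop_topic = type_string("The topic of the workshop"),
  speaker_name = type_string("The speaker's full name"),
  current_speaker_affiliation = type_string("The speaker's current affiliation"),
  date = type_string("The workshop date as a string in yyyy-mm-dd format"),
  start_time = type_string("The workshop start time as a string")
)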
Next, I'll use the chat objects I created earlier in a new structured data task, using Sonnet as the judge since grading is straightforward:
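A sketch of one such task: I'm assuming generate_structured() takes the solver chat plus the type definition (check the dev-version documentation for its exact arguments), and the Sonnet model string is a placeholder. The other chat objects can be swapped in the same way as before.
my_task_structured <- Task$new(
  dataset = extract_dataset,
  solver = generate_structured(solver_chat = gemma_chat, type = my_object),
  scorer = model_graded_qa(scorer_chat = chat_anthropic(model = "claude-sonnet-4-5")),
  name = "gemma-structured"
)
my_task_structured$eval()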
It cost me 16 cents for Sonnet to judge 15 evaluation runs of two queries and results each.
Here are the results:

How various LLMs fared on extracting structured data from text.
I was surprised that a local model, Gemma, scored 100%. I wanted to see if that was a fluke, so I ran the eval another 17 times for a total of 20. Weirdly, it missed on two of the 20 basic extractions by giving the title as “R Package Development” instead of “R Package Development in Positron,” but scored 100% on the more complex ones. I asked Claude Opus about that, and it said my “easier” task was more ambiguous for a less capable model to understand. Important takeaway: Be as specific as possible in your instructions!
Still, Gemma’s results were good enough on this task for me to consider testing it on some real-world entity extraction tasks. And I wouldn’t have known that without running automated evaluations on multiple local LLMs.
Conclusion
If you’re used to writing code that gives predictable, repeatable responses, a script that generates different answers each time it runs can feel unsettling. While there are no guarantees when it comes to predicting an LLM’s next response, evals can increase your confidence in your code by letting you run structured tests with measurable responses, instead of testing via manual, ad-hoc queries. And, as the model landscape keeps evolving, you can stay current by testing how newer LLMs perform—not on generic benchmarks, but on the tasks that matter most to you.

