---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# vitals <a href="https://vitals.tidyverse.org"><img src="man/figures/logo.png" align="right" height="240" alt="vitals website" /></a>
<!-- badges: start -->
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[![CRAN status](https://www.r-pkg.org/badges/version/vitals)](https://CRAN.R-project.org/package=vitals)
[![R-CMD-check](https://github.com/tidyverse/vitals/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/tidyverse/vitals/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->
vitals is a framework for large language model evaluation in R. It's specifically aimed at [ellmer](https://ellmer.tidyverse.org/) users who want to measure the effectiveness of their LLM-based apps.
The package is an R port of the widely adopted Python framework [Inspect](https://inspect.ai-safety-institute.org.uk/). While vitals doesn't integrate with Inspect directly, it writes evaluation logs in the same file format, so users can browse results with the [Inspect log viewer](https://inspect.ai-safety-institute.org.uk/log-viewer.html) and have an on-ramp to Inspect should they need it.
> **Important**
>
> 🚧 Under construction! 🚧
>
> vitals is highly experimental and much of its documentation is aspirational.
## Installation
You can install the development version of vitals using:
```r
pak::pak("tidyverse/vitals")
```
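If you don't already have pak installed, it's available from CRAN:

```r
# One-time setup: install pak from CRAN
install.packages("pak")
```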
## Example
LLM evaluation with vitals is composed of two main steps.
```{r}
library(vitals)
library(ellmer)
library(tibble)
```
1) First, create an evaluation **task** with the `Task$new()` method.
```{r}
#| label: tsk-new
simple_addition <- tibble(
  input = c("What's 2+2?", "What's 2+3?", "What's 2+4?"),
  target = c("4", "5", "6")
)

tsk <- Task$new(
  dataset = simple_addition,
  solver = generate(chat_anthropic(model = "claude-3-7-sonnet-latest")),
  scorer = model_graded_qa()
)
```
Tasks are composed of three main components:
* **Datasets** are data frames with, at minimum, columns `input` and `target`. `input` represents some question or problem, and `target` gives the target response.
* **Solvers** are functions that take `input` and return some value approximating `target`, likely wrapping ellmer chats. `generate()` is the simplest solver in vitals: it just passes each `input` to the chat's `$chat()` method and returns the result as-is (see the sketch after this list).
* **Scorers** compare the solver's output with `target`, evaluating how well the solver addressed the `input`.
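To make the solver idea concrete, here's a rough sketch of what `generate()` does conceptually. This is illustrative only, not the package's actual implementation: it just maps each `input` through an ellmer chat's `$chat()` method.

```r
# Illustrative sketch only -- not the actual internals of vitals::generate().
# Each input is sent to a fresh clone of the supplied ellmer chat, and the
# model's text response is returned as-is.
solve_by_hand <- function(inputs, chat) {
  vapply(
    inputs,
    function(input) chat$clone()$chat(input),
    character(1)
  )
}
```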
2) Evaluate the task.
```{r}
#| label: tsk-eval
#| eval: false
tsk$eval()
```
`$eval()` will run the solver, run the scorer, and then situate the results in a persistent log file that can be explored interactively with the Inspect log viewer.
```{r}
#| label: tsk-view
#| echo: false
#| fig-alt: "A screenshot of the Inspect log viewer, an interactive app displaying information on the 3 samples evaluated in this eval."
knitr::include_graphics("man/figures/log_viewer.png")
```
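By default, vitals chooses where these log files live. As a sketch, assuming the location is controlled by a `VITALS_LOG_DIR` environment variable (an assumption here; see the package documentation for the supported setting), you could keep logs inside your project:

```r
# Assumption: vitals reads VITALS_LOG_DIR to decide where eval logs are written.
# Setting it before calling $eval() keeps logs alongside the project.
Sys.setenv(VITALS_LOG_DIR = "vitals-logs")
```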
Any arguments to the solver or scorer can be passed to `$eval()`, allowing for straightforward parameterization of tasks. For example, if I wanted to evaluate `chat_openai()` on this task rather than `chat_anthropic()`, I could write:
```{r}
#| label: tsk-openai
#| eval: false
tsk_openai <- tsk$clone()
tsk_openai$eval(solver_chat = chat_openai(model = "gpt-4o"))
```
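With both tasks evaluated, their per-sample results can be compared side by side. The sketch below assumes vitals exports a `vitals_bind()` helper for stacking evaluated tasks into one tibble; check the package reference if your version differs.

```r
# Sketch: assumes a vitals_bind() helper that combines evaluated tasks into
# a single tibble, with a column identifying which task each row came from.
results <- vitals_bind(claude = tsk, openai = tsk_openai)
results
```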
For an applied example, see the "Getting started with vitals" vignette at `vignette("vitals", package = "vitals")`.