
Add CLI tool for inspecting and creating a dataframe of the results on a given leaderboard #2174


Open
KennethEnevoldsen opened this issue Feb 26, 2025 · 29 comments · May be fixed by #2454
Labels
enhancement New feature or request

Comments

@KennethEnevoldsen (Contributor) commented Feb 26, 2025

Currently it is quite hard to "just" get a data frame for inspecting the results.

We could add a CLI like:

```
mteb create-table \
  --results results/  # where to fetch the results from
  --models model1,model2,model3  # which models to include in the table
  --benchmark "mteb(eng, v1)"  # which benchmark
```

This would also be useful when comparing models on the results repo.
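As a sketch of what such a command could do internally, the following assumes the results-repo layout (one JSON file per task under `results/<model>/`, with a `scores` mapping keyed by split, each entry carrying `hf_subset` and `main_score`); the function name `create_table` and the exact layout are assumptions, not existing mteb API:

```python
import json
from pathlib import Path

import pandas as pd


def create_table(results_dir: str, models: list[str]) -> pd.DataFrame:
    """Collect per-task main scores for the given models into a wide table."""
    rows = []
    for model in models:
        # assumed layout: results/<model>/**/<Task>.json
        for result_file in Path(results_dir, model).glob("**/*.json"):
            data = json.loads(result_file.read_text())
            for split, split_scores in data.get("scores", {}).items():
                for subset_scores in split_scores:
                    rows.append({
                        "model": model,
                        "task": data["task_name"],
                        "split": split,
                        "subset": subset_scores.get("hf_subset", "default"),
                        "score": subset_scores["main_score"],
                    })
    # one row per (model, subset, split), one column per task
    return pd.DataFrame(rows).pivot_table(
        index=["model", "subset", "split"], columns="task", values="score"
    )
```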

@KennethEnevoldsen KennethEnevoldsen added the enhancement New feature or request label Feb 26, 2025
@lifu-tu commented Feb 27, 2025

Hi Kenneth, it looks like it is not supported yet. I got the following error message:

```
'create-table' (choose from 'run', 'available_tasks', 'available_benchmarks', 'create_meta')
```

@KennethEnevoldsen (Contributor, Author)

Hi @lifu-tu, this is a suggestion for a feature, not an implemented one. I have made the main comment a bit clearer, but it sounds like this is something you would like as well :)

@ayush1298 (Contributor)

@KennethEnevoldsen @Samoed, I have thought of the following solution:
Use an implementation similar to make_leaderboard.py to get the scores. Based on that, we create a comparison table and save it at a path given by another argument, perhaps called save_path. Is that what we expect, or am I missing something here?

@Samoed (Member) commented Mar 16, 2025

Yes. I'd also suggest supporting multiple output formats (xlsx, csv, md, etc.). The benchmark argument should also be optional, and it should be possible to use a local path for building the table.
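Dispatching on the output file extension could be a small lookup over pandas writers. A minimal sketch, assuming the table is a pandas DataFrame; `save_table` and `WRITERS` are illustrative names, not mteb API:

```python
from pathlib import Path

import pandas as pd

# Map file suffix -> pandas writer (unbound methods, called below).
WRITERS = {
    ".csv": pd.DataFrame.to_csv,
    ".xlsx": pd.DataFrame.to_excel,   # requires openpyxl to be installed
    ".md": pd.DataFrame.to_markdown,  # requires tabulate to be installed
}


def save_table(df: pd.DataFrame, output_path: str) -> None:
    """Write the table in the format implied by the file extension."""
    suffix = Path(output_path).suffix
    if suffix not in WRITERS:
        raise ValueError(f"unsupported output format: {suffix!r}")
    WRITERS[suffix](df, output_path)
```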

@ayush1298 (Contributor) commented Mar 17, 2025

@Samoed, if benchmark is optional, then which results will we consider? If a benchmark is specified, aren't the tasks fixed, since we only consider the tasks in that benchmark and take the results of all models on those tasks?

@Samoed (Member) commented Mar 17, 2025

If a benchmark is specified, then yes; otherwise, all results can be taken from the results directory.

@ayush1298 (Contributor)

> If a benchmark is specified, then yes; otherwise, all results can be taken from the results directory.

So we will consider the results of all tasks for that particular model, right?
Also, what should the comparison table look like? Just a single row per model with the results of each mentioned model on the different tasks?

@Samoed (Member) commented Mar 17, 2025

> So we will consider the results of all tasks for that particular model, right?

Yes

> Also, what should the comparison table look like? Just a single row per model with the results of each mentioned model on the different tasks?

I think like:

```
            subset    split  task 1  task 2
model name  subset 1  test   50.55   50.55
```

But I think we can also set an aggregation level for split and subset.

@ayush1298 (Contributor) commented Mar 22, 2025

@Samoed What exactly is subset here? And can you clarify what exactly you mean by an aggregation level for split and subset?

Also, won't the split depend on the task and not the model, so it will be different for each task?

@Samoed (Member) commented Mar 22, 2025

> What exactly is subset here?

Tasks have subsets (eval_langs), which are subsets of the task, and all results have these subsets too.

> Also, won't the split depend on the task and not the model, so it will be different for each task?

Yes

> Can you clarify what exactly you mean by an aggregation level for split and subset?

I think we can have a parameter aggregation_level for this feature that averages scores at the selected level:

  • split - no aggregation
  • subset - aggregate subset results across splits, e.g. the mean score over the test and val splits.
  • task - aggregate results over subsets and splits
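The three levels above could be sketched as a groupby over a long-format results frame. The column names ("model", "task", "subset", "split", "score") and the function name `aggregate` are assumptions for illustration, not an existing mteb structure:

```python
import pandas as pd


def aggregate(long_df: pd.DataFrame, aggregation_level: str) -> pd.DataFrame:
    """Average scores up to the requested aggregation level."""
    keys = {
        "split": ["model", "task", "subset", "split"],  # no aggregation
        "subset": ["model", "task", "subset"],          # mean over splits
        "task": ["model", "task"],                      # mean over subsets and splits
    }[aggregation_level]
    return long_df.groupby(keys, as_index=False)["score"].mean()
```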

@ayush1298
Copy link
Contributor

> > Also, won't the split depend on the task and not the model, so it will be different for each task?
>
> Yes

So the split field will be different for each task; in that case we should not have a split column, right?

@Samoed (Member) commented Mar 22, 2025

> So the split field will be different for each task; in that case we should not have a split column, right?

No, we should have a split column if split aggregation is selected.

@ayush1298 (Contributor) commented Mar 22, 2025

> No, we should have a split column if split aggregation is selected.

I am still not able to understand it clearly.
Are you saying that if the aggregation is split, then for a given model we will have subsets, and for each subset we will have separate results per split, like the following:

```
            subset    split  task 1  task 2
model name  subset 1  test   50.55   50.55
                      dev    50.55   50.55
            subset 2  test   50.55   50.55
                      dev    50.55   50.55
```

For the subset aggregation level, we will take the aggregate over the different splits (dev, test), so we will have a table like:

```
            subset    task 1  task 2
model name  subset 1  50.55   50.55
            subset 2  50.55   50.55
```

For the task aggregation level, we will additionally aggregate over the subsets, so we will have a table like:

```
             task 1  task 2
model name1  50.55   50.55
model name2  50.55   50.55
```

@Samoed (Member) commented Mar 22, 2025

Yes, like this!

@ayush1298 (Contributor)

@Samoed, I have a few more doubts:

  1. Are the subsets you are talking about hf_subsets only?
  2. Are the subsets the same for all tasks? Or is it possible that some tasks have certain subsets while others have different ones? Because if the same subsets are not present for all tasks, how will we have the split aggregation_level?
  3. What exactly is the hierarchy here? Is it that for a particular model we have results for different tasks; for each task, results for different splits; and under each split, results for different subsets?
  4. Also, are the subsets the same across the different splits of a particular task?

@Samoed (Member) commented Mar 24, 2025

  1. Yes
  2. Tasks have different subsets. Aggregation should aggregate results inside each task, e.g. for each task the mean over all of that task's splits or subsets.
  3. Yes, that's right
  4. No, the subsets are different too

@ayush1298 (Contributor) commented Mar 24, 2025

Can you show me what the table would look like for the subset and split aggregation_level? I think the table I mentioned previously was incorrect, because I assumed there that all tasks have the same subsets.
Also, is the task aggregation_level table correct, and is it the same as the "Performance per task" table on the leaderboard?

@Samoed (Member) commented Mar 24, 2025

I see the problem now. We can create a multi-index (but it's a bit difficult to work with) or transpose it.

```
task          task 1    task 1    task 1    task 1
subset        subset 1  subset 1  subset 2  subset 2
split         test      dev       test      dev
model name 1  50        50        50        50
```

For the subset aggregation level:

```
task          task 1    task 1
subset        subset 1  subset 2
model name 1  50        50
```

For the task aggregation level:

```
task          task 1
model name 1  50
```

Example of the transposed table:

```
task    subset    split  model name 1  model name 2
task 1  subset 1  test   50            50
                  dev    50            50
        subset 2  test   50            50
                  dev    50            50
```

```
task    subset    model name 1  model name 2
task 1  subset 1  50            50
        subset 2  50            50
```

```
task    model name 1  model name 2
task 1  50            50
```

@KennethEnevoldsen What do you think?

@ayush1298 (Contributor)

@Samoed

> Is it that for a particular model we have results for different tasks; for each task, results for different splits; and under each split, results for different subsets?

If this is the hierarchy, then for the transposed tables, shouldn't the first table have the split column first and then the subset column, and the second table a split column only?

@Samoed (Member) commented Mar 24, 2025

It's not directly a transpose, just changing the axis. We have the hierarchy task -> subset -> split, and the table should follow it:

```python
import pandas as pd
import numpy as np

arrays = [
    ["task 1"] * 4,
    ["subset 1"] * 2 + ["subset 2"] * 2,
    ["test", "dev"] * 2,
]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=["task", "subset", "split"])
df = pd.DataFrame(np.random.randn(4), index=index, columns=["model 1"])
print(df)
```

```
                        model 1
task   subset   split
task 1 subset 1 test  -0.556665
                dev   -0.370966
       subset 2 test   0.107148
                dev    0.622765
```
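The aggregation levels discussed above then become plain groupby calls over that MultiIndex. A sketch with fixed numbers instead of random ones, using the same index names as the snippet above:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        ("task 1", "subset 1", "test"),
        ("task 1", "subset 1", "dev"),
        ("task 1", "subset 2", "test"),
        ("task 1", "subset 2", "dev"),
    ],
    names=["task", "subset", "split"],
)
df = pd.DataFrame([1.0, 2.0, 3.0, 4.0], index=index, columns=["model 1"])

# subset level: mean over splits within each (task, subset)
subset_level = df.groupby(level=["task", "subset"]).mean()
# task level: mean over all subsets and splits
task_level = df.groupby(level="task").mean()
```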

@ayush1298 (Contributor) commented Mar 24, 2025

> We have the hierarchy task -> subset -> split, and the table should follow it.

> Is it that for a particular model we have results for different tasks; for each task, results for different splits; and under each split, results for different subsets?

You mentioned that it was task -> split -> subset when I asked this before. And in all the result files from the results repo, I only see that hierarchy.

@Samoed (Member) commented Mar 24, 2025

> You mentioned that it was task -> split -> subset when I asked this before.

Sorry for the confusion, but the correct order should be task -> subset -> split.

> And in all the result files from the results repo, I only see that hierarchy.

Yes, but they have subsets inside.

@ayush1298 (Contributor)

> Yes, but they have subsets inside.

So they are inside the splits only. How then is the hierarchy task -> subset -> split?

@ayush1298 (Contributor)

@KennethEnevoldsen, I was working on this one. It would be good if you could give your opinion on the discussion above.

@KennethEnevoldsen (Contributor, Author)

I think the suggestion @Samoed proposed here is spot on.

So task -> subset -> split.

Anything I missed?

@ayush1298 (Contributor)

> I think the suggestion @Samoed proposed here is spot on.

Just wanted to confirm whether we should go with this or not.

> So task -> subset -> split.

I am just confused about this hierarchy. In the JSON files of any results, I see the hierarchy as task -> split -> subset, and I am referring to subset as hf_subset.

@KennethEnevoldsen (Contributor, Author)

Ah good point. Follow the JSON file:

task -> split -> subset

@ayush1298 (Contributor) commented Mar 27, 2025

> Ah good point. Follow the JSON file:
>
> task -> split -> subset

If that's the case, then the column order should also follow it: the first table should have the split column first and then the subset column, and the second table only a split column. So it will be as follows.

I am renaming this the subset aggregation level instead of split, as we have results for each subset under each split:

```
task   split  subset   model1  model2
task1  test   subset1  50.55   50.55
              subset2  50.55   50.55
       dev    subset1  50.55   50.55
              subset2  50.55   50.55
```

Renaming this the split aggregation level, as we will take the aggregate of the different subsets for each split. The table will look like:

```
task   subset   model1  model2
task1  subset1  50.55   50.55
       subset2  50.55   50.55
```

Basically, I am defining the aggregation level as the level up to which we want aggregated results. task means results aggregated up to each task, which can be obtained by aggregating the results of the subsets over all splits. The split aggregation level means we aggregate up to the split, i.e. the aggregate of all subsets under each split. And the subset aggregation level means we take no aggregate and show results for each subset under each split.

Let me know if this sounds correct
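Under the task -> split -> subset hierarchy settled on above, the tables in the previous comment could be produced from one long-format frame via pivot_table. A sketch only; the column names and sample values are assumptions:

```python
import pandas as pd

long_df = pd.DataFrame({
    "model":  ["model1"] * 4 + ["model2"] * 4,
    "task":   ["task1"] * 8,
    "split":  ["test", "test", "dev", "dev"] * 2,
    "subset": ["subset1", "subset2"] * 4,
    "score":  [50.55] * 8,
})

# subset aggregation level: no averaging, one row per (task, split, subset)
subset_table = long_df.pivot_table(
    index=["task", "split", "subset"], columns="model", values="score"
)
# split aggregation level: mean over subsets within each split
split_table = long_df.pivot_table(
    index=["task", "split"], columns="model", values="score"
)
# task aggregation level: mean over splits and subsets
task_table = long_df.pivot_table(index="task", columns="model", values="score")
```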

@KennethEnevoldsen (Contributor, Author)

Sounds correct!

@ayush1298 ayush1298 linked a pull request Mar 28, 2025 that will close this issue