
Add CLI tool for inspecting and creating a dataframe of the results on a given leaderboard #2174


Open
KennethEnevoldsen opened this issue Feb 26, 2025 · 29 comments · May be fixed by #2454
Labels
enhancement New feature or request

Comments

@KennethEnevoldsen (Contributor) commented Feb 26, 2025

Currently it is quite hard to "just" get a data frame for inspecting the results.

We could add a CLI like:

```
mteb create-table \
  --results results/  # where to fetch the results from
  --models model1,model2,model3  # which models to include in the table
  --benchmark "mteb(eng, v1)"  # which benchmark
```

This would also be useful when comparing models on the results repo.
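As a sketch of what such a command could do internally, the following assumes the results-repo layout (one JSON file per task under `results/<model>/`, with a `scores` mapping keyed by split, each entry carrying `hf_subset` and `main_score`); the function name `create_table` and the exact layout are assumptions, not existing mteb API:

```python
import json
from pathlib import Path

import pandas as pd


def create_table(results_dir: str, models: list[str]) -> pd.DataFrame:
    """Collect per-task main scores for the given models into a wide table."""
    rows = []
    for model in models:
        # assumed layout: results/<model>/**/<Task>.json
        for result_file in Path(results_dir, model).glob("**/*.json"):
            data = json.loads(result_file.read_text())
            for split, split_scores in data.get("scores", {}).items():
                for subset_scores in split_scores:
                    rows.append({
                        "model": model,
                        "task": data["task_name"],
                        "split": split,
                        "subset": subset_scores.get("hf_subset", "default"),
                        "score": subset_scores["main_score"],
                    })
    # one row per (model, subset, split), one column per task
    return pd.DataFrame(rows).pivot_table(
        index=["model", "subset", "split"], columns="task", values="score"
    )
```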

@KennethEnevoldsen KennethEnevoldsen added the enhancement New feature or request label Feb 26, 2025
@lifu-tu commented Feb 27, 2025

Hi Kenneth, it looks like it is not supported yet. I got the following error message:

```
'create-table' (choose from 'run', 'available_tasks', 'available_benchmarks', 'create_meta')
```

@KennethEnevoldsen (Contributor, Author)

Hi @lifu-tu, this is a suggestion for a feature, not an implemented one. I have made the main comment a bit clearer, but it sounds like this is something you would like as well :)

@ayush1298 (Contributor)

@KennethEnevoldsen @Samoed, I have thought of the following solution:
Use an implementation similar to make_leaderboard.py to get the scores. Based on that, we create a comparison table and save it at a path given by another argument, perhaps called save_path. Is that what we expect, or am I missing something here?

@Samoed (Member) commented Mar 16, 2025

Yes. I'd also suggest supporting multiple output formats (xlsx, csv, md, etc.). The benchmark argument should also be optional, and it should be possible to use a local path for building the table.
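Dispatching on the output file extension could be a small lookup over pandas writers. A minimal sketch, assuming the table is a pandas DataFrame; `save_table` and `WRITERS` are illustrative names, not mteb API:

```python
from pathlib import Path

import pandas as pd

# Map file suffix -> pandas writer (unbound methods, called below).
WRITERS = {
    ".csv": pd.DataFrame.to_csv,
    ".xlsx": pd.DataFrame.to_excel,   # requires openpyxl to be installed
    ".md": pd.DataFrame.to_markdown,  # requires tabulate to be installed
}


def save_table(df: pd.DataFrame, output_path: str) -> None:
    """Write the table in the format implied by the file extension."""
    suffix = Path(output_path).suffix
    if suffix not in WRITERS:
        raise ValueError(f"unsupported output format: {suffix!r}")
    WRITERS[suffix](df, output_path)
```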

@ayush1298 (Contributor) commented Mar 17, 2025

@Samoed, if benchmark is optional, then which results will we consider? If a benchmark is specified, aren't the tasks fixed, since we only consider the tasks in that benchmark and take the results of all models on those tasks?

@Samoed (Member) commented Mar 17, 2025

If a benchmark is specified, then yes; otherwise, all results can be taken from the results directory.

@ayush1298 (Contributor)

> If a benchmark is specified, then yes; otherwise, all results can be taken from the results directory.

So we will consider the results of all tasks for that particular model, right?
Also, what should the comparison table look like? Just a single row per model with the results of each mentioned model on the different tasks?

@Samoed (Member) commented Mar 17, 2025

> So we will consider the results of all tasks for that particular model, right?

Yes

> Also, what should the comparison table look like? Just a single row per model with the results of each mentioned model on the different tasks?

I think like:

```
            subset    split  task 1  task 2
model name  subset 1  test   50.55   50.55
```

But I think we can also set an aggregation level for split and subset.

@ayush1298 (Contributor) commented Mar 22, 2025

@Samoed What exactly is subset here? And can you clarify what exactly you mean by an aggregation level for split and subset?

Also, won't the split depend on the task and not the model, so it will be different for each task?

@Samoed (Member) commented Mar 22, 2025

> What exactly is subset here?

Tasks have subsets (eval_langs), which are subsets of the task, and all results have these subsets too.

> Also, won't the split depend on the task and not the model, so it will be different for each task?

Yes

> Can you clarify what exactly you mean by an aggregation level for split and subset?

I think we can have a parameter aggregation_level for this feature that averages scores at the selected level:

  • split - no aggregation
  • subset - aggregate subset results across splits, e.g. the mean score over the test and val splits.
  • task - aggregate results over subsets and splits
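The three levels above could be sketched as a groupby over a long-format results frame. The column names ("model", "task", "subset", "split", "score") and the function name `aggregate` are assumptions for illustration, not an existing mteb structure:

```python
import pandas as pd


def aggregate(long_df: pd.DataFrame, aggregation_level: str) -> pd.DataFrame:
    """Average scores up to the requested aggregation level."""
    keys = {
        "split": ["model", "task", "subset", "split"],  # no aggregation
        "subset": ["model", "task", "subset"],          # mean over splits
        "task": ["model", "task"],                      # mean over subsets and splits
    }[aggregation_level]
    return long_df.groupby(keys, as_index=False)["score"].mean()
```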

@ayush1298
Copy link
Contributor

> > Also, won't the split depend on the task and not the model, so it will be different for each task?
>
> Yes

So the split field will be different for each task; in that case we should not have a split column, right?

@Samoed (Member) commented Mar 22, 2025

> So the split field will be different for each task; in that case we should not have a split column, right?

No, we should have a split column if split aggregation is selected.

@ayush1298 (Contributor) commented Mar 22, 2025

> No, we should have a split column if split aggregation is selected.

I am still not able to understand it clearly.
Are you saying that if the aggregation is split, then for a given model we will have subsets, and for each subset we will have separate results per split, like the following:

```
            subset    split  task 1  task 2
model name  subset 1  test   50.55   50.55
                      dev    50.55   50.55
            subset 2  test   50.55   50.55
                      dev    50.55   50.55
```

For the subset aggregation level, we will take the aggregate over the different splits (dev, test), so we will have a table like:

```
            subset    task 1  task 2
model name  subset 1  50.55   50.55
            subset 2  50.55   50.55
```

For the task aggregation level, we will additionally aggregate over the subsets, so we will have a table like:

```
             task 1  task 2
model name1  50.55   50.55
model name2  50.55   50.55
```

@Samoed (Member) commented Mar 22, 2025

Yes, like this!

@ayush1298 (Contributor)

@Samoed, I have a few more doubts:

  1. Are the subsets you are talking about hf_subsets only?
  2. Are the subsets the same for all tasks? Or is it possible that some tasks have certain subsets while others have different ones? Because if the same subsets are not present for all tasks, how will we have the split aggregation_level?
  3. What exactly is the hierarchy here? Is it that for a particular model we have results for different tasks; for each task, results for different splits; and under each split, results for different subsets?
  4. Also, are the subsets the same across the different splits of a particular task?

@Samoed (Member) commented Mar 24, 2025

  1. Yes
  2. Tasks have different subsets. Aggregation should aggregate results inside each task, e.g. for each task the mean over all of that task's splits or subsets.
  3. Yes, that's right
  4. No, the subsets are different too

@ayush1298 (Contributor) commented Mar 24, 2025

Can you show me what the table would look like for the subset and split aggregation_level? I think the table I mentioned previously was incorrect, because I assumed there that all tasks have the same subsets.
Also, is the task aggregation_level table correct, and is it the same as the "Performance per task" table on the leaderboard?

@Samoed (Member) commented Mar 24, 2025

I see the problem now. We can create a multi-index (but it's a bit difficult to work with) or transpose it.

```
task          task 1    task 1    task 1    task 1
subset        subset 1  subset 1  subset 2  subset 2
split         test      dev       test      dev
model name 1  50        50        50        50
```

For the subset aggregation level:

```
task          task 1    task 1
subset        subset 1  subset 2
model name 1  50        50
```

For the task aggregation level:

```
task          task 1
model name 1  50
```

Example of the transposed table:

```
task    subset    split  model name 1  model name 2
task 1  subset 1  test   50            50
                  dev    50            50
        subset 2  test   50            50
                  dev    50            50
```

```
task    subset    model name 1  model name 2
task 1  subset 1  50            50
        subset 2  50            50
```

```
task    model name 1  model name 2
task 1  50            50
```

@KennethEnevoldsen What do you think?

@ayush1298 (Contributor)

@Samoed

> Is it that for a particular model we have results for different tasks; for each task, results for different splits; and under each split, results for different subsets?

If this is the hierarchy, then for the transposed tables, shouldn't the first table have the split column first and then the subset column, and the second table a split column only?

@Samoed (Member) commented Mar 24, 2025

It's not directly a transpose, just changing the axis. We have the hierarchy task -> subset -> split, and the table should follow it:

```python
import pandas as pd
import numpy as np

arrays = [
    ["task 1"] * 4,
    ["subset 1"] * 2 + ["subset 2"] * 2,
    ["test", "dev"] * 2,
]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=["task", "subset", "split"])
df = pd.DataFrame(np.random.randn(4), index=index, columns=["model 1"])
print(df)
```

```
                        model 1
task   subset   split
task 1 subset 1 test  -0.556665
                dev   -0.370966
       subset 2 test   0.107148
                dev    0.622765
```
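The aggregation levels discussed above then become plain groupby calls over that MultiIndex. A sketch with fixed numbers instead of random ones, using the same index names as the snippet above:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        ("task 1", "subset 1", "test"),
        ("task 1", "subset 1", "dev"),
        ("task 1", "subset 2", "test"),
        ("task 1", "subset 2", "dev"),
    ],
    names=["task", "subset", "split"],
)
df = pd.DataFrame([1.0, 2.0, 3.0, 4.0], index=index, columns=["model 1"])

# subset level: mean over splits within each (task, subset)
subset_level = df.groupby(level=["task", "subset"]).mean()
# task level: mean over all subsets and splits
task_level = df.groupby(level="task").mean()
```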

@ayush1298 (Contributor) commented Mar 24, 2025

> We have the hierarchy task -> subset -> split, and the table should follow it.

> Is it that for a particular model we have results for different tasks; for each task, results for different splits; and under each split, results for different subsets?

You mentioned that it was task -> split -> subset when I asked this before. And in all the result files from the results repo, I only see that hierarchy.

@Samoed (Member) commented Mar 24, 2025

> You mentioned that it was task -> split -> subset when I asked this before.

Sorry for the confusion, but the correct order should be task -> subset -> split.

> And in all the result files from the results repo, I only see that hierarchy.

Yes, but they have subsets inside.

@ayush1298 (Contributor)

> Yes, but they have subsets inside.

So they are inside the splits only. How then is the hierarchy task -> subset -> split?

@ayush1298 (Contributor)

@KennethEnevoldsen, I was working on this one. It would be good if you could give your opinion on the discussion above.

@KennethEnevoldsen (Contributor, Author)

I think the suggestion @Samoed proposed here is spot on.

So task -> subset -> split.

Anything I missed?

@ayush1298 (Contributor)

> I think the suggestion @Samoed proposed here is spot on.

Just wanted to confirm whether we should go with this or not.

> So task -> subset -> split.

I am just confused about this hierarchy. In the JSON files of any results, I see the hierarchy as task -> split -> subset, and I am referring to subset as hf_subset.

@KennethEnevoldsen (Contributor, Author)

Ah good point. Follow the JSON file:

task -> split -> subset

@ayush1298 (Contributor) commented Mar 27, 2025

> Ah good point. Follow the JSON file:
>
> task -> split -> subset

If that's the case, then the column order should also follow it: the first table should have the split column first and then the subset column, and the second table only a split column. So it will be as follows.

I am renaming this the subset aggregation level instead of split, as we have results for each subset under each split:

```
task   split  subset   model1  model2
task1  test   subset1  50.55   50.55
              subset2  50.55   50.55
       dev    subset1  50.55   50.55
              subset2  50.55   50.55
```

Renaming this the split aggregation level, as we will take the aggregate of the different subsets for each split. The table will look like:

```
task   subset   model1  model2
task1  subset1  50.55   50.55
       subset2  50.55   50.55
```

Basically, I am defining the aggregation level as the level up to which we want aggregated results. task means results aggregated up to each task, which can be obtained by aggregating the results of the subsets over all splits. The split aggregation level means we aggregate up to the split, i.e. the aggregate of all subsets under each split. And the subset aggregation level means we take no aggregate and show results for each subset under each split.

Let me know if this sounds correct
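Under the task -> split -> subset hierarchy settled on above, the tables in the previous comment could be produced from one long-format frame via pivot_table. A sketch only; the column names and sample values are assumptions:

```python
import pandas as pd

long_df = pd.DataFrame({
    "model":  ["model1"] * 4 + ["model2"] * 4,
    "task":   ["task1"] * 8,
    "split":  ["test", "test", "dev", "dev"] * 2,
    "subset": ["subset1", "subset2"] * 4,
    "score":  [50.55] * 8,
})

# subset aggregation level: no averaging, one row per (task, split, subset)
subset_table = long_df.pivot_table(
    index=["task", "split", "subset"], columns="model", values="score"
)
# split aggregation level: mean over subsets within each split
split_table = long_df.pivot_table(
    index=["task", "split"], columns="model", values="score"
)
# task aggregation level: mean over splits and subsets
task_table = long_df.pivot_table(index="task", columns="model", values="score")
```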

@KennethEnevoldsen (Contributor, Author)

Sounds correct!

@ayush1298 ayush1298 linked a pull request Mar 28, 2025 that will close this issue