
Clustering crashes: ValueError("Columns must be same length as key") - too little input text maybe? #362


Closed
simoncelinder opened this issue Jul 4, 2024 · 14 comments


@simoncelinder

simoncelinder commented Jul 4, 2024

Hi!

I was able to reproduce the example at: https://microsoft.github.io/graphrag/posts/get_started/

However, when I use the exact same method with some shorter fictional stories, it crashes during the clustering step.

Text input:
The input text is this gist, which I paste into a .txt file: https://gist.github.com/simoncelinder/0fbb9aaebed1e21801ab6c6e11a0dda5

Error:
When running python -m graphrag.index --root ./ragtest I get the error below (my added printouts suggest that the cluster_graph function receives an empty list as input):
[screenshot of the console error]

Maybe less relevant since it is downstream of this problem, but inspecting the log files suggests a shape mismatch:

21:21:04,371 graphrag.index.run ERROR error running workflow create_base_entity_graph
Traceback (most recent call last):
  File "/Users/simon/git/quick-exp/.pyenv/lib/python3.11/site-packages/graphrag/index/run.py", line 323, in run_pipeline
    result = await workflow.run(context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/git/quick-exp/.pyenv/lib/python3.11/site-packages/datashaper/workflow/workflow.py", line 369, in run
    timing = await self._execute_verb(node, context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/git/quick-exp/.pyenv/lib/python3.11/site-packages/datashaper/workflow/workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/git/quick-exp/.pyenv/lib/python3.11/site-packages/graphrag/index/verbs/graph/clustering/cluster_graph.py", line 102, in cluster_graph
    output_df[[level_to, to]] = pd.DataFrame(
    ~~~~~~~~~^^^^^^^^^^^^^^^^
  File "/Users/simon/git/quick-exp/.pyenv/lib/python3.11/site-packages/pandas/core/frame.py", line 4299, in __setitem__
    self._setitem_array(key, value)
  File "/Users/simon/git/quick-exp/.pyenv/lib/python3.11/site-packages/pandas/core/frame.py", line 4341, in _setitem_array
    check_key_length(self.columns, key, value)
  File "/Users/simon/git/quick-exp/.pyenv/lib/python3.11/site-packages/pandas/core/indexers/utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key

So to reproduce this problem, just put the example text in the input file (input/book.txt) and run exactly as in the guide. I tried tweaking some parameters in settings.yaml (various lengths, chunk sizes, max number of clusters, etc.), assuming the problem was the shorter input text, but without any luck so far.
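
For reference, the pandas error itself is easy to reproduce in isolation when the clustering result is empty. Below is a minimal sketch (not GraphRAG's actual code, just the same assignment pattern as in cluster_graph.py), where expanding a column of empty lists yields a zero-column frame that cannot be assigned to two target columns:

import pandas as pd

# Each row should hold a (level, community) pair, but the clustering
# returned nothing, so the column only contains empty lists.
output_df = pd.DataFrame({"communities": [[], []]})

# Expanding the empty lists yields a DataFrame with zero columns...
expanded = pd.DataFrame(output_df["communities"].tolist(), index=output_df.index)
print(expanded.shape)  # (2, 0)

# ...so assigning it to two target columns raises
# ValueError: Columns must be same length as key
output_df[["level", "community"]] = expanded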

My versions:

  • Python: 3.11.9
  • GraphRAG: 0.1.1
  • OS: macOS Sonoma 14.5

Any ideas? :-)

Thanks in advance - this seems like a really nice tool!

@eyast

eyast commented Jul 5, 2024

I was not able to reproduce the error you've faced. The process completed successfully on my end, and I can see the communities generated, with their summaries. In my setup I use GPT-4o; otherwise it's based on the standard library.
If you explore the folder structure under outputs, you can find artifacts, as well as interesting logs under outputs\{timestamp}\reports\.
For example, make sure that you are not hitting rate limits that prevent you from proceeding further down the pipeline.
You can also find the artifacts of each step as Parquet files in the artifacts folder. I use tad to explore the contents of those files.
PS: Writing part of the story in the first person is ingenious if you ask me - I wonder if you need to modify your prompt or entity configuration to make sure the LLM retrieves the narrator as an entity.

@simoncelinder
Author

OK, will try again with GPT-4o!

(The main idea is to test the capability of combining stories told from different perspectives, also “about” the central person, hence not always a first-person perspective. Thanks for the input about checking the prompt though 👍🏻.)

@simoncelinder
Author

simoncelinder commented Jul 5, 2024

Seems to work now. Maybe it was just my project being in some weird state, or the env variables having comments, extra variables in .env, or names that weren't exactly right. Works with all the defaults, including the default LLM. Thanks for the help! 💪🏻

@greenpillboi

So the short input was not a problem?

Can you please list the changes you made from the non-working config to the working config?

You mentioned using GPT-4o - was this a change you made? I'm getting this error trying to run with a locally hosted llama3 model.

@cd80

cd80 commented Jul 12, 2024

I had the same issue and I was using vLLM, like #357.
The problem got solved after pip install git+https://github.com/microsoft/graphrag

@SiNeiP

SiNeiP commented Jul 23, 2024

environment:
python: 3.11
system: ubuntu
model: llama3:8b & nomic
hardware: 3090 with 24 GB VRAM
tutorial followed: https://blog.stoeng.site/20240707.html

Running the command below to create the pipeline throws an error:
python -m graphrag.index --root ./ragpdf
error:
{"type": "error", "data": "Community Report Extraction Error", "stack": "Traceback (most recent call last):\n File "/home/fox/ai/graphrag/graphrag/index/graph/extractors/community_reports/community_reports_extractor.py", line 58, in call\n await self._llm(\n File "/home/fox/ai/graphrag/graphrag/llm/openai/json_parsing_llm.py", line 34, in call\n result = await self._delegate(input, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/fox/ai/graphrag/graphrag/llm/openai/openai_token_replacing_llm.py", line 37, in call\n return await self._delegate(input, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/fox/ai/graphrag/graphrag/llm/openai/openai_history_tracking_llm.py", line 33, in call\n output = await self._delegate(input, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/fox/ai/graphrag/graphrag/llm/base/caching_llm.py", line 104, in call\n result = await self._delegate(input, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/fox/ai/graphrag/graphrag/llm/base/rate_limiting_llm.py", line 177, in call\n result, start = await execute_with_retry()\n ^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/fox/ai/graphrag/graphrag/llm/base/rate_limiting_llm.py", line 159, in execute_with_retry\n async for attempt in retryer:\n File "/home/fox/rag2_env/lib/python3.11/site-packages/tenacity/asyncio/init.py", line 166, in anext\n do = await self.iter(retry_state=self._retry_state)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/fox/rag2_env/lib/python3.11/site-packages/tenacity/asyncio/init.py", line 153, in iter\n result = await action(retry_state)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/fox/rag2_env/lib/python3.11/site-packages/tenacity/_utils.py", line 99, in inner\n return call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File "/home/fox/rag2_env/lib/python3.11/site-package

Creating the pipeline fails on both Windows and Mac - help me, please!

@simoncelinder
Author

Hi!
Sorry for the late reply!

  • I did not change the LLM, only using the default
  • I made a new .env file from scratch that ONLY had the credential for GraphRAG, nothing else
  • Kept the short text in my example
  • Did not change any prompts
  • I made sure to nuke the entire folder every time I tried something new; I made a bash script to help with this (can provide it if someone wants it)

@simoncelinder
Author

simoncelinder commented Jul 24, 2024

Here is the bash script I use now, in case it's of use to someone:

echo deleting ragtest folder
rm -rf ./ragtest

echo creating ragtest folder
mkdir -p ./ragtest/input

echo copying input text file
cp book.txt ./ragtest/input/

echo initializing index / creating project
python -m graphrag.index --init --root ./ragtest

echo copying .env file with only GRAPHRAG_API_KEY=
cp .env ./ragtest/

echo running pipeline
python -m graphrag.index --root ./ragtest

@BaronHsu

Hello, I also encountered the same error; details below.

  • Description of the error:
39  aa8d2310a206001404282ddb3fd645aa  .C. The Project Gutenberg Literary Archive Fou...  ...       1200
40  0ddc17ea5e566006c000b4013f2181a5   charge a reasonable fee for copies of or prov...  ...       1200
41  cd4234ed6caba8f15d09a2e3ee604b2a  . The invalidity or\nunenforceability of any p...  ...       1055

[42 rows x 5 columns]
🚀 create_base_extracted_entities
                                        entity_graph
0  <graphml xmlns="http://graphml.graphdrawing.or...
🚀 create_summarized_entities
                                        entity_graph
0  <graphml xmlns="http://graphml.graphdrawing.or...
❌ create_base_entity_graph
None
⠇ GraphRAG Indexer 
├── Loading Input (text) - 1 files loaded (0 filtered) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
└── create_base_entity_graph
❌ Errors occurred during the pipeline run, see logs for more details.

Parameters:
LLM=Ollama (gemma2)
Embedding LLM = vllm (e5-mistral-7b-instruct)

Example 1:
Entity_types: [person, technology, mission, organization, location]
Text:
while Alex clenched his jaw, the buzz of frustration dull against the backdrop of ...
################
Output:
("entity"{tuple_delimiter}"<name>"{tuple_delimiter}"<Entity_types>"{tuple_delimiter}"Alex is a character who experiences frustration and is observant of the dynamics among other characters."){record_delimiter}

Then in the file at <workfolder>/cache/entity_extration/chat-xxxx
the expected format is: "entity"{tuple_delimiter}"<name>"{tuple_delimiter}"<Entity_types>"{tuple_delimiter}"
but I got:
[screenshot of the malformed entity extraction output]

  • Reason and solution:
    The LLM doesn't seem to understand what the prompt says. There may be various reasons, such as the LLM's max context window, or the service simply not working as expected.
    So I lowered the chunk size from 1200 to 300, and it works successfully (see the sketch of the settings.yaml change below).
    Here is my new chat-xxxx:
    [screenshot of the correctly formatted entity extraction output]
    Hope it will help you!
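
For reference, a rough sketch of the chunking section I changed in settings.yaml (key names as in the default config generated by graphrag init; treat this as an illustration rather than the full file):

chunks:
  size: 300      # lowered from the default 1200
  overlap: 100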

@kakalong136

Seems to work now. Maybe it was just my project being in some weird state, or the env variables having comments, extra variables in .env, or names that weren't exactly right. Works with all the defaults, including the default LLM. Thanks for the help! 💪🏻

How did you solve this problem?

@kakalong136

Sorry for the late reply!

  • I did not change the LLM, only using the default
  • I made a new .env file from scratch that ONLY had the credential for GraphRAG, nothing else
  • Kept the short text in my example
  • Did not change any prompts
  • I made sure to nuke the entire folder every time I tried something new; I made a bash script to help with this (can provide it if someone wants it)

Is that the method?

@simoncelinder
Author

Yes!

@shellchange

Echoing the fix above: try lowering the chunk size from 1200 to 300 - it works successfully.

@RajSharma1902

Did anyone use llama3.2 and face a similar issue?
Please tell me how you solved it.
