
Clustering crashes: ValueError("Columns must be same length as key") - too little input text maybe? #362


Closed
simoncelinder opened this issue Jul 4, 2024 · 14 comments


@simoncelinder

simoncelinder commented Jul 4, 2024

Hi!

I was able to reproduce the example at: https://microsoft.github.io/graphrag/posts/get_started/

However, when I use the exact same method with some shorter fictional stories, it crashes during the clustering step.

Text input:
The input text is this gist, which I paste into a .txt file: https://gist.github.com/simoncelinder/0fbb9aaebed1e21801ab6c6e11a0dda5

Error:
When running python -m graphrag.index --root ./ragtest I get the error below (my added printouts suggest that the cluster_graph function receives an empty list as input):
[screenshot of the console error]

Maybe less relevant since it is downstream of this problem, but inspecting the log files suggests a shape mismatch:

21:21:04,371 graphrag.index.run ERROR error running workflow create_base_entity_graph
Traceback (most recent call last):
  File "/Users/simon/git/quick-exp/.pyenv/lib/python3.11/site-packages/graphrag/index/run.py", line 323, in run_pipeline
    result = await workflow.run(context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/git/quick-exp/.pyenv/lib/python3.11/site-packages/datashaper/workflow/workflow.py", line 369, in run
    timing = await self._execute_verb(node, context, callbacks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/git/quick-exp/.pyenv/lib/python3.11/site-packages/datashaper/workflow/workflow.py", line 410, in _execute_verb
    result = node.verb.func(**verb_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/simon/git/quick-exp/.pyenv/lib/python3.11/site-packages/graphrag/index/verbs/graph/clustering/cluster_graph.py", line 102, in cluster_graph
    output_df[[level_to, to]] = pd.DataFrame(
    ~~~~~~~~~^^^^^^^^^^^^^^^^
  File "/Users/simon/git/quick-exp/.pyenv/lib/python3.11/site-packages/pandas/core/frame.py", line 4299, in __setitem__
    self._setitem_array(key, value)
  File "/Users/simon/git/quick-exp/.pyenv/lib/python3.11/site-packages/pandas/core/frame.py", line 4341, in _setitem_array
    check_key_length(self.columns, key, value)
  File "/Users/simon/git/quick-exp/.pyenv/lib/python3.11/site-packages/pandas/core/indexers/utils.py", line 390, in check_key_length
    raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key

So to reproduce this problem, just put the example text in the input file (input/book.txt) and run exactly as in the guide. I tried tweaking some parameters in settings.yaml (various lengths, chunk sizes, max number of clusters, etc.), assuming the problem was the shorter input text, but without any luck so far.
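
For reference, the pandas error itself is easy to reproduce in isolation when the clustering result is empty. Below is a minimal sketch (not GraphRAG's actual code, just the same assignment pattern as in cluster_graph.py), where expanding a column of empty lists yields a zero-column frame that cannot be assigned to two target columns:

import pandas as pd

# Each row should hold a (level, community) pair, but the clustering
# returned nothing, so the column only contains empty lists.
output_df = pd.DataFrame({"communities": [[], []]})

# Expanding the empty lists yields a DataFrame with zero columns...
expanded = pd.DataFrame(output_df["communities"].tolist(), index=output_df.index)
print(expanded.shape)  # (2, 0)

# ...so assigning it to two target columns raises
# ValueError: Columns must be same length as key
output_df[["level", "community"]] = expanded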

My versions:

  • Python: 3.11.9
  • GraphRAG: 0.1.1
  • OS: macOS Sonoma 14.5

Any ideas? :-)

Thanks in advance - this seems like a really nice tool!

@eyast

eyast commented Jul 5, 2024

I was not able to reproduce the error you've faced. The process completed successfully on my end, and I can see the communities generated, with their summaries. In my setup I use GPT-4o; otherwise it's based on the standard library.
If you explore the folder structure under outputs, you can find artifacts, as well as interesting logs under outputs\{timestamp}\reports\.
For example, make sure that you are not hitting rate limits that prevent you from proceeding further down the pipeline.
You can also find the artifacts of each step as Parquet files in the artifacts folder. I use tad to explore the contents of those files.
PS: Writing part of the story in the first person is ingenious if you ask me - I wonder if you need to modify your prompt or entity configuration to make sure the LLM retrieves the narrator as an entity.

@simoncelinder
Author

OK, will try again with GPT-4o!

(The main idea is to test the capability of combining stories told from different perspectives, also “about” the central person, hence not always a first-person perspective. Thanks for the input about checking the prompt though 👍🏻.)

@simoncelinder
Author

simoncelinder commented Jul 5, 2024

Seems to work now. Maybe it was just my project being in some weird state, or the env variables having comments, extra variables in .env, or names that weren't exactly right. Works with all the defaults, including the default LLM. Thanks for the help! 💪🏻

@greenpillboi

So the short input was not a problem?

Can you please list the changes you made from the non-working config to the working config?

You mentioned using GPT-4o - was this a change you made? I'm getting this error trying to run with a locally hosted llama3 model.

@cd80

cd80 commented Jul 12, 2024

I had the same issue and I was using vLLM, like #357.
The problem got solved after pip install git+https://github.com/microsoft/graphrag

@SiNeiP

SiNeiP commented Jul 23, 2024

environment:
python: 3.11
system: ubuntu
model: llama3:8b & nomic
hardware: 3090 with 24 GB VRAM
tutorial followed: https://blog.stoeng.site/20240707.html

Running the command below to create the pipeline throws an error:
python -m graphrag.index --root ./ragpdf
error:
{"type": "error", "data": "Community Report Extraction Error", "stack": "Traceback (most recent call last):\n File "/home/fox/ai/graphrag/graphrag/index/graph/extractors/community_reports/community_reports_extractor.py", line 58, in call\n await self._llm(\n File "/home/fox/ai/graphrag/graphrag/llm/openai/json_parsing_llm.py", line 34, in call\n result = await self._delegate(input, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/fox/ai/graphrag/graphrag/llm/openai/openai_token_replacing_llm.py", line 37, in call\n return await self._delegate(input, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/fox/ai/graphrag/graphrag/llm/openai/openai_history_tracking_llm.py", line 33, in call\n output = await self._delegate(input, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/fox/ai/graphrag/graphrag/llm/base/caching_llm.py", line 104, in call\n result = await self._delegate(input, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/fox/ai/graphrag/graphrag/llm/base/rate_limiting_llm.py", line 177, in call\n result, start = await execute_with_retry()\n ^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/fox/ai/graphrag/graphrag/llm/base/rate_limiting_llm.py", line 159, in execute_with_retry\n async for attempt in retryer:\n File "/home/fox/rag2_env/lib/python3.11/site-packages/tenacity/asyncio/init.py", line 166, in anext\n do = await self.iter(retry_state=self._retry_state)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/fox/rag2_env/lib/python3.11/site-packages/tenacity/asyncio/init.py", line 153, in iter\n result = await action(retry_state)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/fox/rag2_env/lib/python3.11/site-packages/tenacity/_utils.py", line 99, in inner\n return call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File "/home/fox/rag2_env/lib/python3.11/site-package

Creating the pipeline fails on both Windows and Mac - help me, please!

@simoncelinder
Author

Hi!
Sorry for the late reply!

  • I did not change the LLM, only using the default
  • I made a new .env file from scratch that ONLY had the credential for GraphRAG, nothing else
  • Kept the short text in my example
  • Did not change any prompts
  • I made sure to nuke the entire folder every time I tried something new; I made a bash script to help with this (can provide it if someone wants it)

@simoncelinder
Author

simoncelinder commented Jul 24, 2024

Here is the bash script I use now, in case it's of use to someone:

echo deleting ragtest folder
rm -rf ./ragtest

echo creating ragtest folder
mkdir -p ./ragtest/input

echo copying input text file
cp book.txt ./ragtest/input/

echo initializing index / creating project
python -m graphrag.index --init --root ./ragtest

echo copying .env file with only GRAPHRAG_API_KEY=
cp .env ./ragtest/

echo running pipeline
python -m graphrag.index --root ./ragtest

@BaronHsu

Hello, I also encountered the same error; details below.

  • Description of the error:
39  aa8d2310a206001404282ddb3fd645aa  .C. The Project Gutenberg Literary Archive Fou...  ...       1200
40  0ddc17ea5e566006c000b4013f2181a5   charge a reasonable fee for copies of or prov...  ...       1200
41  cd4234ed6caba8f15d09a2e3ee604b2a  . The invalidity or\nunenforceability of any p...  ...       1055

[42 rows x 5 columns]
🚀 create_base_extracted_entities
                                        entity_graph
0  <graphml xmlns="http://graphml.graphdrawing.or...
🚀 create_summarized_entities
                                        entity_graph
0  <graphml xmlns="http://graphml.graphdrawing.or...
❌ create_base_entity_graph
None
⠇ GraphRAG Indexer 
├── Loading Input (text) - 1 files loaded (0 filtered) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 0:00:00
├── create_base_text_units
├── create_base_extracted_entities
├── create_summarized_entities
└── create_base_entity_graph
❌ Errors occurred during the pipeline run, see logs for more details.

Parameters:
LLM=Ollama (gemma2)
Embedding LLM = vllm (e5-mistral-7b-instruct)

Example 1:
Entity_types: [person, technology, mission, organization, location]
Text:
while Alex clenched his jaw, the buzz of frustration dull against the backdrop of ...
################
Output:
("entity"{tuple_delimiter}"<name>"{tuple_delimiter}"<Entity_types>"{tuple_delimiter}"Alex is a character who experiences frustration and is observant of the dynamics among other characters."){record_delimiter}

Then in the file at <workfolder>/cache/entity_extration/chat-xxxx
the expected format is: "entity"{tuple_delimiter}"<name>"{tuple_delimiter}"<Entity_types>"{tuple_delimiter}"
but I got:
[screenshot of the malformed entity extraction output]

  • Reason and solution:
    The LLM doesn't seem to understand what the prompt says. There may be various reasons, such as the LLM's max context window, or the service simply not working as expected.
    So I lowered the chunk size from 1200 to 300, and it works successfully (see the sketch of the settings.yaml change below).
    Here is my new chat-xxxx:
    [screenshot of the correctly formatted entity extraction output]
    Hope it will help you!
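
For reference, a rough sketch of the chunking section I changed in settings.yaml (key names as in the default config generated by graphrag init; treat this as an illustration rather than the full file):

chunks:
  size: 300      # lowered from the default 1200
  overlap: 100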

@kakalong136

Seems to work now. Maybe it was just my project being in some weird state, or the env variables having comments, extra variables in .env, or names that weren't exactly right. Works with all the defaults, including the default LLM. Thanks for the help! 💪🏻

How did you solve this problem?

@kakalong136

Sorry for the late reply!

  • I did not change the LLM, only using the default
  • I made a new .env file from scratch that ONLY had the credential for GraphRAG, nothing else
  • Kept the short text in my example
  • Did not change any prompts
  • I made sure to nuke the entire folder every time I tried something new; I made a bash script to help with this (can provide it if someone wants it)

Is that the method?

@simoncelinder
Author

Yes!

@shellchange

Echoing the fix above: try lowering the chunk size from 1200 to 300 - it works successfully.

@RajSharma1902

Did anyone use llama3.2 and face a similar issue?
Please tell me how you solved it.
