Generating Fine Details of Entity Interactions

Xinyi Gu1, Jiayuan Mao1
1Massachusetts Institute of Technology
Figure 1: A gallery of DetailScribe-generated images on InterActing prompts, showcasing rich entity interactions.

Abstract

Images not only depict objects but also encapsulate rich interactions between them. However, generating faithful, high-fidelity images of multiple entities interacting with each other is a long-standing challenge. While pre-trained text-to-image models are trained on large-scale datasets to follow diverse text instructions, they struggle to generate accurate interactions, likely due to the scarcity of training data for uncommon object interactions.

This paper introduces InterActing, an interaction-focused dataset with 1000 fine-grained prompts covering three key scenarios: (1) functional and action-based interactions, (2) compositional spatial relationships, and (3) multi-subject interactions.

To address interaction generation challenges, we propose a decomposition-augmented refinement procedure. Our approach, DetailScribe, built on Stable Diffusion 3.5, leverages LLMs to decompose interactions into finer-grained concepts, uses a VLM to critique generated images, and applies targeted interventions to the diffusion process during refinement. Automatic and human evaluations show significantly improved image quality, demonstrating the potential of enhanced inference strategies.

InterActing

1000 interaction-focused prompts spanning three scenarios.

Scenario                                       | Subclass                 | Examples
Functional and Action-Based Interactions (600) | Tool Manipulation (227)  | cutting, painting, sailing, stirring, taking a photo
                                               | Physical Contact (373)   | sculpting snow, stacking, holding
Compositional Spatial Relationships (200)      | Abstract Layouts (183)   | tic-tac-toe, table, atom, solar system, forest, tree, bookshelf
                                               | Geometric Patterns (17)  | zig-zag pattern, circle, center
Multi-subject Interactions (200)               | Interaction (200)        | huddling, high-five, collaborating to lift, weaving leaves together, sharing food

Table 1: The InterActing dataset contains 1000 text-to-image prompts. We categorize them into subclasses and report the count of each.

Evaluation Metrics

Because it is challenging to assess whether an image aligns with the prompt's description, we primarily rely on human evaluation, reported as a human Likert scale. We further explore VLMs and pre-trained metrics for automatic evaluation.
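As an illustration, below is a hedged sketch of how a GPT-4o Likert rating could be collected through the OpenAI chat API; the rubric wording and the single-digit answer format are illustrative assumptions, not the exact evaluation prompt used in our experiments.

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def vlm_likert_score(image_path: str, prompt: str, model: str = "gpt-4o") -> int:
    """Ask a VLM for a 1-5 faithfulness rating of an image against its prompt."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": (f"On a 1-5 Likert scale, how faithfully does this image "
                          f"depict the prompt: '{prompt}'? Answer with a single digit.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return int(response.choices[0].message.content.strip()[0])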

These automatic evaluations are inherently noisier, so we compare their agreement with human preferences on sampled image pairs generated by all models. Overall, the VLM evaluator achieves the highest agreement at 90.4%, compared to the other metrics: ImageReward (73.6%), CLIPScore (70.4%), and BLIP-VQA (67.6%).
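A minimal sketch of the agreement computation, under the assumption that a metric "agrees" with humans on an image pair whenever it prefers the same image the raters preferred:

from typing import Sequence, Tuple

def pairwise_agreement(metric_scores: Sequence[Tuple[float, float]],
                       human_prefs: Sequence[int]) -> float:
    """metric_scores[i] holds (score for image A, score for image B) of the i-th pair;
    human_prefs[i] is 0 if raters preferred image A and 1 if they preferred image B."""
    assert len(metric_scores) == len(human_prefs)
    hits = 0
    for (score_a, score_b), pref in zip(metric_scores, human_prefs):
        metric_pref = 0 if score_a >= score_b else 1  # image the metric prefers
        hits += int(metric_pref == pref)
    return hits / len(human_prefs)

# e.g. pairwise_agreement([(0.91, 0.84), (0.72, 0.80)], [0, 1]) -> 1.0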

DetailScribe

A refinement-augmented image generation framework utilizing the reasoning ability of LLMs.

DetailScribe operates in three stages:

1) given an input natural language prompt, a large language model hierarchically decomposes it into detailed sub-concepts;

2) an initial image is generated from the prompt using a text-to-image model, followed by a vision-language model critique conditioned on both the decomposed sub-concepts and the generated image;

3) based on the critique, the prompt is refined and a re-denoising process corrects errors, yielding a more faithful and realistic generated image.
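The sketch below illustrates this three-stage loop; the helper names (decompose_prompt, generate, critique_and_refine, re_denoise) are hypothetical placeholders for the LLM, text-to-image, and VLM calls, not the released implementation.

from dataclasses import dataclass
from typing import Any, Callable, List

Image = Any  # placeholder type for a generated image

@dataclass
class Critique:
    errors: List[str]     # mismatches the VLM found against the sub-concepts
    refined_prompt: str   # corrected prompt for the second-round diffusion

def detail_scribe(
    prompt: str,
    decompose_prompt: Callable[[str], List[str]],                       # stage 1: LLM
    generate: Callable[[str], Image],                                   # text-to-image model
    critique_and_refine: Callable[[str, List[str], Image], Critique],   # stage 2: VLM
    re_denoise: Callable[[Image, str], Image],                          # stage 3: re-denoising
) -> Image:
    sub_concepts = decompose_prompt(prompt)                       # 1) hierarchical decomposition
    image = generate(prompt)                                      # 2) initial generation
    critique = critique_and_refine(prompt, sub_concepts, image)   #    VLM critique
    if not critique.errors:                                       # nothing flagged: keep first draft
        return image
    return re_denoise(image, critique.refined_prompt)             # 3) targeted re-denoising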

Figure 2: The overall pipeline of DetailScribe.
Figure 3: A closer look at the VLM-based critique and prompt refinement. Given the LLM-generated concept decomposition and an image generated from the user input, a vision-language model generates a critique of errors in the image, suggests corrections, and finally refines the prompt. This refined prompt is used in a second-round diffusion process.
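One plausible way to realize the second-round diffusion is an img2img-style partial re-denoise with the refined prompt, e.g. via the diffusers StableDiffusion3Img2ImgPipeline; the sketch below rests on that assumption and is not necessarily the exact intervention used in the paper.

import torch
from diffusers import StableDiffusion3Img2ImgPipeline

pipe = StableDiffusion3Img2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
).to("cuda")

def re_denoise(initial_image, refined_prompt: str, strength: float = 0.6):
    # strength < 1.0 restarts the trajectory from an intermediate noise level, keeping
    # the overall composition while letting the refined prompt fix interaction details.
    return pipe(
        prompt=refined_prompt,
        image=initial_image,
        strength=strength,
        num_inference_steps=28,
        guidance_scale=7.0,
    ).images[0]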

Results

Evaluation

We evaluate the models on the three scenarios of the InterActing dataset and report the results separately. Because high-quality human evaluation does not scale easily, we sampled 50 prompts from InterActing for both human and automatic evaluation and compared the agreement between them. Table 2 shows the average human/VLM Likert scores (1-5) and pre-trained metric scores on the three scenarios of the sampled InterActing prompts.

We report the human Likert score (Human), the VLM evaluation score (GPT-4o), as well as the ImageReward (ImReward), CLIPScore (CLIPS.), and BLIP-VQA (B-VQA) scores.

               | Functional Relation                 | Compositional Relation              | Multi-subject Interaction
Model          | Human GPT-4o ImReward CLIPS. B-VQA  | Human GPT-4o ImReward CLIPS. B-VQA  | Human GPT-4o ImReward CLIPS. B-VQA
SD3.5          | 3.360 4.680  1.657    0.971  0.366  | 3.583 4.533  1.415    0.898  0.379  | 3.225 4.000  1.171    0.894  0.306
+ GPT Rewrite  | 3.770 4.680  1.546    0.921  0.245  | 3.167 4.533  1.283    0.867  0.325  | 3.275 4.200  1.101    0.890  0.292
+ GPT Refine   | 3.450 4.200  1.524    0.951  0.290  | 3.667 4.867  1.390    0.889  0.389  | 3.175 3.700  1.022    0.858  0.292
+ Multi-seed   | 3.270 4.560  1.718    0.985  0.434  | 3.650 4.733  1.538    0.912  0.423  | 3.375 4.000  1.149    0.903  0.302
DALL·E 3       | 3.940 4.720  1.535    0.880  0.226  | 3.433 4.867  1.382    0.838  0.367  | 3.775 4.600  1.111    0.813  0.286
DetailScribe   | 4.280 4.960  1.761    0.998  0.449  | 4.283 5.000  1.545    0.923  0.485  | 3.800 4.400  1.326    0.907  0.343

Table 2: DetailScribe receives the highest scores according to human preference in all scenarios.

Based on this comparison, the VLM evaluator achieves the highest agreement with human evaluation at 90.4%, compared to the other metrics: ImageReward (73.6%), CLIPScore (70.4%), and BLIP-VQA (67.6%).

We further present the automatic evaluation on the entire InterActing dataset in Figure 4.

Figure 4: DetailScribe outperforms all baselines on all pre-trained metrics.

Qualitative Comparison

A gallery of qualitative comparison examples (13 images).

BibTeX

@misc{gu2025generatingfinedetailsentity,
      title={Generating Fine Details of Entity Interactions},
      author={Xinyi Gu and Jiayuan Mao},
      year={2025},
      eprint={2504.08714},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.08714}
}