
add ernie image support #1427

Merged
leejet merged 3 commits into master from ernie
Apr 16, 2026
Conversation

@leejet
Owner

@leejet leejet commented Apr 15, 2026

.\bin\Release\sd-cli.exe --diffusion-model  ..\..\ComfyUI\models\diffusion_models\ernie-image-turbo.safetensors --vae ..\..\ComfyUI\models\vae\flux2_ae.safetensors  --llm ..\..\ComfyUI\models\text_encoders\ministral-3-3b.safetensors -p "a lovely cat holding a sign says 'ernie.cpp'" --cfg-scale 1.0 --steps 8 -v --offload-to-cpu --diffusion-fa
output

@leejet leejet mentioned this pull request Apr 15, 2026
@candrews

candrews commented Apr 16, 2026

Can an ernie.md file please be added under docs that includes how to use ernie, ernie-turbo, and the prompt enhancer?

@Green-Sky
Contributor

Tried some quants for turbo with flux2 vae smaller decoder #1402:

q6_K output
q5_K output
q5_0 output
q4_K output
q4_0 output
q3_K output

Quants work really well with this model. Must be the arch.

@Green-Sky
Contributor

1280x1280 q4_k turbo small-vae

[INFO ] ggml_extend.hpp:1957 - ernie_image offload params (4357.36 MB, 409 tensors) to runtime backend (CUDA0), taking 0.38s
[DEBUG] ggml_extend.hpp:1859 - ernie_image compute buffer size: 1011.37 MB(VRAM)
  |==================================================| 8/8 - 8.23s/it
[INFO ] stable-diffusion.cpp:3195 - sampling completed, taking 66.26s

Looks really good.

@kuhnchris

kuhnchris commented Apr 16, 2026

Interesting: the text_encoder (the .safetensors) seems to fail to load, as it appears to have all the data under a sub-node "language_model". Using a different Ministral-3B does work, though.

Failing:
VAE: https://huggingface.co/baidu/ERNIE-Image-Turbo/blob/main/vae/diffusion_pytorch_model.safetensors
LLM: https://huggingface.co/baidu/ERNIE-Image-Turbo/tree/main/text_encoder
Model: https://huggingface.co/unsloth/ERNIE-Image-Turbo-GGUF/blob/main/ernie-image-turbo-Q8_0.gguf

~/sd-cli-ernie --diffusion-model ernie-image-turbo-Q8_0.gguf --vae ERNIE-vae.safetensors --llm ERNIE-llm.safetensors -H 1024 -W 1024 --diffusion-fa --flow-shift 3 -p 'An playing card rave, cartoon/anime style, flashing disco lights' -o test.png
...
[INFO ] model.cpp:1617 - unknown tensor 'text_encoders.llm.language_model.model.embed_tokens.weight | bf16 | 2 [3072, 131072, 1, 1, 1]' in model file
[INFO ] model.cpp:1617 - unknown tensor 'text_encoders.llm.language_model.model.layers.0.input_layernorm.weight | bf16 | 1 [3072, 1, 1, 1, 1]' in model file
[INFO ] model.cpp:1617 - unknown tensor 'text_encoders.llm.language_model.model.layers.0.mlp.down_proj.weight | bf16 | 2 [9216, 3072, 1, 1, 1]' in model file
[INFO ] model.cpp:1617 - unknown tensor 'text_encoders.llm.language_model.model.layers.0.mlp.gate_proj.weight | bf16 | 2 [3072, 9216, 1, 1, 1]' in model file
...
[ERROR] model.cpp:1658 - tensor 'text_encoders.llm.model.embed_tokens.weight' not in model file
[ERROR] model.cpp:1658 - tensor 'text_encoders.llm.model.layers.0.input_layernorm.weight' not in model file
[ERROR] model.cpp:1658 - tensor 'text_encoders.llm.model.layers.0.mlp.down_proj.weight' not in model file
[ERROR] model.cpp:1658 - tensor 'text_encoders.llm.model.layers.0.mlp.gate_proj.weight' not in model file

Working:
VAE: https://huggingface.co/baidu/ERNIE-Image-Turbo/blob/main/vae/diffusion_pytorch_model.safetensors
LLM: https://huggingface.co/unsloth/Ministral-3-3B-Instruct-2512-GGUF?show_file_info=Ministral-3-3B-Instruct-2512-UD-Q4_K_XL.gguf
Model: https://huggingface.co/unsloth/ERNIE-Image-Turbo-GGUF/blob/main/ernie-image-turbo-Q8_0.gguf

~/sd-cli-ernie --diffusion-model ernie-image-turbo-Q8_0.gguf --vae ERNIE-vae.safetensors --llm Ministral-3-3B-Instruct-2512-UD-Q4_K_XL.gguf --diffusion-fa -p 'An playing card rave, cartoon/anime style' -o test.png --cfg-scale 1.0 --steps 8 -v
--
[INFO ] auto_encoder_kl.hpp:517  - vae decoder: ch = 128
[DEBUG] ggml_extend.hpp:2046 - vae params backend buffer size =  94.72 MB(VRAM) (140 tensors)
[INFO ] stable-diffusion.cpp:774  - Using flash attention in the diffusion model
[DEBUG] stable-diffusion.cpp:803  - loading weights
[DEBUG] model.cpp:1333 - using 48 threads for model loading
[DEBUG] model.cpp:1355 - loading tensors from ernie-image-turbo-Q8_0.gguf
  |======================>                           | 409/893 - 4.88GB/s
[DEBUG] model.cpp:1355 - loading tensors from Ministral-3-3B-Instruct-2512-UD-Q4_K_XL.gguf
  |====================================>             | 645/893 - 3.44GB/s
[DEBUG] model.cpp:1355 - loading tensors from ERNIE-vae.safetensors
  |==================================================| 893/893 - 3.43GB/s
[INFO ] model.cpp:1584 - loading tensors completed, taking 2.99s (process: 0.00s, read: 0.07s, memcpy: 0.00s, convert: 0.02s, copy_to_backend: 1.70s)
[DEBUG] stable-diffusion.cpp:843  - finished loaded file
[INFO ] stable-diffusion.cpp:895  - total params memory size = 11690.70MB (VRAM 11690.70MB, RAM 0.00MB): text_encoders 3303.90MB(VRAM), diffusion_model 8292.08MB(VRAM), vae 94.72MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:977  - running in FLOW mode
[INFO ] stable-diffusion.cpp:3130 - generate_image 512x512
[INFO ] denoiser.hpp:499  - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:2706 - sampling using Euler method
[DEBUG] conditioner.hpp:1695 - parse 'An playing card rave, cartoon/anime style' to [['', 1], ['An playing card rave, cartoon/anime style', 1], ['', 1], ]
[DEBUG] bpe_tokenizer.cpp:183  - split prompt "" to tokens []
[DEBUG] bpe_tokenizer.cpp:183  - split prompt "An playing card rave, cartoon/anime style" to tokens ["An", "Ġplaying", "Ġcard", "Ġra", "ve", ",", "Ġcartoon", "/an", "ime", "Ġstyle", ]
[DEBUG] bpe_tokenizer.cpp:183  - split prompt "" to tokens []
[DEBUG] ggml_extend.hpp:1859 - ministral3.3b compute buffer size: 1.20 MB(VRAM)
[DEBUG] conditioner.hpp:1949 - computing condition graph completed, taking 101 ms
[INFO ] stable-diffusion.cpp:3060 - get_learned_condition completed, taking 0.10s
[INFO ] stable-diffusion.cpp:3164 - generating image: 1/1 - seed 42
[DEBUG] ggml_extend.hpp:1859 - ernie_image compute buffer size: 163.19 MB(VRAM)
  |==================================================| 8/8 - 3.08it/s
[INFO ] stable-diffusion.cpp:3195 - sampling completed, taking 2.65s
[INFO ] stable-diffusion.cpp:3213 - generating 1 latent images completed, taking 2.65s
[INFO ] stable-diffusion.cpp:3084 - decoding 1 latents
[DEBUG] ggml_extend.hpp:1859 - vae compute buffer size: 1664.50 MB(VRAM)
[DEBUG] vae.hpp:206  - computing vae decode graph completed, taking 0.13s
[INFO ] stable-diffusion.cpp:3100 - latent 1 decoded, taking 0.13s
[INFO ] stable-diffusion.cpp:3104 - decode_first_stage completed, taking 0.13s
[INFO ] stable-diffusion.cpp:3225 - generate_image completed in 2.95s
[INFO ] main.cpp:438  - save result image 0 to 'test.png' (success)
[INFO ] main.cpp:487  - 1/1 images saved

While this works, as soon as I provide any of the officially supported width/height parameters (-W 1024 -H 1024), I only get a white output...

Resolution:
1024x1024
848x1264
1264x848
768x1376
896x1200
1376x768
1200x896
 ~/sd-cli-ernie --diffusion-model ernie-image-turbo-Q8_0.gguf --vae ERNIE-vae.safetensors --llm Ministral-3-3B-Instruct-2512-UD-Q4_K_XL.gguf --diffusion-fa -p 'An playing card rave, cartoon/anime style' -o test.png --cfg-scale 1.0 --steps 8 -v -W 1024 -H 1024 --seed 32

[INFO ] stable-diffusion.cpp:267  - loading diffusion model from 'ernie-image-turbo-Q8_0.gguf'
[INFO ] model.cpp:331  - load ernie-image-turbo-Q8_0.gguf using gguf format
[DEBUG] model.cpp:377  - init from 'ernie-image-turbo-Q8_0.gguf'
[INFO ] stable-diffusion.cpp:314  - loading llm from 'Ministral-3-3B-Instruct-2512-UD-Q4_K_XL.gguf'
[INFO ] model.cpp:331  - load Ministral-3-3B-Instruct-2512-UD-Q4_K_XL.gguf using gguf format
[DEBUG] model.cpp:377  - init from 'Ministral-3-3B-Instruct-2512-UD-Q4_K_XL.gguf'
[INFO ] stable-diffusion.cpp:328  - loading vae from 'ERNIE-vae.safetensors'
[INFO ] model.cpp:334  - load ERNIE-vae.safetensors using safetensors format
[DEBUG] model.cpp:468  - init from 'ERNIE-vae.safetensors', prefix = 'vae.'
[INFO ] stable-diffusion.cpp:353  - Version: Ernie Image
[INFO ] stable-diffusion.cpp:381  - Weight type stat:                      f32: 203  |    q8_0: 253  |    q4_K: 100  |    q5_K: 30   |    q6_K: 33   |  iq4_xs: 20   |    bf16: 254
[INFO ] stable-diffusion.cpp:382  - Conditioner weight type stat:          f32: 53   |    q4_K: 100  |    q5_K: 30   |    q6_K: 33   |  iq4_xs: 20
[INFO ] stable-diffusion.cpp:383  - Diffusion model weight type stat:      f32: 150  |    q8_0: 253  |    bf16: 6
[INFO ] stable-diffusion.cpp:384  - VAE weight type stat:                 bf16: 248
[DEBUG] stable-diffusion.cpp:386  - ggml tensor size = 400 bytes
[DEBUG] mistral_tokenizer.cpp:23   - vocab size: 131072
[DEBUG] mistral_tokenizer.cpp:31   - merges size 269443
[DEBUG] llm.hpp:693  - llm: num_layers = 26, vocab_size = 131072, hidden_size = 3072, intermediate_size = 9216
[INFO ] ernie_image.hpp:376  - ernie_image: layers = 36, hidden_size = 4096, heads = 32, ffn_hidden_size = 12288, in_channels = 128, out_channels = 128
[DEBUG] ggml_extend.hpp:2046 - ministral3.3b params backend buffer size =  3303.90 MB(VRAM) (236 tensors)
[DEBUG] ggml_extend.hpp:2046 - ernie_image params backend buffer size =  8292.08 MB(VRAM) (409 tensors)
[INFO ] stable-diffusion.cpp:679  - using VAE for encoding / decoding
[INFO ] auto_encoder_kl.hpp:517  - vae decoder: ch = 128
[DEBUG] ggml_extend.hpp:2046 - vae params backend buffer size =  94.72 MB(VRAM) (140 tensors)
[INFO ] stable-diffusion.cpp:774  - Using flash attention in the diffusion model
[DEBUG] stable-diffusion.cpp:803  - loading weights
[DEBUG] model.cpp:1333 - using 48 threads for model loading
[DEBUG] model.cpp:1355 - loading tensors from ernie-image-turbo-Q8_0.gguf
  |======================>                           | 409/893 - 4.77GB/s
[DEBUG] model.cpp:1355 - loading tensors from Ministral-3-3B-Instruct-2512-UD-Q4_K_XL.gguf
  |====================================>             | 645/893 - 3.41GB/s
[DEBUG] model.cpp:1355 - loading tensors from ERNIE-vae.safetensors
  |==================================================| 893/893 - 3.39GB/s
[INFO ] model.cpp:1584 - loading tensors completed, taking 3.02s (process: 0.00s, read: 0.07s, memcpy: 0.00s, convert: 0.02s, copy_to_backend: 1.74s)
[DEBUG] stable-diffusion.cpp:843  - finished loaded file
[INFO ] stable-diffusion.cpp:895  - total params memory size = 11690.70MB (VRAM 11690.70MB, RAM 0.00MB): text_encoders 3303.90MB(VRAM), diffusion_model 8292.08MB(VRAM), vae 94.72MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:977  - running in FLOW mode
[INFO ] stable-diffusion.cpp:3130 - generate_image 1024x1024
[INFO ] denoiser.hpp:499  - get_sigmas with discrete scheduler
[INFO ] stable-diffusion.cpp:2706 - sampling using Euler method
[DEBUG] conditioner.hpp:1695 - parse 'An playing card rave, cartoon/anime style' to [['', 1], ['An playing card rave, cartoon/anime style', 1], ['', 1], ]
[DEBUG] bpe_tokenizer.cpp:183  - split prompt "" to tokens []
[DEBUG] bpe_tokenizer.cpp:183  - split prompt "An playing card rave, cartoon/anime style" to tokens ["An", "Ġplaying", "Ġcard", "Ġra", "ve", ",", "Ġcartoon", "/an", "ime", "Ġstyle", ]
[DEBUG] bpe_tokenizer.cpp:183  - split prompt "" to tokens []
[DEBUG] ggml_extend.hpp:1859 - ministral3.3b compute buffer size: 1.20 MB(VRAM)
[DEBUG] conditioner.hpp:1949 - computing condition graph completed, taking 110 ms
[INFO ] stable-diffusion.cpp:3060 - get_learned_condition completed, taking 0.11s
[INFO ] stable-diffusion.cpp:3164 - generating image: 1/1 - seed 32
[DEBUG] ggml_extend.hpp:1859 - ernie_image compute buffer size: 647.69 MB(VRAM)
  |==================================================| 8/8 - 1.23s/it
[INFO ] stable-diffusion.cpp:3195 - sampling completed, taking 10.05s
[INFO ] stable-diffusion.cpp:3213 - generating 1 latent images completed, taking 10.06s
[INFO ] stable-diffusion.cpp:3084 - decoding 1 latents
[DEBUG] ggml_extend.hpp:1859 - vae compute buffer size: 6658.00 MB(VRAM)
[DEBUG] vae.hpp:206  - computing vae decode graph completed, taking 0.49s
[INFO ] stable-diffusion.cpp:3100 - latent 1 decoded, taking 0.51s
[INFO ] stable-diffusion.cpp:3104 - decode_first_stage completed, taking 0.51s
[INFO ] stable-diffusion.cpp:3225 - generate_image completed in 10.83s
[INFO ] main.cpp:438  - save result image 0 to 'test.png' (success)
[INFO ] main.cpp:487  - 1/1 images saved

However, removing the "-W" and "-H" parameters makes it all work again.
Not sure whether this is connected to using the unsloth GGUFs, though.
Without -W and -H the generation is also far faster.

[DEBUG] ggml_extend.hpp:1859 - ernie_image compute buffer size: 163.19 MB(VRAM)
  |==================================================| 8/8 - 3.08it/s

(the default output seems to be 512x512, so that would check out)
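A quick arithmetic check supports that: the diffusion compute buffer should grow roughly with latent area, and the two buffer sizes from the logs above match a 4x area ratio (a rough sanity check only; the exact buffer accounting is internal to sd.cpp):

```python
# Compute-buffer sizes reported in the two runs above (MB). Going from
# 512x512 to 1024x1024 quadruples the latent area, so the buffer should
# grow by roughly the same factor.
buf_512 = 163.19
buf_1024 = 647.69
area_ratio = (1024 * 1024) / (512 * 512)  # 4.0
print(buf_1024 / buf_512)  # ~3.97, consistent with the 4x area ratio
```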

@GreenShadows

It would probably be a little faster once SD.cpp syncs with GGML and incorporates the latest optimizations.
ggml-org/llama.cpp#21713

@leejet
Owner Author

leejet commented Apr 16, 2026

Can an ernie.md file please be added under docs that includes how to use ernie, ernie-turbo, and the prompt enhancer?

Done — ernie_image.md has been added under docs with usage for ernie image and ernie image turbo.

The prompt enhancer isn’t built into sd.cpp; it’s just standard LLM-based prompt expansion and can be done via tools like llama.cpp or ChatGPT / Gemini.

@leejet
Owner Author

leejet commented Apr 16, 2026

Interesting: the text_encoder (the .safetensors) seems to fail to load, as it appears to have all the data under a sub-node "language_model". Using a different Ministral-3B does work, though.

@kuhnchris This is just a naming convention issue. You can download the compatible .safetensors files here: https://huggingface.co/Comfy-Org/ERNIE-Image/tree/main/text_encoders
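If you'd rather patch the official file than re-download, the mismatch is mechanical: every key sd.cpp expects is present, just nested under a `language_model.` prefix. A minimal rename sketch (key names taken from the error log above; rewriting a real file would additionally need the `safetensors` library, shown only as comments):

```python
# Strip the "language_model." sub-node so the official ERNIE text_encoder
# keys match the names sd.cpp looks for (names taken from the error log).
PREFIX = "language_model."

def strip_prefix(name: str) -> str:
    return name[len(PREFIX):] if name.startswith(PREFIX) else name

# Rewriting the actual file would look roughly like (requires `safetensors`):
#   tensors = safetensors.torch.load_file("ERNIE-llm.safetensors")
#   safetensors.torch.save_file(
#       {strip_prefix(k): v for k, v in tensors.items()},
#       "ERNIE-llm-fixed.safetensors")

print(strip_prefix("language_model.model.embed_tokens.weight"))
# -> model.embed_tokens.weight
```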

@leejet leejet merged commit 5c243db into master Apr 16, 2026
10 of 15 checks passed
@candrews

For the fun of it, instead of using a vae, I tried using the taef2 tae with --tae taef2.safetensors, and it does not work :)

[INFO ] stable-diffusion.cpp:694  - Using Conv2d direct in the vae model
[INFO ] stable-diffusion.cpp:776  - Using flash attention in the diffusion model
[766B blob data]
[444B blob data]
[WARN ] model.cpp:1433 - process tensor failed: 'tae.decoder.layers.0.weight'
[INFO ] model.cpp:1587 - loading tensors completed, taking 3.07s (process: 0.00s, read: 2.24s, memcpy: 0.00s, convert: 0.09s, copy_to_backend: 0.00s)
[ERROR] model.cpp:1645 - load tensors from file failed
[ERROR] stable-diffusion.cpp:840  - load tensors from model loader failed
[ERROR] main.cpp:90   - new_sd_ctx_t failed

@Green-Sky
Contributor

Green-Sky commented Apr 17, 2026

I encountered some numerical issues on CUDA.

$ result/bin/sd-cli --diffusion-model /run/media/green/d20c801b-7aae-4042-85ec-bf2153257be8/green/workspace/stable-diffusion.cpp/models/ernie-image-Q4_K_M.gguf --llm models/Ministral-3-3B-Base-2512.Q8_0.gguf -t 10 --vae-tiling --vae models/flux2/full_encoder_small_decoder.safetensors --sampling-method euler --scheduler smoothstep --steps 30 --cfg-scale 3 -p "A sceneic view of the Alps. A tree partially visible in the foreground. Greenery mixed with harsh rocks." -n "artwork, blury" -v -W 1920 -H 1280 --offload-to-cpu --diffusion-fa --preview proj

What seems to matter here is the size (full HD 😄) and the step count. Model: ernie-image-Q4_K_M.
At some late point the latents turn black.
Step 28 is the first problematic one; projected preview:
preview
And it turns fully black after that.


Tried 25 steps and it disintegrates at step 24.
preview


A smaller size that still shows artifacts is 1600x1152.
Preview of the last step:
preview
$ result/bin/sd-cli --diffusion-model /run/media/green/d20c801b-7aae-4042-85ec-bf2153257be8/green/workspace/stable-diffusion.cpp/models/ernie-image-Q4_K_M.gguf --llm models/Ministral-3-3B-Base-2512.Q8_0.gguf -t 10 --vae-tiling --vae models/flux2/full_encoder_small_decoder.safetensors --sampling-method euler --scheduler smoothstep --steps 25 --cfg-scale 3 -p "A sceneic view of the Alps. A tree partially visible in the foreground. Greenery mixed with harsh rocks." -n "artwork, blury" -v -W 1600 -H 1152 --offload-to-cpu --diffusion-fa --preview proj


Setting the correct flow shift value --flow-shift 4 makes the problem appear a little later, but it is still severe at full HD. #1433
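sd.cpp has no built-in check for this, but if you dump a latent (or preview) to host memory, a hypothetical diagnostic for this failure mode is simply scanning for non-finite values, since fp16/quantized math overflowing to NaN/Inf is a classic way latents suddenly decode to black (an assumption about the cause, sketched here, not something sd.cpp reports):

```python
import math

# Hypothetical host-side check: a latent that suddenly decodes to black is
# often one that picked up NaN/Inf values during an overflowing step.
def latent_is_healthy(latent):
    return all(math.isfinite(v) for v in latent)

print(latent_is_healthy([0.12, -0.8, 1.5]))      # True
print(latent_is_healthy([0.12, float("nan")]))   # False
```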

@Green-Sky
Contributor

I converted their Ministral prompt-enhancer finetune to GGUF for llama.cpp.

https://huggingface.co/Green-Sky/Ernie-Image-Prompt-Enhancer-Ministral-3B-GGUF

It contains the correct system prompt.

Run it like this:
$ llama-cli -m Ernie-Image-Prompt-Enhancer-Ministral-3.8B-Q8_0.gguf -st -p "A photo of a lovely cat"
or with their JSON format:
$ llama-cli -m Ernie-Image-Prompt-Enhancer-Ministral-3.8B-Q8_0.gguf -st -p "{\"prompt\": \"A lovely cat.\", \"width\": 1024, \"height\": 1024}"
or with automatic download of the model:
$ llama-cli -hf Green-Sky/Ernie-Image-Prompt-Enhancer-Ministral-3B-GGUF:Q8_0 -st -p "{\"prompt\": \"A lovely cat.\", \"width\": 1024, \"height\": 1024}"
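Those escaped quotes are easy to get wrong; if you're scripting this, it may be safer to let a JSON serializer build the request string (a sketch, assuming the finetune expects exactly this prompt/width/height shape):

```python
import json

# Build the enhancer's JSON-style request instead of hand-escaping quotes.
request = json.dumps({"prompt": "A lovely cat.", "width": 1024, "height": 1024})
print(request)
# then pass it along, e.g.: llama-cli -m ... -st -p "$request"
# (shell quoting of the variable still applies)
```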


One issue I have is that it really likes to produce drawings, even when "photograph" is specified. Also, features specified in the prompt never occur. Using the prompt translated to English works, however.

eg: "A photo of a lovely cat"

result:

这是一张高分辨率的自然光写实摄影照片,采用浅景深构图。画面主体是一只浅橘色的长毛虎斑猫,以侧身姿态站立在深棕色木质地板上,头部转向镜头正前方。猫咪拥有圆润饱满的头部,双耳竖立,耳尖略带粉红色,面部覆盖着细密的白色与浅橘色斑纹,鼻子呈淡粉色,双眼为明亮的湛蓝色,眼神清澈直视镜头。其颈部与胸部被蓬松的白色长毛覆盖,毛发质感柔软,呈现自然的层叠纹理,四肢末端的白色毛发清晰可见。背景为室内环境,左侧可见模糊的白色窗帘,后方有深棕色木质护墙板,右侧边缘露出部分白色沙发靠垫,地板呈现深棕色木质纹理,背景物体均因景深效果而呈现柔和的虚化状态。整体光线柔和自然,色调温暖,焦点精准聚焦于猫咪面部,毛发细节清晰,营造出温馨舒适的居家氛围。图片中无任何可见文字。

firefox translation:

This is a high-resolution natural light realistic photographic photograph using a shallow depth of field composition. The main body of the picture is a light orange long-haired tabby cat, standing sideways on the dark brown wooden floor, the head turning to the front of the lens. Cats have rounded and full heads, ears are erected, slightly pink ears, faces are covered with fine white and light orange markings, the nose is light pink, the eyes are bright blue, and the eyes are clear and straight to the lens. Its neck and chest are covered with fluffy white long hair, and the hair texture is soft, showing a natural cascading texture, and the white hair at the end of the limbs is clearly visible. The background is the indoor environment, the left side can be seen blurred white curtains, the rear has a dark brown wooden wall panel, the right edge of the white sofa cushion, the floor presents a dark brown wood texture, the background objects are softly blurred due to the depth of field effect. The overall light is soft and natural, the tone is warm, the focus is precisely focused on the cat's face, and the hair details are clear, creating a warm and comfortable home atmosphere. There is no visible text in the picture.

image (not using the translation):
output

image (using translation):
output

This happens very consistently and strongly smells like a tokenizer issue for Han script.

@Green-Sky
Contributor

You can also use the mmproj files from the base model to prompt it with an existing image, i.e. for captioning.

But while testing that and providing the request to create a prompt for the image in the JSON format, it glitched and created a prompt describing itself. 😆

command:
$ llama-cli -m Ernie-Image-Prompt-Enhancer-Ministral-3.8B-BF16.gguf -st -p "{\"prompt\": \"Provide prompt in english and enhanced.\", \"width\": 1024, \"height\": 1024}" --image girl_portrait.png

resulting prompt:

一张现代感十足的AI图像生成系统操作界面截图,展示了用户输入的原始提示词与系统生成的增强版提示词对比。画面构图为正方形,中心是一个带有圆角的深灰色半透明对话气泡窗口,窗口顶部居中显示白色文字‘AI Prompt Optimizer’。窗口内分为上下两个区域:上方区域标注为‘Original Prompt’,文本框内写有英文内容:‘a futuristic city with floating islands, neon lights, and flying cars’。下方区域标注为‘Enhanced Version’,文本框内详细展示了增强后的描述:‘A hyper-realistic 3D rendering of a futuristic metropolis at night. The scene features towering skyscrapers with intricate architectural details, massive floating islands with organic flora, and a sky filled with glowing flying cars and auras. The atmosphere is vibrant, with deep blue and magenta neon lighting creating a high-contrast cinematic composition.’ 窗口底部右侧有两个交互按钮,左边是浅灰色的‘Reset’,右边是亮蓝色的‘Apply Enhancement’。背景是模糊的深色调科技感工作室环境,带有微弱的蓝色光影。文字均清晰可见,排版整齐,具有现代UI设计的精致感。

translated to english:

A screenshot of the modern AI image generation system operator interface shows the user input of the original prompt and the system-generated enhanced prompt word contrast. At the center is a dark gray translucent dialogue bubble window with rounded corners, with the middle of the window showing the white text ‘AI Prompt Optimizer’. The window is divided into two areas: the upper area is marked ‘Original Prompt’, and the text box reads ‘a futuristic city with floating islands, neon lights, and flying cars’. The area below is marked ‘Enhanced Version’, and the enhanced description is detailed in the text box: ‘A hyper-realistic 3D rendering of a futuristic metropolis at night.’ The scene features towering skyscrapers with intricate architectural details, floating massive islands with organic flora, and a sky filled with glowing flying cars and auras. The atmosphere is vibrant, with deep blue and magenta neon lighting creating a high-contrast cinematic composition. There are two interactive buttons on the right side of the window, a light gray ‘Reset’ on the left and a bright blue ‘Apply Enhanced’ on the right. The background is a fuzzy dark-toned tech-sensing studio environment with a faint blue light and shadow. The text is clearly visible, the layout is neat, and it has the exquisiteness of modern UI design.

resulting image from the English translation:
output


better working command:
$ llama-cli -m Ernie-Image-Prompt-Enhancer-Ministral-3.8B-BF16.gguf -st -p "Provide a prompt for the given image." --image girl_portrait.png

resulting prompt:

这是一张在户外自然环境中拍摄的全身人像照片。画面主体为一名年轻女性,侧身站立,头部转向镜头方向,目光直视前方,面带微笑。她留着黑色长卷发,刘海自然垂落在额前,身穿一件浅蓝色的无袖连衣裙,裙摆长度及膝,脚踩一双白色帆布鞋。她的左手自然下垂,手腕上佩戴着一条细手链。背景是一片生长茂盛的草地,草丛中有零星的白色和黄色小野花。远处是模糊的树林,显示出景深较浅的虚化效果,营造出柔和自然的氛围。光线明亮均匀,呈现出自然的日光色调,阴影柔和,没有强烈的明暗对比。画面构图以人物为中心,背景的绿色与人物的蓝色裙子形成清新的色彩搭配。图片中未见任何文字、水印、Logo或其他标记。

translated to english:

This is a full-body portrait taken in the outdoor natural environment. The main body of the picture is a young woman, standing on the side, the head turned to the direction of the lens, eyes straight ahead, smiling. She kept black long curly hair, and bangs naturally fell on her forehead, wearing a light blue sleeveless dress, skirt length and knees, and a pair of white canvas shoes on her feet. Her left hand naturally sagging, wearing a fine bracelet on her wrist. The background is a lush grass with scattered white and yellow wildflowers in the grass. In the distance are vague woods, showing a shallow blurring effect, creating a soft and natural atmosphere. The light is bright and uniform, showing a natural daylight tone, a soft shadow, and no strong light-dark contrast. The picture composition is centered on the person, and the green of the background is matched with the blue dress of the character. No text, watermark, logo or other tags are found in the image.

resulting image:
output

So it kinda works.

https://huggingface.co/unsloth/Ministral-3-3B-Instruct-2512-GGUF/blob/main/mmproj-F16.gguf
