Automate Image Captioning using Multimodal LLMs

This post covers using multimodal large language models (LLMs) to automate image captioning. The rich captions they produce can be used when training Stable Diffusion DreamBooth models or LoRAs.

Video:

https://youtu.be/lBdfAB1SoKc

Links/Resources

Recognize Anything
https://github.com/xinyu1205/recognize-anything

Kosmos-2
https://huggingface.co/microsoft/kosmos-2-patch14-224

BLIP-2 OPT-2.7B 8-bit Quantized Model by Mediocreatmybest
https://huggingface.co/Mediocreatmybest/blip2-opt-2.7b_8bit
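
As a rough illustration of how one of these models could be used to caption a training folder, here is a minimal Python sketch. It is not the workflow from the video. It assumes the BLIP-2 checkpoint linked above loads through the standard Hugging Face transformers BLIP-2 classes (an 8-bit checkpoint will likely also need the accelerate and bitsandbytes packages and a CUDA GPU), and it assumes your trainer reads captions from a text file with the same base name as the image, which is a common convention but worth checking. The names load_blip2_captioner and caption_folder are hypothetical helpers made up for this example.

```python
from pathlib import Path
from typing import Callable, Dict

# Assumption: these are the image types in the training folder.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def load_blip2_captioner(
    model_id: str = "Mediocreatmybest/blip2-opt-2.7b_8bit",
    max_new_tokens: int = 40,
) -> Callable[[Path], str]:
    """Return a function that captions one image with a BLIP-2 checkpoint."""
    # Heavy imports stay local so caption_folder works without them installed.
    import torch
    from PIL import Image
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    processor = Blip2Processor.from_pretrained(model_id)
    model = Blip2ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

    def caption(image_path: Path) -> str:
        image = Image.open(image_path).convert("RGB")
        # No text prompt is passed, so the model produces a plain caption.
        inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)
        with torch.no_grad():
            output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

    return caption


def caption_folder(
    folder: Path,
    captioner: Callable[[Path], str],
    overwrite: bool = False,
) -> Dict[Path, str]:
    """Write a same-named .txt caption next to every image in the folder."""
    written: Dict[Path, str] = {}
    for image_path in sorted(folder.iterdir()):
        if image_path.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        caption_path = image_path.with_suffix(".txt")
        # Keep hand-edited captions unless overwrite is requested.
        if caption_path.exists() and not overwrite:
            continue
        text = captioner(image_path)
        caption_path.write_text(text + "\n", encoding="utf-8")
        written[image_path] = text
    return written
```

A typical call would be caption_folder(Path("train_images"), load_blip2_captioner()), where train_images is a placeholder folder name. Because the captioner is passed in as a plain function, the same loop could be reused with Kosmos-2 or Recognize Anything by swapping in a different captioning function.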

