An applied-research exploration: train and probe a CLIP-based Visual Natural Language Autoencoder, extending the NLA technique (overview) from language-model activations to image embeddings.
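To make the "probe" half concrete, here is a minimal sketch of the standard starting point: linear probing on frozen CLIP image embeddings, using HuggingFace transformers' `CLIPModel`. The checkpoint name, the probe head, the class count, and the hyperparameters are illustrative assumptions, not this project's actual setup, and the natural-language bottleneck of the NLA itself is not shown here.

```python
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

# Frozen CLIP image encoder; the checkpoint is an assumed placeholder.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_images(images):
    """Map a list of PIL images to L2-normalized CLIP image embeddings."""
    inputs = processor(images=images, return_tensors="pt")
    feats = clip.get_image_features(**inputs)  # (N, 512) for ViT-B/32
    return feats / feats.norm(dim=-1, keepdim=True)

# Hypothetical linear probe trained on top of the frozen embeddings.
NUM_CLASSES = 10  # assumed; depends on the probing task
probe = nn.Linear(clip.config.projection_dim, NUM_CLASSES)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def probe_step(images, labels):
    """One gradient step on the probe; `labels` is a LongTensor of class ids."""
    logits = probe(embed_images(images))
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The same scaffold carries over to the autoencoder side: once a Visual NLA is trained, the linear head can be swapped for probes over its reconstructed embeddings or its text bottleneck.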
The aim is a credible, publishable artifact: a written explainer and an open-source training-and-probing repo. A hosted demo will follow once the model is trained.
Entries log the daily decisions, dead-ends, and small wins as the work moves from “read the paper” to “shipped artifact.”