OpenAI Extends GPT-3 to Combine NLP with Images
A pair of neural networks unleashed by GPT-3 developer OpenAI use text in the form of image captions as a way of generating images, a predictive approach that developers said will help AI systems better understand language by providing context for deciphering the meaning of words, phrases and sentences.
The two models, an offshoot of OpenAI’s third-generation language generator, are dubbed CLIP and DALL-E and are detailed in Jan. 5 posts on the OpenAI blog. Both neural nets are designed to generate models that understand both images and related text. The hope is that those upgraded language models will be able to decipher images in a way that approaches how humans interpret the world.
CLIP, for Contrastive Language-Image Pre-training, learns visual concepts via natural language supervision gleaned from web images. OpenAI said CLIP works by “providing the names of the visual categories to be recognized.”
Developers said that when applied to image classification, the model can be instructed to perform a range of benchmarks without being optimized for each test. “By not directly optimizing for the benchmark, we show that it becomes much more representative,” OpenAI said, claiming the CLIP approach closes the “robustness gap” by up to 75 percent.
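The idea of classifying an image by "providing the names of the visual categories" can be illustrated with a minimal Python sketch. It is not OpenAI's code: the three-dimensional embeddings are made-up toy values standing in for the outputs of an image encoder and a text encoder, and the function names and the temperature value are assumptions chosen for illustration. The sketch only shows the general pattern of scoring an image against candidate captions by cosine similarity and picking the best match.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, label_embeddings, temperature=0.07):
    """Score one image against every candidate caption.

    `temperature` is an illustrative scaling value (an assumption here);
    smaller values make the resulting distribution sharper.
    """
    labels = list(label_embeddings)
    scaled = [
        cosine_similarity(image_embedding, label_embeddings[label]) / temperature
        for label in labels
    ]
    return dict(zip(labels, softmax(scaled)))


# Toy embeddings (hypothetical): a real system would obtain these from
# trained image and text encoders rather than hand-written numbers.
label_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of an avocado": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_embedding, label_embeddings)
best_label = max(probs, key=probs.get)
print(best_label)
```

Because the categories are supplied as text at prediction time, swapping in a different set of captions changes the classifier without any retraining, which is the property the benchmark claims rest on.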
DALL-E, a reference to artist Salvador Dali blended with Pixar’s WALL-E, creates images from text captions to express concepts. The 12-billion-parameter version of GPT-3 generates images from text descriptions rather than labeled data, providing the model with more context about meaning.
Developers refer to DALL-E as a “transformer language model” able to receive both text and image as a single data stream. “This training procedure allows DALL-E to not only generate an image from scratch, but also to regenerate any rectangular region of an existing image… in a way that is consistent with the text prompt.”
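What "a single data stream" might look like can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DALL-E's actual tokenization: the vocabulary size, the token values and the helper names `build_stream` and `split_stream` are all hypothetical. The sketch simply places caption tokens first and image tokens after them, shifting the image token ids so the two vocabularies do not collide.

```python
def build_stream(text_tokens, image_tokens, text_vocab_size):
    """Concatenate caption tokens and image tokens into one sequence.

    Image token ids are shifted by `text_vocab_size` so that text and
    image tokens occupy separate ranges of a shared vocabulary.
    """
    return list(text_tokens) + [t + text_vocab_size for t in image_tokens]


def split_stream(stream, num_text_tokens, text_vocab_size):
    """Recover the caption tokens and the original image token ids."""
    text = stream[:num_text_tokens]
    image = [t - text_vocab_size for t in stream[num_text_tokens:]]
    return text, image


TEXT_VOCAB_SIZE = 100          # hypothetical size, for illustration only
caption_tokens = [5, 17, 42]   # made-up ids for a tokenized caption
image_tokens = [0, 3, 3, 1]    # made-up ids for a tokenized image

stream = build_stream(caption_tokens, image_tokens, TEXT_VOCAB_SIZE)
print(stream)  # [5, 17, 42, 100, 103, 103, 101]
```

With text and image in one sequence, a model that predicts the next token can, in principle, continue a caption with image tokens, or refill part of the image portion while conditioning on the caption that precedes it.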
The result was a demonstrated capability to create “plausible images” based on text that reflected the subtleties of human language, including “the ability to combine disparate ideas to synthesize objects,” such as an armchair in the shape of an avocado (shown).
DALL-E also extends a GPT-3 capability called “zero-shot reasoning,” a robust form of common-sense machine learning. OpenAI asserts DALL-E “extends this capability to the visual domain, and is able to perform several kinds of image-to-image translation tasks when prompted in the right way.”
“We live in a visual world,” Ilya Sutskever, chief scientist at OpenAI, told MIT Technology Review. “In the long run, you’re going to have models which understand both text and images. AI will be able to understand language better because it can see what words and sentences mean.”
George Leopold has written about science and technology for more than 30 years, focusing on electronics and aerospace technology. He previously served as executive editor of Electronic Engineering Times. Leopold is the author of "Calculated Risk: The Supersonic Life and Times of Gus Grissom" (Purdue University Press, 2016).