AI just gets weirder and weirder. And creepier. Researchers at the Allen Institute for AI have published new research that builds on OpenAI’s GPT-3 machine learning tech to generate images from scratch based on nothing but photo captions.
It’s kind of the reverse of what Facebook does when you upload a photo to the platform and it generates captions. Here, you feed it captions and it generates the photo.
GPT-3 is part of a group of AI models known as “transformers”, which became popular with the success of Google’s BERT language system. BERT’s so good at understanding language that Google now uses it to provide more relevant results through its search engine.
Before BERT, AI language models weren’t that great, but Google changed the game by introducing a technique called “masking”. The technique hides a word in a sentence and asks the model to fill in the blank. A couple of examples mentioned in the MIT Technology Review article include…
- The woman went to the ___ to work out.
- They bought a ___ of bread to make sandwiches.
Doing this millions of times forces the language model to examine these sentences and fill in the blank over and over. Along the way, it picks up the patterns of words, sentences and paragraphs, and it gets better at understanding the meaning of language itself.
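To make the fill-in-the-blank idea concrete, here’s a deliberately tiny sketch in Python. It is nothing like BERT (no neural network, just word co-occurrence counts over a toy corpus that I’ve made up for illustration), but it shows the shape of the task: hide a word, then guess it from the words around it.

```python
from collections import Counter

# Toy corpus (made up for this sketch). A real model trains on billions
# of words; the principle of "predict the hidden word from its context"
# is the same.
corpus = [
    "the woman went to the gym to work out",
    "they bought a loaf of bread to make sandwiches",
    "he went to the gym to lift weights",
    "she bought a loaf of bread at the bakery",
]

# For every word, count how often each other word appears in the same sentence.
cooccur = {}
for sentence in corpus:
    words = sentence.split()
    for i, target in enumerate(words):
        context = words[:i] + words[i + 1:]
        cooccur.setdefault(target, Counter()).update(context)

def fill_in_blank(masked_sentence):
    """Guess the "___" word: pick the candidate whose learned contexts
    best overlap the words surrounding the blank."""
    context = [w for w in masked_sentence.split() if w != "___"]
    # Only consider words not already present in the sentence.
    candidates = [w for w in cooccur if w not in context]
    return max(candidates, key=lambda c: sum(cooccur[c][w] for w in context))

print(fill_in_blank("the woman went to the ___ to work out"))  # gym
print(fill_in_blank("they bought a ___ of bread"))             # loaf
```

Even this crude counting trick fills in the blanks from the article’s two examples correctly, which hints at why the masking objective is such a powerful way to learn language patterns at scale.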
This model was extended to include images. The blank was still there, but an image was provided alongside the sentence to help the model identify the missing word. This training meant that, in theory, the AI not only understood the context of language well enough to figure out the missing word, but also had some idea of what that word looks like.
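Extending the previous sketch in the same toy spirit (again, nothing like the real model): here each training sentence comes paired with a list of object labels standing in for what a vision system might detect in the image, and the guesser gets to use that visual evidence too. All examples and labels are invented for illustration.

```python
from collections import Counter

# Toy (image objects, caption) pairs, made up for this sketch. The object
# lists stand in for features a vision system would extract from a photo.
examples = [
    (["giraffe", "tree", "dirt"], "a giraffe standing on dirt near a tree"),
    (["dog", "grass"], "a dog lying on green grass"),
    (["giraffe", "grass"], "a giraffe eating leaves above the grass"),
    (["dog", "dirt"], "a dog digging in the dirt"),
]

# Count how often each word co-occurs with both text context and image objects.
evidence = {}
for objects, sentence in examples:
    words = sentence.split()
    for i, target in enumerate(words):
        ctx = words[:i] + words[i + 1:] + objects  # text + visual evidence
        evidence.setdefault(target, Counter()).update(ctx)

def fill_in_blank_with_image(objects, masked_sentence):
    """Guess the "___" word using the sentence AND the detected objects."""
    ctx = [w for w in masked_sentence.split() if w != "___"] + objects
    candidates = [w for w in evidence if w not in masked_sentence.split()]
    return max(candidates, key=lambda c: sum(evidence[c][w] for w in ctx))

print(fill_in_blank_with_image(
    ["giraffe", "tree", "dirt"], "a ___ standing on dirt near a tree"))  # giraffe
print(fill_in_blank_with_image(
    ["dog", "grass"], "a ___ lying on green grass"))                     # dog
```

The point of the sketch is the direction of the arrow: training links words to visual evidence, so the model ends up with some notion of what the words refer to. The new research essentially asks whether that arrow can be run in reverse, producing the image from the words.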
To test this, the researchers fed the AI some words and asked it to spit out images. A bit like asking a child to draw something from memory to see if they really know what it is. And the results were, well, pretty horrifying. This is supposed to be “A giraffe standing on dirt ground near a tree”.
The issue was one of context. To us humans, there are a lot of implications that we just assume given this caption. We know roughly what a giraffe looks like, we know what colour dirt is, and what dirt ground might look like. Chances are, though, most of us will be imagining different trees in our heads, depending on where we are in the world and what is common to where we live.
The trick with this new research was to see if it could teach the machine to figure out all the implicit visual information and context that our brains take for granted. And while the results still aren’t perfect, we can certainly see how the AI came to the conclusions it did, and the direction it was heading.
While it still has some learning to do, it’s quite frightening just how far the technology has come and how close it’s getting to figuring out what things look like when given just a brief description. In a few more years, it might become impossible to distinguish AI-generated images from real photographs.
[via MIT Technology Review]