Exactly a year ago, I published my proven ChatGPT formula to achieve consistent great ‘photos' with AI (Midjourney). It has helped me produce many very good-looking, realistic images. But with the pace at which improvements are released, does the formula stand the test of time? Or is it due an update? This is what we aim to find out today.
Before we dive in, as I did in the original post, I want to clarify that whatever you achieve with this formula is not a photo: it's an image. And I don't highlight this difference only for semantics.
I hope you will not use the formulas I publish on this blog to try to fool everyone into thinking you have taken a photo. Perhaps you can get away with it on social media and avoid the backlash of being spotted, but you won't land a job at National Geographic with it.
My goal with all I share here is simply to help you create beautiful images you can enjoy. Please refrain from using this to cheat.
And with this disclaimer out of the way, let's begin.
A Different AI Scenario
As already mentioned, my formula helped me consistently achieve great ‘photos' with AI. However, the scenario was quite different a year ago. ChatGPT was still an unchallenged force in the world of large language models (LLMs), while Midjourney faced some stiff competition in image generation.
Today, their reality seems to be the exact opposite, or at least that's how I feel about it. Midjourney reigns supreme (quality-wise, at least), while ChatGPT's top spot is challenged daily.
Because of this shift, I will examine how many different language models respond to the formula and then let Midjourney do its magic.
The Formula
So, let's start by seeing if different LLMs output different prompts. I will keep using the free versions for this test, to keep this usable for everyone. For now, I will be using the same set of instructions and ask for a single result:
A photograph of a [subject, one or more] [engaged in an action scene] with [background context] during [time of day] with [type of lighting] and shot with a [type of camera and lens: brand, focal length and aperture] using [type of composition] and captured on a [type of film or film simulation]
All the details about this formula are written in last year's article, so I won't write them down again.
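To make the formula easier to reuse, you can treat it as a simple string template and fill in the slots programmatically. Here is a minimal Python sketch; the field names and the example values are my own illustrations, not part of the original formula:

```python
# A minimal sketch of the prompt formula as a reusable template.
# Field names and example values are illustrative assumptions.
FORMULA = (
    "A photograph of a {subject} {action} with {background} during "
    "{time_of_day} with {lighting} and shot with a {camera_and_lens} "
    "using {composition} and captured on a {film}"
)

def build_prompt(**fields: str) -> str:
    """Fill every slot of the formula with the given field values."""
    return FORMULA.format(**fields)

prompt = build_prompt(
    subject="lone fisherman",
    action="casting his net from a wooden boat",
    background="a misty lake at the foot of the mountains",
    time_of_day="sunrise",
    lighting="soft golden light",
    camera_and_lens="Nikon D850 and a 24-70mm f/2.8 lens",
    composition="rule of thirds composition",
    film="Kodak Portra 400 film simulation",
)
print(prompt)
```

This also makes the later experiments trivial: to lock one variable across a whole batch (Batman as the subject, say), you just pass the same value every time and vary the rest.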
I'm not going to digress into each LLM's offer because that is already extensively covered online, and it would take way too long (and a few different posts). Here, I am sticking to the purpose of this post, which is to provide you with a proven formula to achieve consistent great images with AI generation.
By the way, please mention this blog as your source if you re-use the formula or present it in your social media content… Let me know if I was able to help you!
Now, before we start exploring the different LLMs, let's talk about Midjourney's new version.
Midjourney v6
A few months ago, Midjourney released version 6. One of the most exciting new features is the ability to incorporate text directly into your generated images, which may play a part in our image generation if we have signs, billboards, newspapers, etc.
Version 6 also focuses on generating images with:
- better coherence and consistency;
- smoother transitions between elements;
- more logical compositions;
- a stronger sense of unity throughout the artwork;
- greater visual diversity, exploring a wider range of styles and techniques.
More importantly for our test, Midjourney v6 has improved its prompt following and can handle longer prompts. When I first created the formula, my outputs exceeded the maximum allowed length (72 tokens, roughly 40+ words), though they still produced great images. Now, the limit has been pushed almost ninefold higher (350+ words).
In theory, we should avoid complex prompts. The longer they are, the more tasks Midjourney needs to perform and the more relationships between keywords it has to work out (giving each keyword less statistical weight).
But it has never been an issue with the formula. And even less now with v6.
An Even Playing Field?
There are also new prompting methods. You can now use hashtags, brackets, and slashes to modify your prompts and get different effects (still in an Alpha stage). This should work well with AI responses that include brackets, particularly Gemini's. In fact, Google seems to keep its language model updated with the most recent resources, at least more than its competitors do (some of which are actually disconnected from the live Internet).
You may think Google has an unfair advantage in this regard, but the same should be true of Microsoft, and yet you will see that its model failed once.
Testing LLMs
OK, so let's dive in! I asked each LLM the same question: “Please give me 1 example for a Midjourney prompt based on this formula; they can be any genre. Here is the formula [formula as above]”.
What follows are the AI answers and the corresponding images generated with Midjourney v6.
I am also interested in seeing what kind of scenario the AI “thinks” of when reading this formula for the first time. I am not going to provide extra inputs, like a photography genre, for example. Will the AI give me a wide-angle landscape, or will it get up close and personal? How much context can it create without me adding data?
To test with a specific set of data, I could simply change one or more variables. Say I wanted Batman in all images, I would swap [subject, one or more] for Batman. If I wanted to focus on Street Photography, I would specify that. And maybe this will be material for a future article…
Anyway, it's time to look at some prompts and images. All of the pictures are generated in a 2:1 aspect ratio (using the --ar 2:1 parameter).
OpenAI ChatGPT
Of course, we begin with the model we first used. A year on, ChatGPT 3.5 is still the base engine for the free version. But now, you get a limited daily number of requests you can pass to the new ChatGPT 4o. Let's try both to see if there is any difference.
Tip: while you are using your 4o daily allowance, you can switch to 3.5 to ask for a different result. Just click on the thunder icon on the bottom left of the reply.
3.5
A photograph of a bustling street market in Marrakech, with vendors haggling over fragrant spices amidst a labyrinth of vibrant stalls during the early morning hours, illuminated by the soft glow of dawn, shot with a Fujifilm X-T4 and a 23mm f/1.4 lens using a street photography composition and captured on Fujifilm Classic Chrome film simulation.
You can probably tell how well these depict Marrakech. ChatGPT's precise context makes a difference in the final image.
4o
A photograph of a street artist spray-painting a mural on a gritty alley wall in downtown Tokyo during nighttime with vibrant, neon lighting and shot with a Sony A7R IV and 35mm f/1.4 lens using a wide-angle, rule-of-thirds composition and captured on a bold, high-contrast black and white film.
Annoyingly, one of the images doesn't abide by our instructions and is generated in colour. But you can see from the signs that, while they mean nothing, they make us imagine this is Japan.
At first glance, there doesn't seem to be much difference in the prompts. Maybe 3.5 gives more details, which is surprising, but they're both similar.
Photography-wise, 4o uses film on a digital camera, where film simulation would have been more correct.
On the other hand, 3.5 seems to understand the Fujifilm ecosystem very well and recommends using a lens well suited for Street Photography.
But in the end, these are merely instructions to create consistent, great ‘photos' with AI. And I think that using this combination of ChatGPT and Midjourney still works.
Anthropic's Claude AI
Next, let's have a look at Claude AI from Anthropic, arguably the first LLM to rival ChatGPT. When asking Claude AI to use the formula, this is its answer:
A photograph of a professional skateboarder [subject] [engaged in an action scene] performing a mid-air trick, with [background context] a gritty urban landscape in the background during [time of day] the golden hour with [type of lighting] warm, diffused lighting, and shot with a [type of camera and lens] Sony Alpha 7R IV with a 24-70mm f/2.8 lens at f/4 using [type of composition] a low-angle, rule-of-thirds composition and captured on a [type of film or film simulation] Kodak Portra 400 film simulation.
Annoyingly, Claude AI writes our bracketed variables into the answer. The bracketed text comes straight from the formula and is too generic to make a difference with the new enhanced prompting in Midjourney v6. So, we will have to clean this up before we can generate the images.
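If you end up doing this cleanup often, it can be automated. Here is a quick sketch that strips the leftover bracketed formula labels from an answer like Claude's (it assumes the labels never contain nested brackets, and it would also strip brackets you had added deliberately for Midjourney's new bracket syntax, so only run it on raw LLM answers):

```python
import re

def strip_formula_labels(prompt: str) -> str:
    """Remove [bracketed] formula labels left in an LLM answer and tidy spacing."""
    cleaned = re.sub(r"\[[^\]]*\]\s*", "", prompt)  # drop every [...] label
    return re.sub(r"\s+", " ", cleaned).strip()     # collapse leftover whitespace

answer = ("A photograph of a professional skateboarder [subject] "
          "[engaged in an action scene] performing a mid-air trick")
print(strip_formula_labels(answer))
# -> A photograph of a professional skateboarder performing a mid-air trick
```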
I really like the fourth one (bottom right). Again, the Sony A7R IV is mentioned, which we know is quite popular.
Perhaps Midjourney should have fully utilised the 24-70mm zoom capabilities, but it's a start.
Currently, Claude AI, at least in its free version, does not save your conversations. So, you will have to restart every time you refresh the page or visit again. This may change, eventually.
Google Gemini
Now, let's have a look at how Gemini, the AI from Google, responds.
A photograph of a trench coat-clad detective (male) [smoking a cigarette] hunched over a dead body sprawled on the rain-slicked pavement of a deserted city street. Neon signs cast harsh shadows from towering skyscrapers in the background. The scene is shrouded in fog during the early morning hours with only a single street lamp providing light. Shot with a wide-angle lens (Leica M3 with a 35mm Summicron lens, f/5.6) using a leading lines composition to draw the eye towards the body. Captured on digital with a classic black and white film simulation emphasizing grain and contrast.
Gemini produced a longer and more descriptive prompt, which is in line with what Midjourney now expects. I assume this is due to having direct and updated access to online resources (as expected from Google).
Aside from the brackets, it feels more natural and creative, as if it were telling a story. However, some details are incoherent, like having “neon signs cast harsh shadows” together with “only a single street lamp providing light”.
Enhanced prompting test
This is a good time to test the new prompting methods in the Alpha stage. If we modify the first sentence, we can try to get different results using a single prompt.
Like so: A photograph of a trench coat-clad detective (male) [smoking a cigarette/writing notes/holding an umbrella/with a gun in his left hand] …
You can see why this is still in the Alpha stage. The variations produced some differences in the images, but they did not follow our instructions (and appear in a different order). The gun ended up in the right hand, the umbrella is held by someone else, the whiskey glass he was drinking from has disappeared, and the notebook is nowhere to be seen.
But this has lots of potential, so let's hope the Midjourney developers will refine the feature.
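Until the feature matures, a reliable workaround is to expand the variations yourself and submit one prompt per option. A minimal sketch (the placeholder name and helper are my own, not a Midjourney feature):

```python
def expand_variations(template: str, placeholder: str, options: list[str]) -> list[str]:
    """Return one complete prompt per option, substituted into the template."""
    return [template.replace(placeholder, option) for option in options]

base = ("A photograph of a trench coat-clad detective (male) {ACTION} "
        "hunched over a dead body on rain-slicked pavement")

prompts = expand_variations(base, "{ACTION}",
                            ["smoking a cigarette", "writing notes",
                             "holding an umbrella", "with a gun in his left hand"])
for p in prompts:
    print(p)  # paste each into Midjourney as a separate generation
```

You lose the convenience of a single prompt, but each variation is now guaranteed to reach Midjourney exactly as written.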
As per our ‘consistent great photos with AI' task… They're pretty good, right?
The red accent colour feels a lot like Sin City, so have we just found out one source used to train the model?
Microsoft Copilot
So, let's see Microsoft…
With Copilot, you have access to GPT-4 without paying for a subscription, which is probably the main reason to choose this model.
Out of the box, you can choose between 3 different conversation styles: Creative, Balanced, or Precise. So, let's see how each performs.
Creative
A photograph of a group of friends laughing engaged in an action scene of throwing colorful powder at each other with a festival atmosphere in the background during midday with bright, natural sunlight and shot with a Nikon D850 and an 85mm f/1.4 lens using a centered composition and captured on a Fujifilm Velvia 50 film simulation.
Balanced
- Style: Epic fantasy painting
- Subject: A dragon and a knight
- Setting: A mist-covered mountaintop
- Composition: Dramatic close-up
- Lighting: Moonlight piercing through clouds
- Additional Info: Ancient ruins in the background
- Prompt: “An intense battle between a fierce dragon and a valiant knight atop a mist-shrouded mountain peak under the moonlight, with crumbling ruins in the distance.”
Precise
A photograph of a mother and child sharing a moment of laughter in their rustic kitchen during late afternoon with warm, natural lighting streaming in from the window. The shot was taken with a Sony Alpha a7 III and an 85mm f/1.8 lens using a frame within a frame composition and captured on a Ilford HP5 Plus 400 film simulation.
Now, I don't know how the “Balanced” version decided to go so far off-piste. The list it returned as an answer is unusable for our purposes.
But the other 2 examples seem totally fine and in line with the rest.
Also, the number of answers you can generate seems to be limited (remember, I'm using the free version), so you may need to restart after a while.
OK, there's a lot to process here. To begin with, these indeed replicate wide-aperture 85mm shots (you'd probably capture the mother-child from a longer distance). The composition and light are spot-on.
The Nikon D850 is also a favourite among portrait photographers, so it made sense in the prompt. The Sony is not the best for skin tones, so you would mask them with monochrome (though I doubt it's Ilford HP5-like).
What is really evident here is how good Midjourney has become with portraits. Though looking at the Precise style generation, it may still need to work on ethnic diversity.
Perplexity AI
I access Perplexity mostly on the go from its iPhone app, but it is also available on desktop browsers. I think the answer it gives is very interesting:
Prompt: A photograph of a couple walking hand in hand with a graffiti-covered alleyway in the background during the late afternoon with natural, diffused lighting and shot with a Canon RF 50mm f/1.8 STM lens using leading lines composition and captured on Kodak Portra 400 film.
Explanation: This prompt highlights a candid moment of a couple in an urban setting, with the graffiti adding a vibrant, artistic element. The late afternoon lighting provides a natural, soft illumination, and the Canon RF 50mm f/1.8 STM lens is perfect for capturing intimate, detailed shots. The leading lines composition draws the viewer's eye through the image, and the Kodak Portra 400 film is known for its excellent color reproduction and fine grain, enhancing the overall aesthetic.
I left the entire answer because it's peculiar: Perplexity adds the reasons why it gave its answer.
This is fascinating as an insight into how the AI worked its magic, but it also provides more context that could be used to refine the input into Midjourney.
Speaking of context, and based on my travels, these feel like photos taken in Toronto, don't they? Except maybe the third (bottom left) which reminds me more of Belfast.
Mistral AI
Lastly, let's have a look at the French LLM, Mistral. Lately, this has been in the news for its impressive capabilities, proving to be a very serious contender in the AI space.
A photograph of a group of mountain climbers suspended on a steep cliffside, with snow-capped peaks towering in the distance during the early morning, with soft golden light illuminating the scene, shot with a Nikon D850 and a 24-70mm f/2.8 lens using the rule of thirds composition and captured on a Kodak Portra 400 film simulation.
There is incredible detail in the landscape, that is for sure. But without providing specific input, would I choose a scene like this as my default AI-generated image? I don't know. But this is the only AI that chose a natural landscape out of the box rather than a city.
Consistent Great Photos with AI? Let's Wrap This Up
Each of the AI Large Language Models seems to perform the requested task very well. With one exception: the Microsoft Copilot “Balanced” style. So, in terms of passing the test of time, the formula to generate consistent great photos with AI seems to have fared well.
Ultimately, we want to find which of the tested models gives us the best prompts to produce consistent great photos with AI.
How Did Each LLM Perform?
Only ChatGPT gave us context in the form of a well-known city, meaning the resulting image should feel more familiar to the viewer. In general, ChatGPT seems to favour context over subject when crafting the prompt. So, this might be something to keep in mind if we want to achieve that.
Google Gemini seems to be using the very latest resources available from Midjourney to craft the prompt: more descriptive, using brackets, etc. Midjourney v6 wants prompts that feel more natural, as if you were conversing with it. So, as things stand, this may now be the preferred choice.
Microsoft Copilot went very personal, putting people and emotions front and centre (when it worked). It's the only one that decided to go so intimate on its very first attempt. Perhaps if you want to create portraits, this may be the way to go?
Claude AI is unusable for this specific purpose: it writes a prompt that requires further polishing, and its output is not so mind-blowingly good as to be worth the hassle.
Perplexity adds a full explanation of how it made its choice. And Mistral seems to prefer landscape as its first output.
How Was Their Output, From a Photographer's Perspective?
In terms of gear, all the AI models seem to share favourites: the Sony A7 series and the Nikon D850 are definitely the most popular cameras, with Fujifilm and Kodak the favourites as film simulations. There seems to be more variety with lenses, though a wide aperture is a constant (and Midjourney handles depth of field well).
The composition and light varied every time, which was great because it allowed us to produce more creative outputs.
I noticed that the subjects tend to be positioned in the left half of the image and look or move towards the right. It would be interesting to know why that is and if this has anything to do with the developers being a majority of left-to-right readers.
Interestingly (or annoyingly), Midjourney doesn't always give us four straight black-and-white images when instructed to do so. In the other tests, it seemed to do a better job when the prompt asked for “black-and-white film” rather than “black-and-white film simulation”. Maybe it's something to consider.
Is There a Winner?
There is nothing wrong with any of the AI models except for one hallucination. But to me, there are two that stand out: Copilot for the choice of going more intimate and Gemini for the prompt crafting. And since our goal is always to find the best tool for the job, I'd say Gemini comes out as our winner today because it gives us the best, up-to-date, prompt with our formula.
So…
Now, the initial question was, “can we still create consistent great photos with AI” in the new 2024 scenario, using the proven formula? I suppose we can.
Does the formula need an update? Perhaps not just yet. But when the new Midjourney improvements exit the Alpha stage, it probably will. So come back here to see where this takes us.
In the meantime, here is my favourite image from this test, produced by the Copilot-Midjourney combo.
The runners-up are Claude (for its use of low-angle composition, though the hand and face details are poor) and Gemini (for the storytelling and cinematic vibe).
So, here we are at the end. Please share your thoughts. And let me know in the comments (further down on this page) if you would like me to elaborate further on the topic.
Cheers!