Yuehao Gao
Assignment 1
10/01/2024
----------------------------------------------------------------------
----------------------------------------------------------------------
Prompt 1: "A Subaru WRX towing the Titanic Ship behind it using a thick iron anchor chain, on the surface of the ocean, in huge waves and stormy rain. The wheels of the WRX are splashing waters behind it. The picture should be in an artistic brush-painting style."
--------------------------
Image 1A
Model: DALL-E 3
Result:
1. To what degree does your text query influence the generated image?
On a scale of 1-10, the text query influenced the generated image at an approximate level of 6.
It is fair to say that most of the elements specified in the prompt appear in the image, including the WRX, the ship, the stormy weather, the waves, and the anchor chain. However, nearly all of them are arranged in an unreasonable or incorrect way: the car is speeding towards the ship rather than facing away and towing it; the chain does not act as a tow rope but merely stretches alongside the car and the ship; and the Titanic has five funnels rather than four. The style also does not adhere to the "brush-painting style" specified in the prompt. In other words, the picture reads as an assembly of the requested elements rather than the accurate artwork that the prompt describes.
2. What is the style of the image, and why do you think it has produced that?
The prompt specified a "brush-painting" style. While the clouds in the sky and the funnels and hull of the Titanic have a slight brush-painted feel, other elements, especially the car and the ocean waves, look closer to a realistic style.
3. Any thoughts about how the visual elements in the image are organized
DALL-E 3 seems to interpret the prompt element by element, so it generates the visual elements more or less independently of one another. Each visual element makes sense on its own, but its relationships with the other elements are not as accurate as the prompt specified. For instance, the WRX looks as if it is about to run into the Titanic rather than towing it forward.
4. How would you change the query?
I would add more details describing some of the elements and their positions, such as "a thick, rusted iron anchor chain over a raging ocean", "The WRX’s wheels throw up arcs of water, struggling against the powerful current", and "the Titanic looms behind". In particular, I would specify the style for the whole picture with the words "The entire scene is captured in a bold, expressive brush-painting style, with sharp, dynamic strokes giving the stormy sky and the water a sense of chaos and movement".
5. Any other comments?
In general, DALL-E 3 appears to know what each element should look like, but there may not be enough visual data for the model to imagine what "a race car towing a ship on the ocean surface" should look like, which keeps it from generating a precise image.
6. On a scale of 5 - from 5 being GREAT to 1 being LOW your rating of the result
Based on all the analysis given above, I would personally give a 3 for the result.
--------------------------
Image 1B
Model: Midjourney
Result:
1. To what degree does your text query influence the generated image?
On a scale of 1-10, the text query influenced the generated image at an approximate level of 8.
Compared to DALL-E 3, Midjourney clearly understands each element of the prompt better, especially the position of each element and the overall style and texture of the picture. For instance, it renders a clear and precise front end of the WRX and the body of the Titanic, together with the waves and the water splashed up by the wheels. Although the car and the ship are positioned in a way that correctly conveys a "towing" relationship, the chain at the side still seems to come out of nowhere. Nevertheless, the text query influenced the generated image more with this model.
2. What is the style of the image, and why do you think it has produced that?
The prompt specified a "brush-painting" style. Overall, the picture has something of the texture of an oil painting, especially in the ocean surface and the raindrops.
3. Any thoughts about how the visual elements in the image are organized
The model probably understood the prompt as a whole, so it organized the elements in the picture reasonably. Specifically, across all four variations it generated, the model seems to have prioritized putting the WRX in the very middle of the picture with everything else behind it, perhaps because the WRX is the first subject that appears in the prompt.
4. How would you change the query?
While almost everything looks right in this picture, I would specify more about the chain, such as "a rusty, thick anchor chain pulled straight with great tension between the WRX and the front of the ship", and add details like "a Rally-blue WRX with a strong feeling of power" and "a wrecked, giant Titanic ship with a strong vibe of history".
5. Any other comments?
Compared to DALL-E 3, Midjourney clearly understood the logic of the prompt better, perhaps helped by a dataset collected from Discord user inputs, despite some flies in the ointment such as the position of the chain.
6. On a scale of 5 - from 5 being GREAT to 1 being LOW your rating of the result
Based on all the analysis given above, I would personally give a 4 for the result.
----------------------------------------------------------------------
----------------------------------------------------------------------
Prompt 2: "A creative stream of flow bursts out of the top of a classical, old, wooden straight piano, while the flow consists of sparkling, high-tech musical notes glowing bright lights. The entire picture should be in the style of contemporary kinetic art."
--------------------------
Image 2A
Model: DALL-E 3
Result:
1. To what degree does your text query influence the generated image?
On a scale of 1-10, the text query influenced the generated image at an approximate level of 8.
For this prompt, the more detailed description "classical, old, wooden" clearly led the model to render the piano very precisely. The notes also match the prompt, as they are "sparkling" and "bright" enough. Still, specifying that the stream should come from the top of the piano had little effect, as the model understood it as "the top of the piano keyboard". Meanwhile, the word "high-tech" did not make the notes look like digital chips or components as expected.
2. What is the style of the image, and why do you think it has produced that?
The prompt has specified the style to be "kinetic art". Indeed, the model captured this and made the stream of notes look dynamic and energetic enough.
3. Any thoughts about how the visual elements in the image are organized
The generated picture prioritizes the flow of notes over the piano: the flow occupies almost 75% of the picture, which is reasonable since the prompt implicitly emphasizes it as the major element.
4. How would you change the query?
I might change the query by adding more specifications to the word "high-tech", like crystal-textured or neon-glowing.
5. Any other comments?
Compared to the first prompt, DALL-E 3 clearly has a much more precise understanding of this one. Still, the model seems to lack some sort of "combination imagination", since it did not manage to imagine how musical notes could be "high-tech" or where "the top of the piano" is.
6. On a scale of 5 - from 5 being GREAT to 1 being LOW your rating of the result
Based on all the analysis given above, I would personally give a 4 for the result.
--------------------------
Image 2B
Model: Midjourney
Result:
1. To what degree does your text query influence the generated image?
On a scale of 1-10, the text query influenced the generated image at an approximate level of 8 as well.
Compared to DALL-E's interpretation of this prompt, Midjourney did some things better: the notes shine even brighter, the piano looks "older" thanks to the dark marks on its body, and some of the musical notes floating in the air look more "creative" in their shapes and design. Additionally, the flow does seem to come from the top of the piano. However, the prompt failed to make Midjourney understand what "a stream of flow" is, as the notes are scattered everywhere in the picture. Like DALL-E, it also fails to imagine how notes could be "high-tech". Even so, it generated a great result overall.
2. What is the style of the image, and why do you think it has produced that?
I would consider Midjourney to interpret "kinetic art" at about the same level as DALL-E 3 for this prompt.
3. Any thoughts about how the visual elements in the image are organized
In this picture, the piano is placed on the left side of the canvas, taking up approximately 3/5 of the space. The stream of notes is drawn in a layer above the piano, which suggests that this model also treats it as the major element.
4. How would you change the query?
As with the DALL-E version of this picture, I would specify more about what makes the musical notes look "high-tech", such as their possible colors, textures, and the technologies involved.
5. Any other comments?
Overall, the model captured nearly every detail given in the prompt, despite some flaws such as misunderstanding what a "stream" should look like.
6. On a scale of 5 - from 5 being GREAT to 1 being LOW your rating of the result
Based on all the analysis given above, I would personally give a 4 for the result as well.
----------------------------------------------------------------------
----------------------------------------------------------------------
Prompt 3: (Prompt 2 changed from "kinetic art" to "abstract-impressionist art"): "A creative stream of flow bursts out of the top of a classical, old, wooden straight piano, while the flow consists of sparkling, high-tech musical notes glowing bright lights. The entire picture should be in the style of contemporary abstract-impressionist art."
--------------------------
Image 3A
Model: DALL-E 3
Result:
1. To what degree does your text query influence the generated image?
On a scale of 1-10, the text query influenced the generated image at an approximate level of 4.
This is an experiment I wanted to run on both models: changing the style of the picture to something that human artists might be more adept at, namely abstract art. Although the prompt specified the artistic style, DALL-E generated a picture very similar to the one for Prompt 2 and showed very little "abstract art" character, as everything still looks very realistic.
2. What is the style of the image, and why do you think it has produced that?
Although the style was specified as "abstract-impressionist art", which is supposed to look something like random, distributed lines and shapes on the canvas, the generated piano and room setting still look very realistic.
3. Any thoughts about how the visual elements in the image are organized
As in Midjourney's interpretation of the previous prompt, the piano is placed on the left side of the canvas, and it occupies about 75% of the space. The notes are in a layer above the piano. In this picture, the camera seems to be zoomed in further to focus on the central part of each element, so the viewer feels closer to the elements shown.
4. How would you change the query?
As the aim is to generate a piece of abstract-impressionist art, I would specify less about the piano and the stream of notes. Instead, I might add more description of the overall style, like "mainly consisting of random-feeling lines and color shapes on a completely white canvas as the background."
5. Any other comments?
In general, it seems that DALL-E does not really understand what abstract-impressionist art is, at least judging from the result of this prompt.
6. On a scale of 5 - from 5 being GREAT to 1 being LOW your rating of the result
Based on all the analysis given above, I would personally give a 2 for the result.
--------------------------
Image 3B
Model: Midjourney
Result:
1. To what degree does your text query influence the generated image?
On a scale of 1-10, the text query influenced the generated image at an approximate level of 5.
As with the DALL-E generation of this picture, one possible reason for the weak influence is that a similar prompt was entered right before this request, so the model may simply have used the previous picture as a reference and generated a variation of it. The result looks very similar to the picture generated for Prompt 2, except that it contains three streams instead of one. The original misreading of a "flow" as scattered, distributed notes does give a slight sense of "impressionist art", which suggests that the query had some influence on the generated image. Still, the model apparently needs more training samples labeled as "abstract-impressionist" art.
2. What is the style of the image, and why do you think it has produced that?
I would consider Midjourney to interpret "abstract-impressionist art" at about the same level as DALL-E for this prompt.
3. Any thoughts about how the visual elements in the image are organized
The piano is placed in the bottom-middle part of the picture this time, similar to how DALL-E treated Prompt 2. What is different about the organization of this picture, however, is that Midjourney placed most of the "creative, high-tech musical notes" above the piano, with a minimal amount of overlap.
4. How would you change the query?
As with the query for DALL-E, I would shift the focus from the elements themselves to the style and drawing techniques of abstract-impressionist pictures.
5. Any other comments?
On the one hand, although Midjourney does not seem to understand what "abstract-impressionist art" should look like either, it does a slightly better job than DALL-E on this task overall. On the other hand, it is possible that misreading "one stream" as scattered, distributed notes simply made the picture "hit the mark by a fluke".
6. On a scale of 5 - from 5 being GREAT to 1 being LOW your rating of the result
Based on all the analysis given above, I would personally give a 2.5 for the result, for its slightly closer interpretation of "impressionist art".