Media Arts and Technology

Posted: **Fri Sep 16, 2022 8:03 am**

wk04 - MidJourney 3rd + Analysis Report

This is the final posting for MidJourney as we are moving on to Stable Diffusion. Please post new images of consequence, describing your approach, and also add an Analysis Report, written directly into the Student Forum (if it's a PDF, we then have to download it).

The Analysis Report should give an insight into your observations of how the software works. For instance, I am noticing that there is always a frontal point of view, that with multiple iterations a gothic, stylized painting aesthetic takes over, and that variations sometimes repeat. Also, out of all of the images, I have managed to save only one at the high resolution of 2048 x 1152; most are at a lesser scale.

Posted: **Thu Oct 20, 2022 1:23 pm**

Following up on previous weeks' project, I continued exploring Midjourney's ability to "tell" a story through its image composition, or how we as users/creators are able to tell a story with the help of Midjourney. As I mentioned last week, if we imagine telling a story in cinematic terms, we should often start our "film" with an establishing shot (establishing shots in filmmaking set up, or establish, the context for a scene by showing the relationship between its important figures and objects; they are typically wide or extreme wide shots of buildings or landscapes). Then the film will often cut to a close-up of an object/figure in order to introduce it. This time, I tried the same prompts in my personal channel that I had used in the general channel, and I am wondering what the learning model of the Midjourney AI is. What accounts for the difference between my two attempts at different times?

Prompt: "Wangechi Mutu, low skyscrapers, buildings carpeted with grass and smiley-faced flowers, pollens, animations from the early 1900s, close up"
This is what I got from my own channel:

And this is what I got last time from the general channel:

They are of course very different--you would not get the same image twice through Midjourney, or through any kind of text-to-image generator, I believe. And this time it didn't generate the aesthetics I liked last time (image 3). However, there are still many similarities: the style of colored-pencil drawing remains--largely thanks to the "animation" prompt; the color palette is similar, and there are similar eye/flower objects in the generated images--because of the Wangechi Mutu prompt, I assume. What I found more interesting was that the new generation picked up "pollens" more literally, and I'm wondering whether that is because the model has learnt from the variations I did last week or whether it picks up words in the prompt in a random pattern every time.

However, I was looking for images that focused more on representing the influence of pollen in an urban environment. After a couple of variations (of the second image of the first generation), I finally got something like this:

A "pollen storm" emerged in the first image of this generation and a variation of this image gave me something like this:

The pollen effect became even more obvious which I really appreciated. As Jack would say: variations are the key for Midjourney creation.

Then I tried with another prompt, adding "storm" into the original one I had:
"Wangechi Mutu, afrofuturistic skyscrapers, buildings carpeted with grass and smiley-faced flowers, grass pollens, storm, animations from the early 1900s, close up"
Then I got something like this:

This generation clearly indicated that the Midjourney AI wasn't quite sure which words should be weighted most, so it produced four images with different weight distributions: the first one (which I liked the most) picked up pollen, storm, and skyscrapers in a way closest to my imagination; the second one picked up storm more literally; the third one focused only on the flowers and pollens (this form appeared quite frequently in all my generations, and my assumption is that it picked up "close up" more literally); the fourth one showcased the oversized flowers, reimagining the botanical subjects as architecture. Next, I did variations of the first image of the pollen storm:

The variations closely followed the form and the composition of the original image, with the two pollen balls floating in the sky. Is that what a storm would look like in a futuristic plant-based world?

I then turned to making images of the main character of the story: Anwulli, a pregnant woman dependent upon her home robot Obi-3. Imagining a robot home is really difficult for Midjourney--it often only picks up "robot" in the prompt:
Prompt "Wangechi Mutu, a pregnant woman, a robot home, pollens, animations from the early 1900s, close up"

Similarly, the AI produced four different images with different attention to specific parts of the prompt. Apparently, the AI gave more weight to the word "robot" specifically, and the aesthetics of what a robot could look like became the dominating style of the images. Trying to get rid of the "robot" aesthetics, I tried another prompt:
"Wangechi Mutu, a pregnant woman in a smart home, pollens, animations from the early 1900s, close up"

From this generation, I found it interesting to think about how much an AI can understand prepositions in the prompt. For the image here, the preposition "in" was essentially not picked up by the AI. How, then, should we manipulate the composition and the spatial/orientational relationship between different figures in an image? I then tried this prompt:
"Wangechi Mutu, low skyscrapers carpeted with grass and smiley-faced flowers, a pregnant woman, grass pollens, machine home, animations from the early 1900s, middle shot"

The spatial relationship was evidently determined simply by the juxtaposition of words in the prompt. My assumption is that prepositions are automatically given lower weight, and that the lack of spatial/orientational cues, or of the ability to pick up those cues, causes the lack of control over POV/composition in Midjourney-generated images.

Posted: **Fri Oct 21, 2022 4:50 pm**

Midjourney Usage Report

If you are looking to execute a well-defined idea, do not use Midjourney. If you desire control in the way common to editing tools such as Photoshop, Grasshopper, or Blender, do not use Midjourney. Midjourney is a nondeterministic text-to-image generator. The training data are pairs of text and image. For the majority of (image,text) pairs, I assume the image to be scraped from the internet, with the text being the “alt text” corresponding to the scraped image. My assumption largely derives from how Stable Diffusion works, as that is the only mainstream text-to-image generator that is open source. See the following article for more details on the data set of Stable Diffusion: https://waxy.org/2022/08/exploring-12-m ... generator/. One interesting thing to note is that a large share of the (image,text) pairs derive from stock photo websites, online shops, Pinterest, and other image-centered hobby sites.

Here are a couple examples of some images and their corresponding alt-text (For a larger list of examples, poke through here: https://laion-aesthetic.datasette.io/la ... pls/images):

Alt Text: “Halo The Fall Of Reach Is Coming Entertainment Focus”

Alt Text: “Red alert: Olivia Wilde showed off her shape in a red dress as she attended the Golden Globes beside her fiancé, Jason Sudeikis, in Beverly Hills on Sunday”

Alt Text: “View of a ‘peace wall’ built between Catholic and Protestant neighborhoods to prevent violence”

The key takeaway for me is how high-level these descriptors are. The first example has no low-level descriptors of the geometries and colors, but instead advertises the IP corresponding to the image. In the second example, there are again no low-level descriptors of the geometries and colors, just the names of the celebrities. The final example gives a purely historical description, saying nothing about the geometries and colors. The salient feature of text-to-image generators comes from associating high-level descriptors (text) with an image. This is powerful, but quite different from how an artist usually works. Many artists work in a reductive manner, breaking an image down into a series of abstract geometries, perspectives of those geometries, and colors. Artists compose images with these abstractions in mind. Painters cope with complexity by, for example, breaking down a face into a series of ovals and curves. Abstracting out the cultural aspects of the image reduces unnecessary details, allowing the artist to feasibly translate the face into some hand-painted image. The painting can be further refined to add back in or modify the semantic aspects of the face. This is a bottom-up process.

Text-to-image generators are more top-down in nature. The text pairs of the training data don’t describe ovals and curves, but a person. Thus, when giving text input, it’s best to start with text semantically relevant to a layman. Do not attempt to control perspectives directly; think about what culturally relevant images are associated with particular perspectives. For example, if you would like a “top-down” angle of some scene, don’t ask for a “top-down” angle, but instead something like “drone photography” or “selfie”. Use a high-level descriptor that is stereotypically “top-down”. Instead of reducing your desired image into a series of geometries, reduce it to a series of cultural stereotypes. Call this “cultural reduction”, in contrast to “geometric reduction”. The art of text-to-image generation is the composition of image stereotypes into some novel image. It pays to understand a wide variety of stereotypes within our internet culture.

Text: black metal; michael bay's transformers; super 8; 2pm; closeup of bodies

Text: children drawings of god; cave paintings; super 8; first person shooter

Text-to-image generation is a noisy process. In other words, it is nondeterministic, which makes it quite difficult to get a consistent result. From the perspective of traditional editing, this is an awful process for designing images, yet with a slight change in perspective and expectations it is powerful. Instead of optimizing for an exact image, we should optimize for a distribution of images. The distribution is, for lack of a better word, a vibe we are going for: a set of conditions that do not create one specific, perfect image, but instead generate an unlimited number of images that capture the essence of an idea.

How do we optimize our distribution? This is where the best aspect of Midjourney comes in: the “variation” feature. Given an output of four images from a text prompt, we can choose to further pursue one of the output images by generating four more images based on the chosen image. “Variation” uses an image from the last distribution, applies a bit of noise, and outputs four images from a slightly different distribution. If our initial text prompt gives us some initial distribution, iterative applications of “variation” give us a way to fine-tune this distribution using preference rather than language (if we would like to change our distribution more drastically, we should modify our text prompt). It allows us to explore pockets of the text-to-image generator's state space. We can stick to a cohesive aesthetic, but find results more pleasing to our eyes and tastes.
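To make the "fine-tune by preference" idea concrete, here is a toy sketch in Python. It is not how Midjourney is implemented (that is not public as far as I know); it only assumes an image can be stood in for by a short list of numbers, that "variation" adds a bit of Gaussian noise, and that the user's click can be modeled as picking the candidate closest to a hidden `taste` vector. The names `vary`, `distance`, and `refine` are made up for illustration.

```python
import random


def vary(parent, noise=0.1, count=4, rng=random):
    """Return `count` noisy copies of `parent` (a list of floats)."""
    return [[x + rng.gauss(0.0, noise) for x in parent] for _ in range(count)]


def distance(a, b):
    """Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def refine(start, taste, rounds=10, noise=0.1, seed=0):
    """Repeatedly generate four variations and keep the preferred one.

    `taste` is a hypothetical stand-in for what the user finds pleasing.
    """
    rng = random.Random(seed)
    current = list(start)
    for _ in range(rounds):
        candidates = vary(current, noise, 4, rng)
        # Keep the parent in the running: a user can ignore a bad batch
        # and ask for variations of the same image again.
        current = min(candidates + [current], key=lambda c: distance(c, taste))
    return current


if __name__ == "__main__":
    start = [0.0, 0.0, 0.0]
    taste = [1.0, 1.0, 1.0]
    result = refine(start, taste, rounds=20, noise=0.2)
    print(distance(start, taste), distance(result, taste))
```

Each round only nudges the current point slightly, so the result drifts toward the user's taste while staying near the region the original prompt opened up, which matches the "pockets of state space" picture above.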

An example:
Text: children drawings of god; cave paintings; overexposed

In conclusion, when designing text prompts, do not think in terms of geometric reduction, but in terms of cultural reduction. Think nondeterministically; think in terms of distributions. Precise control will fail; instead, build up methods for guiding.

Posted: **Tue Oct 25, 2022 11:44 am**

In the past three weeks, I explored MidJourney’s ability to generate three categories of images: realistic and conventional, abstract, and imaginative. I found that MidJourney is not the best tool for generating realistic and conventional images, since many fine details are missing from the resulting images, making them look unrealistic. For abstract images, MidJourney would come up with a decent “understanding” of the text query and generate images that surprised me. For imaginative image generation, MidJourney is capable of blending real and abstract and coming up with satisfying results. However, I also found the output difficult to control precisely with text descriptions.

Here are some of the example text queries I used:
Text query: starry sky viewed on the top of a volcano, photorealistic
Text query: starry night with galaxy that stretches to infinity, lava erupts out of a volcano, photorealistic
Text query: a man in red jacket skiing down the mountain in high speed, there are pine trees, photorealistic

Link to first week's images: viewtopic.php?f=86&t=363

In the first category, I was exploring MidJourney’s ability to generate natural scenes and humans. Starry nights and volcanoes are common things; everyone knows ahead of time what they would look like. From the results MidJourney generates, we can see that the software is able to make sense of the elements appearing in the text and put them together. However, the resulting images have a high noise level – there are random bright pixels scattered across both the natural scenes and the skier image. Moreover, fine details that would make the images photorealistic are missing. In my exploration, the boundary between different objects in the same image is very blurry, as evidenced by the sky-volcano boundary and the skier-snow boundary. It gives me the feeling that MidJourney first generates each element in the text query and then puts them together to fit the text description.

In short, for realistic and conventional image generation, MidJourney is able to make sense of the text query. However, the resulting images, at least in my exploration, are far from what the keyword “photorealistic” describes. Being a CS student and a user of generative AI tools, I think realistic and conventional images are always the hardest for AIs to generate, since we, as humans, have firm expectations of what should come out, and any deviation from those expectations makes the images look unrealistic.

For the second category, I went in a completely different direction – abstract. In this exploration, I wanted to see how MidJourney would make sense of an abstract text description and generate images whose appearance we cannot predict ahead of time. Here’re some of the text queries:
Text query: starry sky viewed on the top of a volcano, photorealistic
Text query: starry night with galaxy that stretches to infinity, lava erupts out of a volcano, photorealistic
Text query: a man in red jacket skiing down the mountain in high speed, there are pine trees, photorealistic

Link to the images: viewtopic.php?f=86&t=364

The text queries I chose are abstract in nature. The resulting images, correspondingly, are also abstract. There are no elements that specifically reference forces, gravitational fields, or time and space. Instead, MidJourney utilizes simple geometries and variations in colors to create a “vibe” that fits the text description, which is completely different from the first category. I think this tells something about how MidJourney goes from first understanding the text query to later generating the images. When the text query is abstract, abstract representations in the images are used.

One more thing I noticed when exploring the second category is how varying the adjectives can impact the image generation. The words that reference the “style” set the tone of the images. It also tells me that using high level text descriptors is more useful than giving specific detailed descriptions.

For the next exploration, I did something in between – to test MidJourney’s ability to generate images that are imaginative but not totally abstract.

Here’re some of the images I generated with their corresponding texts.

Text query: tree house

Text query: tree house, realistic

Text query: realistic, tree house, a crooked path

Text query: photorealistic, sophisticated tree house in warm lighting, sourrounded by flowers

Text query: realistic, tree house, a crooked path, viewed from bottom

Text query: realistic, tree house, a crooked path, birdview

I don’t get to see a lot of tree houses in my daily life, so I was really excited about what MidJourney could come up with. From this set of tree house images, I was amazed by MidJourney’s ability to generate imaginative images. Some of the tree house images it generated present very fine details – much like the real tree house images I can find on Google. Yet some of the images are not realistic. Some have a very fragile bottom, making it impossible to sustain the weight. I really like how generative and imaginative those tree house images are. They look like kids’ drawings.

In the end, I proceeded to explore the effect of viewing angles. By default, the images MidJourney generates have a straight-on angle, much as if someone were taking a photo from the front. I tried the keywords “birdview” and “viewed from bottom”. I think in general those words are effective in steering the overall image generation.

In conclusion, I tried three types of image generation with MidJourney. In short, MidJourney is not good at generating photorealistic conventional images. However, MidJourney excels at generating imaginative and abstract images. In addition, the keywords that reference the style or viewing angle made a huge impact. When referencing something in the text query, it’s better to give a description of the object or concept you want rather than directly giving the name. In one sentence: show, don’t tell.

Posted: **Tue Oct 25, 2022 12:45 pm**

From my past experiments, I realized strict control with predetermined expectations often leads to failure in MidJourney. My assumption is, MidJourney should be used as an inspirational tool rather than a production assistant in the design process - it helps the author to formalize ideas, brings his concepts to the perceptual level of his consciousness, and allows him to grasp them directly, as if they were percepts. The author should leave creative space for MidJourney to invent geometry and color, and create instrumental guidance from a higher level.

Strict control through parameters is often executed after the evaluation of the initial result, but will not contribute to the first round. So the main question is: how should I describe something without knowing what it is to MidJourney?

An ancient Chinese battle scene emerged in my mind, and I have a blurred impression of its atmosphere, but I do not know the details that make up this image.
Prompt: traditional Chinese painting, tang dynasty battle scene, 652 ac

This gives me a form of my original idea. Although it is abstract, it formalizes the atmosphere in my imagination. And I like the battlefield not being specific at this moment, as I'm not sure whether it should be a realistic battle scene. This image will be used as a prompt for the following.

Prompt: https://s.mj.run/ISejA0xPAPY traditional Chinese painting, tang dynasty battle scene, The Eighteen layers of Chinese Hell

Prompt: https://s.mj.run/ISejA0xPAPY traditional Chinese painting, tang dynasty battle scene, Chinese Hell, punishment --no frame

The style has stabilized. I will use this image as a prompt for the following steps and build a narrative based on it.

Prompt:
https://s.mj.run/vlH7zAeMEr0 traditional Chinese painting, tang dynasty battle scene, Chinese Hell, godzilla fight transformers

I started with an idea that appeared in my mind but did not leave a clear image, and I struggled with describing its form. MidJourney was useful in the way it built up the style and invented the approach towards the last image. The text prompt should not act as a superimposition upon the image, but should be a redirection of the image's evolution. The formalization of the ideas comes from the approach MidJourney builds along with the text prompts, not from a top-down command.

Posted: **Mon Dec 05, 2022 10:47 pm**

We used Midjourney for the first three weeks of class, and I've noticed certain qualities of the resulting images.

Midjourney seems to utilize the whole canvas for placing visual elements. Interestingly, almost none of the images generated by Midjourney has empty compositional space. The empty spaces I noticed so far are frames and white gallery walls, which statistically are clean and without any content. Midjourney even puts some writing on framed images' backboards (see Jack's cave painting explorations). I noticed that using a low-quality parameter (for example, --quality 0.25) to force image generation to stop at an earlier phase helps with creating this empty space. Still, this approach tends to result in less complex foreground objects.

Additionally, prompt content changes the color scheme dramatically. If the prompt refers to a concrete concept (like grass, water, red car, etc.), the image delivers this object correctly. However, when the idea is more abstract (like spatial hypergraphs, symmetry, and asymmetry), Midjourney gives orange-blue image outputs. At first, I liked this color scheme, but it got boring quickly. If this is not desired, the --no orange, blue parameter removes the tendency to create these colors.

Some results from Jiarui's abstract explorations:

So far, I have systematically explored many of the global parameters that Midjourney exposes to end users. These parameters are suitable for controlling specific aspects of image generation (for example, the chaos parameter affecting the frequency of the initial noise of the image, or the image weight determining the importance of an initial image). However functional these parameters are, they still don't cover enough terrain to help us discover our desired final image.
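For anyone who wants to run parameter sweeps more systematically, here is a small Python sketch that assembles prompt strings in the form quoted in this thread (an optional image URL first, then the text, then `--name value` flags such as `--quality`, `--no`, `--ar`, and `--seed`). The helper name `build_prompt` is made up for illustration; it only formats text to paste into the bot and does not talk to Midjourney or check that a flag is valid.

```python
def build_prompt(text, image_url=None, **params):
    """Join an optional image URL, the prompt text, and `--name value` flags.

    Flags are emitted in the order they are passed. List or tuple values
    are joined with ", " (as in `--no orange, blue`).
    """
    parts = []
    if image_url:
        parts.append(image_url)
    parts.append(text.strip())
    for name, value in params.items():
        if value is None:
            continue  # skip flags that were left unset
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)
        parts.append(f"--{name} {value}")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_prompt("electrified slime molds", ar="16:9", seed=2039))
    # Sweep one parameter while holding the seed fixed.
    for quality in (0.25, 0.5, 1):
        print(build_prompt("spatial hypergraphs",
                           no=["orange", "blue"], quality=quality, seed=2039))
```

Holding the seed fixed while changing one flag at a time is one way to see what each parameter actually contributes.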

Zooming out from the specifics of the image output, prompting the desired image in its every detail seems impossible. As Jack has pointed out, the alt text of images in the training set doesn't describe every element in a given image. Since this is the case, our prompts need to include the concepts rather than the objects in the scene, or in other words, the general vibe of the scene rather than an itemized portrayal of everything in the composition.

Text prompt: electrified slime molds --ar 16:9 --seed 2039

When we work in this fashion, we sign up to give up the precise control of an image composition by accepting the collaboration between AI and humans. In this sense, Midjourney is more akin to a surprise machine than a digital creative butler. Memo Akten points out that the distributed consciousness dreams the concepts we asked the device to imagine -- we can only discover serendipitous images amid a vast landscape of AI-generated imagery.

Tiny evolution trees on two variations of the above image:

As we discover, we click on variations -- as we click on variations more and more, a sort of evolution takes place. A similar evolution took place in 1993 with Karl Sims' exhibition on Genetic Images, where audience members picked variations of visuals generated by genetic programming.

It is possible that these types of techniques will challenge yet another aspect of our anthropocentric tendencies. We have difficulty believing that we ourselves were not designed by a god but arose by accident via natural evolution. Similarly, we may find it difficult to believe that artificial evolution can compete with our design abilities and perhaps even surpass them.

Posted: **Tue Dec 06, 2022 2:23 pm**

This week, instead of practicing with MidJourney, I reviewed the history of AI image generators. The slides cover the historical development of AI image generators before diffusion models.

Attachment: Development of AI image generators.pdf (14.02 MiB)

Phase 0: Deep Dream and optimization-based Style Transfer
Image generation is an iterative process: the input photo's pixel values are updated under the guidance of a loss function. The loss for Deep Dream is designed to maximize the activation of certain neurons. The loss for style transfer consists of a style loss and a content loss. Style loss measures the difference between the feature maps in the style layers of the result image and those of the style reference image. Content loss measures the difference between the feature maps in the content layers of the result image and those of the input (content) image.
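The losses described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code behind the slides: feature maps are assumed to be already extracted and are represented as a list of channels, each a flat list of activations, and the style loss compares Gram matrices, which is the formulation used in the original optimization-based style transfer work (the post itself does not say how feature maps are compared). All function names are made up for illustration.

```python
def mse(a, b):
    """Mean squared difference between two equally long flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def deep_dream_loss(activations):
    """Negative mean activation: minimizing it maximizes the activation."""
    return -sum(activations) / len(activations)


def content_loss(result_feats, content_feats):
    """Compare feature maps directly, position by position."""
    per_channel = [mse(r, c) for r, c in zip(result_feats, content_feats)]
    return sum(per_channel) / len(per_channel)


def gram(feats):
    """Gram matrix: averaged inner product between every pair of channels."""
    n = len(feats[0])
    return [[sum(x * y for x, y in zip(fi, fj)) / n for fj in feats]
            for fi in feats]


def style_loss(result_feats, style_feats):
    """Compare channel correlations, ignoring where things are in the image."""
    g_result, g_style = gram(result_feats), gram(style_feats)
    per_row = [mse(r, s) for r, s in zip(g_result, g_style)]
    return sum(per_row) / len(per_row)


def total_loss(result, content, style, content_weight=1.0, style_weight=1.0):
    """Weighted sum that the pixel updates would try to minimize."""
    return (content_weight * content_loss(result, content)
            + style_weight * style_loss(result, style))


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[2.0, 1.0], [4.0, 3.0]]  # same activations, positions swapped
    print(content_loss(a, b), style_loss(a, b))
```

In the small example, swapping spatial positions changes the content loss but leaves the style loss at zero, which is why the two terms can pull toward the layout of one image and the texture of another.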

Phase 1: Both autoencoders and GANs construct a CNN model that maps a vector into an image. In other words, the model creates a mapping from a vector space to an image space. The vector space is the so-called "latent space". Animations were created by performing a "latent walk": interpolating vectors from one known vector to another and generating intermediate images between the images corresponding to the two known vectors.
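A latent walk is easy to sketch. The Python below assumes plain linear interpolation between two latent vectors (some projects use other interpolation schemes) and leaves out the decoder: in a real setup each vector returned here would be fed to the trained generator to produce one animation frame. The function names are made up for illustration.

```python
def lerp(a, b, t):
    """Linear interpolation between vectors a and b, with t in [0, 1]."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def latent_walk(z_start, z_end, steps):
    """Return `steps` latent vectors evenly spaced from z_start to z_end."""
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    return [lerp(z_start, z_end, i / (steps - 1)) for i in range(steps)]


if __name__ == "__main__":
    # Two hypothetical 2-D latent vectors; real latent spaces have
    # hundreds of dimensions.
    for z in latent_walk([0.0, 2.0], [4.0, 0.0], 5):
        print(z)  # each z would be decoded into one frame
```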

Phase 1 side track: an improvement on the optimization framework, attaching a procedural image generator at the front of the framework and performing the optimization in parameter space instead of pixel space. This enables SVG creation and a different look from traditional CNN-generated images.

Phase 2: StyleGAN2 uses multiple methods to improve the fidelity of the generated results significantly. By reorganizing the traditional latent space into a more structured latent space, called style space, it made possible style controls such as changing the facial expression and hair style of generated human faces. The same model was used to generate high-fidelity variants of artworks, landscape photos, manga characters, etc.

Phase 2.2: Before diffusion models became popular, people were expanding the usage of StyleGANs to multi-modal conditional image synthesis. All kinds of encoders were invented to map the various user control conditions into style-space vectors, which were then decoded by StyleGAN3 into images.
