wk10 - DALLE-2, Final Review Dec 1, 2022

Post Reply
glegrady
Posts: 203
Joined: Wed Sep 22, 2010 12:26 pm

wk10 - DALLE-2, Final Review Dec 1, 2022

Post by glegrady » Fri Sep 16, 2022 8:10 am

wk10 - DALLE-2 Final Review, Dec 1, 2022

Your review document should include thoughts about the following:

1. A look back at your whole exploration task. What did you expect the text2img tools to be like? What did you want to create with these tools at first, and how did your purpose/topic evolve? How did the tools promote this evolution?

2. An analysis, from an HCI point of view, of how different features in text2img tools shaped your way of thinking and creating. For instance, how did your creative process differ between using DALL-E 2, Midjourney, and Stable Diffusion?

3. A cross-comparison among these tools in terms of creative style, composition, parameter controls, and iterative creation, and whether there are MIX & MATCH collaborative working possibilities among these tools.

4. A discussion, from the authorship point of view, of copyright and creation. How do you give your aesthetic to the tools instead of taking style out of the tools?

5. An exploration of the future forms and potentials of AICG (AI Content Generation). Bring the idea of text2image generation to other creative domains you like and try to design a possible workflow (for example, for text2music generation, the pipeline might be to convert text to image, then image to sonification).

The one-on-one meeting is on Nov. 29, the presentation is on Dec. 1st, and the due date for the written report of your final presentation is on Dec. 6th.
George Legrady
legrady@mat.ucsb.edu

tinghaozhou
Posts: 6
Joined: Mon Sep 26, 2022 10:24 am

Re: wk10 - DALLE-2, Final Review Dec 1, 2022

Post by tinghaozhou » Thu Dec 01, 2022 12:38 pm

What is the aesthetics of computational images? The class we are in posed this fundamental question right at the beginning of the first session, and I have been pondering it throughout the whole quarter. When we speak of the “aesthetics” of anything, what does it mean? What kind of premise does it crystallize? We might start in a familiar realm by asking, “how does the image look?” This question is preliminary but no less foundational, as the word aesthetics is first and foremost associated with the formal or stylistic constitution of an artifact. A computational image is an artifact as such, and though generated by the AI itself, it is still worthy of an in-depth formal analysis. This kind of formal analysis, however, is a transmedia practice—it attends not only to the visual realm of perception (the image itself) but also, equally importantly, to the communication system we adopt to interact with the AI (the prompt). Seeing our formal analysis as a kind of transmedia practice reminds us that text-image generation is not a direct translation process but more of a transfiguration, driven by an internal learning process whose mechanism has not yet been fully understood (by us, at least).

Take MidJourney, for instance: I believe it is one of the more user-friendly platforms, one that actually welcomes our probing into how the learning process works, as its interface gives a clear template for how we can “communicate” with the AI. First and foremost, we can start with a prompt I put in: “Wangechi Mutu, low skyscrapers, buildings carpeted with grass and smiley-faced flowers, pollens, animations from the early 1900s, close up.”
Vislab-MAT1_Wangechi_Mutu_low_skyscrapers_buildings_carpeted_wi_67862712-1127-4326-bf5b-2cdf64e985c9.png
Vislab-MAT1_Wangechi_Mutu_low_skyscrapers_buildings_carpeted_wi_b7b16737-b0d0-4c78-821e-dc5fd7adc73f.png
Again, as I analyzed in my report, the images I got represented a form of consistent aesthetics (if we are doing a stylistic analysis right now): the style of color-pencil drawing remained across multiple generations, largely thanks to the “animation” prompt; the color pattern is similar, and similar eye/flower objects appear across the generated images, attributable to the style of “Wangechi Mutu,” I assume. The consistency of the aesthetics and generations on MidJourney surprised me because, if we compare it with the other bots and platforms we experimented with, this level of consistency is a rare find. Even with DALL-E, an even more detail-oriented and more straightforward platform, I could hardly find a similarity like this, not to mention the other Stable Diffusion tools we used.
DALL·E 2022-12-01 09.36.42 - a pregnant woman, a robot home, pollens, animations from the early 1900s.png
DALL·E 2022-12-01 09.37.57 - Wangechi Mutu, a pregnant woman, a robot home, pollens, animations from the early 1900s.png
DALL·E 2022-12-01 09.39.06 - afrofuturism, a pregnant woman, a robot home, pollens, animations from the early 1900s.png
DALL·E 2022-12-01 09.38.18 - Wangechi Mutu, a pregnant woman, a robot home, pollens, animations from the early 1900s.png
DALL·E 2022-12-01 09.37.00 - a pregnant woman, a robot home, pollens, animations from the early 1900s.png
A formal analysis, attention to styles and details, now prompts us to see that there are certain mechanisms behind the different AI platforms which make it possible to argue that this is very much a machine learning process, but one with a strong human/designer tone. Firstly, MidJourney is designed to be a more interactive platform that builds in functions for modification, scaling, and further deep learning, possibly incorporating users’ interactive attempts as learning signals. Secondly, in comparison with other platforms, MidJourney attends more to the so-called “artistic” dimension of the image instead of pursuing total accuracy (I found that DALL-E created the most detailed images, ones “loyal” to the prompt I put in, whereas MidJourney seemingly tended to the style at large rather than the details of the prompt, creating an atmosphere for the image but not a narrative per se).

But a formal analysis can tell us something else. If a computational image is generated through a process of learning from the input data sets, then could we consider computational images an archive of systematic or, as George poignantly pointed out, community biases? I want to use one of Jack’s generated images as an example, as I think it is very much a haunting image of contemporary American society. In one of his images, generated with the prompt “first person shooter in a catholic church; 4k; ultrarealistic; protestors in a strip mall,” we see a collage of seemingly newsreel shots. Without showing any actual faces, the image displays shocking details of costume and scenes of possible religious conflict and gun violence.
1667337245_seed_3363458392645591595_upscaled_2_sharpened_2.png
In the upper center of the image, we see a male figure clearly wearing an Arab thobe with a keffiyeh on his head; behind him is a soldier figure holding a sign board, with the American flag on his back. In the bottom part of the image, a man is clearly holding a machine gun aimed at an unknown target. This constellation raises a significant question: which part of the prompt could possibly trigger an image production like this, and what does it actually imply? Who would consider a Catholic church a target, for example? Who would be protested against by the people on the street? What kind of sociopolitical, religious, and national anxiety is being generated in this one image? Essentially, did the AI simply answer this question in a brutally straightforward way: what do we, as a society, associate Islamic culture with when we see people in thobes and keffiyehs? Then, does the machine pinpoint a different form of aesthetics, a kind of politics that responds to a pressing social mentality?


Building upon ancient Greek political philosophies, Jacques Rancière diagnoses that the operation of modern politics is actually based on a process called “the distribution of the sensible.” The “distribution of the sensible,” I believe, is an essential mechanism that delimits the different extended spaces in which sensory perceptions can take form and assigns specific political roles to different social actors based on that distribution. “This apportionment of parts and positions,” as Rancière explains, “is based on a distribution of spaces, times, and forms of activity that determines the very manner in which something in common lends itself to participation and in what way various individuals have a part in this distribution.” To put it simply, in a political organization, access to particular forms of sensory experience and knowledge is determined by specific social roles, which in turn determine one’s political position in the organization/community. The potential of politics, for Rancière, lies in the aesthetic regime of art. For him, artistic practice is a “displaced form of visibility,” which makes possible the “redistribution of the sensible” and thus the disruption of predetermined political orders and hierarchies. If we consider that the machine’s image-making partakes in the political project of “redistributing the sensible,” then we might be able to see that the text-image generator AI is a strong tool for social and political analysis.

lu_yang
Posts: 9
Joined: Mon Sep 26, 2022 10:23 am

Re: wk10 - DALLE-2, Final Review Dec 1, 2022

Post by lu_yang » Sun Dec 04, 2022 7:59 pm

DALL-E 2 Experiments
My previous tests failed when using long, ambiguous text as prompts, so this is a continued exploration in that direction in DALL-E 2.

The following texts are excerpted from Lebbeus Woods' article "CELEBRATING DEATH":
https://lebbeuswoods.wordpress.com/2012 ... ing-death/
Prompt: Hangar 17 at the JFK International Airport in New York City contains some of the strangest objects we might expect to encounter under the description artifacts. Twisted steel beams; battered and burned cars and ambulances; odd personal items bearing the traces of violence; items from a mall once lively with customers but no more—this is the stuff of many possible memorials to the 9/11 terrorist at
1.png
2.png
3.png
4.png

The following texts are excerpted from Lebbeus Woods' article "TIMESQUARE":
https://lebbeuswoods.wordpress.com/2008 ... imesquare/
Prompt: The abandoned New York Central railyards, between 57th and 72nd Streets on the Hudson River, were—in 1987—one of the last remaining Manhattan sites ripe for large-scale commercial and residential development. Unlike the 34th Street railyards, which were and remain active to the present day, the vast Upper West Side property had been unused for many years. Or almost. At the northern end, at the ent
5.png
6.png
7.png
8.png

The following texts are excerpted from Kevin Stewart's article "Ch.4 Lebbeus Woods: Conflict and Space," where he describes Woods' work and intentions.
http://www.kevstewart.co/blog/2018/3/27 ... beus-woods
Prompt: A sweeping structure seems to come alive with energy as if activated by the raw power of an earthquake. The scale of this structure is especially emphasized when it is juxtaposed against the existing urban fabric, which is much more restrained in scale
9.png
Prompt: The structure is a highly articulated assemblage of surfaces that takes on a monumental quality as a whole. Beyond mere dynamism, this structure exhibits moments of fantastic structural acrobatics, arcing in a great span across an expanse of water. The individual pieces are derived from a similar tectonic logic and seem to self-generate their own logic of assembly
10.png


Summary

My exploration focused on text-to-image AI as a design assistant tool. My original approach was to start with a certain goal or expectation of what I wanted to produce, then use the AI to iterate towards that expectation. However, this was unsuccessful because:
1. The AI does not fully interpret the prompt's intention, especially abstract concepts or specific terms that don't belong to its training dataset.
2. It is extremely hard to control the regional composition within the image. Outpainting and inpainting provide partial modification of the image, but they face the first problem.

My later approach started with a blurry concept that had not been formalized from the beginning. Rather than a production tool that generates formalized ideas, think of AI as a tool that formalizes the author's ideas, brings his concepts to the perceptual level of his consciousness, and allows him to grasp them directly, as if they were percepts. The author should leave creative space for the AI to invent geometry and color, and provide instrumental guidance from a higher level. With this approach, the problems faced by the first approach can be addressed by:
1. Structuring the prompt based on cultural stereotypes or the relation between search keywords and images on the internet, then using the AI to formalize the idea rather than to meet an expectation.
2. Using iterations in MidJourney or a customized training method such as DreamBooth on Stable Diffusion to invent a compositional style rather than superimposing some consolidated concept from the author.

Here an interesting question arises: who is the author of an AI-generated work?

In my opinion, there are three groups involved: meta-system designers, system designers, and users, among whom the system designer is most likely to own the authorship:

• Meta-system designers are the scientists, engineers, and developers who architect text-image models such as Stable Diffusion, i.e., the generative models that produce images.
• System designers are the technical artists who design generative systems from the meta-model through data training and feature modification.
• Users are the artists who design their prompts to get the desired images and do not deal with system modifications.

In most cases, system designers have the greatest capacity to control the generated results, and during the design process they have anticipated and engineered the connection between a user's prompt and the image it creates. To own the authorship of AI-generated images, authors should be the dominant power in the design/generative and quality-control process. It is true that users can control their work through the prompt, but system designers apparently have more control over the results by manipulating the generative system.

Design Tool Application

In the Introduction to Game Development course I'm TAing, most students are from computer science backgrounds yet have to produce highly hybrid works of tech and art. This is where text-to-image AI comes in as a great art tool.

Here is one example that demonstrates how a student used images created by Stable Diffusion to support the narrative of his game. As an indie developer, he has to focus more on the game mechanics and the programming side; in the meantime, the AI provides a great opportunity to visualize his story, despite the inconsistency between these images and the style of his game. This supports my second approach stated above: these images should be treated as inspiration rather than final products. When the AI struggles to formalize the intangible ideas flowing through someone's mind, we are still able to perceive and interpret, and then we can recreate and craft.

https://youtu.be/NhyyM_0g-FI
Credit: Tianrui Hu, “Operation Chronos”, ECE 194M - Introduction to Video Game Development, Fall 2022, Instructor: Dr. Pradeep Sen

jiarui_zhu
Posts: 7
Joined: Mon Sep 26, 2022 10:25 am

Re: wk10 - DALLE-2, Final Review Dec 1, 2022

Post by jiarui_zhu » Mon Dec 05, 2022 1:20 pm

This report consists of two parts: my DALLE-2 exploration and my final report.

First of all, the DALLE-2 exploration.
I followed the same pattern as when I explored MidJourney and Stable Diffusion in the previous weeks. I am mainly interested in DALLE-2's ability to generate high-quality photorealistic images as well as abstract images. At the same time, I want to see how the text prompt is understood and translated by the software.

One of my favorite prompts is the tree house generation, as a tree house is a combination of both nature and culture. When people think of a tree house, there is some expectation, but it also leaves a lot to the imagination.

Prompt: A photorealistic tree house on a mountain, a crooked path leading towards it
Here's a list of images I got:
DALL·E 2022-12-05 12.13.06 - A photorealistic tree house on a mountain, a crooked path leading towards it.png
DALL·E 2022-12-05 12.12.58 - A photorealistic tree house on a mountain, a crooked path leading towards it.png
DALL·E 2022-12-05 12.12.51 - A photorealistic tree house on a mountain, a crooked path leading towards it.png
DALL·E 2022-12-05 12.12.43 - A photorealistic tree house on a mountain, a crooked path leading towards it.png
At first glance, the resulting generations are not as photorealistic as the Stable Diffusion results from previous weeks. The images are more of a painting style rather than real photos. And in some of the images, the surface of the house and the stairs are a little blurry. They all lack the fine details that make an image photorealistic.
From the prompt-understanding point of view, I feel there is a misunderstanding. DALLE-2 understands "tree house" more as a house with a tree, not the two together. I think that at least in the treehouse generation, both MidJourney and Stable Diffusion have a closer understanding of what "treehouse" means.
However, one element that surprised me in the final generation is "the crooked path leading towards it." If you take a look at my previous generations with MidJourney and Stable Diffusion, you will find that the "crooked path" is never represented very well in the images. With DALLE-2, however, I found the "crooked path" naturally incorporated into the image.
Given that the crooked path is a very detailed description within the text prompt, I find that DALLE-2 handles long, descriptive text prompts better.


Then, I tried DALLE-2's unique image-edit feature. To use it, all I need to do is choose an image, put the part I want to edit in the generation frame, erase some of the image, and tell DALLE-2 in text how I want to edit that part.
For this one I chose to set a fire in the treehouse.
DALL·E 2022-12-05 12.23.26 - set up a fire on the treehouse.png
From the image, I can see a fire inside the window. Although the fire is far from real -- it lacks the smoke and the distortion caused by the heat -- I am surprised by how easy it is to simply edit part of the image. I think this feature is particularly useful in a continuous image-generation workflow -- you can keep editing until you are happy.
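For anyone who wants to script this kind of edit instead of using the web editor, here is a minimal sketch assuming the 2022-era OpenAI Python SDK's image-edit endpoint; the file names are placeholders, and the erased region in the web editor corresponds to the transparent area of the mask image here.

```python
# Minimal sketch of the same inpainting edit via the API (assumes the
# 2022-era `openai` Python SDK and its Image.create_edit endpoint).
import openai

openai.api_key = "sk-..."  # your API key

response = openai.Image.create_edit(
    image=open("treehouse.png", "rb"),       # original square PNG
    mask=open("treehouse_mask.png", "rb"),   # transparent where the edit should happen
    prompt="a fire burning inside the treehouse window",
    n=1,
    size="1024x1024",
)
print(response["data"][0]["url"])  # URL of the edited image
```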


I also find that using the keyword "3D render" prompts DALLE-2 to generate high-quality, detailed images.
Prompt: 3D render of a hairy husky sleeping on the grass
DALL·E 2022-12-05 12.28.05 - 3d render of a hairy husky sleeping on the grass.png
Although the intersection of the dog's fur and the grass does not look very real, I am very surprised by how the fur looks in the image. I wonder if adding the keyword "3D render" changes the image-generation process in any way.


Lastly, I tried some abstract prompts.
Prompt: everlasting, hyperdetailed, time and space distorted by huge gravitational field in the universe
DALL·E 2022-12-05 12.25.08 - everlasting, hyperdetailed, time and space distorted by huge gravitational field in the universe.png
DALL·E 2022-12-05 12.25.04 - everlasting, hyperdetailed, time and space distorted by huge gravitational field in the universe.png
DALL·E 2022-12-05 12.24.59 - everlasting, hyperdetailed, time and space distorted by huge gravitational field in the universe.png
The abstract-prompt generations look very much like what I got from MidJourney and Stable Diffusion. It looks like DALLE-2 is able to make sense of the abstract prompt and give a good representation of what the prompt means in the resulting image. The three images I got also have different colors and themes.

----------------------------------------------------------------------------------------------------------

Below is my final report:

For my final report, I decided to cover 1) an analysis of the HCI of different text2img tools and how their different interfaces shape the way I generate art, and 2) the possibilities and potential of combining AI content generation with VR/MR.

Let's start with MidJourney.
The image below shows the general interface of MidJourney (you probably need to click the image and enlarge it):
Snip20221205_39.png
It is hosted in Discord, and the text prompt is entered via a chat box. Once the text prompt is entered, it takes about 30 seconds to generate the images. By default, each generation results in four images, and you can easily vary or upscale an image with the click of a button.
Snip20221205_40.png
MidJourney also provides advanced features to customize the image generation. For example, you can specify stylize values, quality values, advanced text weights, etc. These advanced features are triggered by appending "--feature_name value" after the text prompt, as in the example below.
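For reference, a full prompt with a few of these flags looks roughly like the line below (the flag names --ar, --stylize, --q, and --seed are from the Midjourney documentation as I recall it; exact names and value ranges may differ between model versions):

```
/imagine prompt: a photorealistic tree house on a mountain, a crooked path leading towards it --ar 3:2 --stylize 1000 --q 2 --seed 1234
```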
Snip20221205_41.png
MidJourney has a community showcase, where you can find the trending and top images generated by MidJourney. You can see the image, the text prompt, and the user name for each generation. I find this feature particularly useful for beginners learning effective text prompts. It also helps me get to know the limits of the software: what is possible and what is not.


Stable Diffusion:
For Stable Diffusion, I mainly use Lexica as the generation engine. The image below shows the general interface of Lexica.
Snip20221205_42.png
Lexica, unlike MidJourney, is hosted on a web page. People can use it as a web application without the trouble of getting a Discord account. The text prompt is entered in the text box, and there is a negative-prompt field to enter anything you want to exclude from the resulting image.
Stable Diffusion (Lexica) also offers some advanced features to customize the image generation. Users can change/add these features by checking a checkbox or adjusting a slider.
Users can also search the generated images using the search feature.
Snip20221205_43.png
One interface element that is unique to Stable Diffusion is the "Explore this style" feature, which allows you to see the results of similar generations/text prompts by other users, along with their detailed parameters. I find it very useful for improving my own text prompts, as I can always learn something from other people's generations.
Snip20221205_44.png
Material Stable Diffusion has a very similar interface to Lexica but offers more parameters you can change.


DALLE2:
Snip20221205_45.png
DALLE is also hosted as a web application. It has a very simple interface where you can enter the text prompt in the text box and generate. However, unlike with MidJourney and Stable Diffusion, I did not find a way to change advanced features or customize the generation process. It also generally takes a little longer to generate the results.
What is unique to DALLE is that you can edit the image easily. I have covered this in my DALLE exploration above. Personally, I find this feature very useful for AI-generated content, as it provides a way to modify the result, just as real artists can modify their art.


I think all the tools above have good user interfaces. I would imagine a typical 10-year-old child could easily learn how to use them to generate images in a short amount of time. And I am very glad that I do not need to "be a programmer" to use them.
Stable Diffusion (Lexica) and DALLE-2 have similar interfaces: web applications with text-box prompt input and nice UIs for the advanced features. Personally, I like them better than MidJourney because they separate the text prompt from the other features; I only need to focus on one thing at a time. In addition, the slider and checkbox UIs are easier to use and more intuitive.

Personally, one big takeaway from using these text-to-image generation tools is that I have found another way to channel my creativity. I am not good at art -- I have never been to art school, and I don't have enough time to sit down and make beautiful drawings either. However, these text-to-image tools offer me a chance to make something that looks good without professional knowledge or skill. All I need to do is think about what I want in my head and enter the description, as detailed as possible, as text. Imagination is the only limitation.
When I took the 3D printing class MAT238 last quarter, I read that some people hold differing opinions about the products of digital fabrication. They see 3D-printed products as "inferior" because the creator usually doesn't have "what it takes" to make the final product by hand; instead, they hand it over to the machines and let the machines do all the work for them. I would assume similar arguments exist in the art world too. Soon there will be questions like "should we consider AI-generated content art?" or "what does it mean to be a real artist?" I am not here to answer those questions. However, as an end user of these tools, I am definitely benefiting, since they empower me to do something I would never have thought of before -- generating fun, good-looking, meaningful images.

---------------------------------------------------------------------------------------------------

In the end, I want to talk about the future of AI-generated content and explore the potential and possibilities of text-to-image in VR/MR settings.
The metaverse has been a very popular concept recently. A metaverse is a network of 3D virtual worlds focused on social connection, such as VR meetings, VR gatherings, etc. With the help of a headset, you can have your own world.
The following link shows a video demo of part of my ongoing master research project: https://youtu.be/LgRG3vUbqZk
In the project, I implement an avatar you can have a natural conversation with. At its core, I use GPT-3, a generative AI language model, to generate the responses. I think text-to-image generation may have even bigger potential here. A few applications I can immediately think of: virtual art galleries, decorating your own virtual space, and virtual 3D applications such as Monaverse. Generative images will bring more excitement and imagination to the metaverse.
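As a rough illustration of the avatar's dialogue step, the sketch below assumes the 2022-era OpenAI Python SDK and a GPT-3 completion model; the persona prompt and helper name are purely illustrative placeholders, not my project's actual code.

```python
# Illustrative sketch of the GPT-3 response step (2022-era `openai` SDK);
# the persona prompt and helper name are hypothetical placeholders.
import openai

openai.api_key = "sk-..."

def avatar_reply(user_utterance: str) -> str:
    prompt = (
        "You are a friendly avatar in a virtual gallery. Answer briefly.\n"
        f"Visitor: {user_utterance}\nAvatar:"
    )
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=80,
        temperature=0.7,
    )
    return response["choices"][0]["text"].strip()

print(avatar_reply("What do you think of this painting?"))
```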
The following image shows the general workflow for interacting with AI-generated content in VR/MR.
1. It starts with a voice command to activate the general interface. I think the most natural way of input in VR/MR is through voice, and there are existing speech-to-text methods, the Voice SDK in Oculus for example, for us to use. We can also train a model to understand the voice input and trigger the appropriate functions.
2. Voice input of the text prompt, plus slider adjustments. The prompt can be input via voice command with speech-to-text. For advanced features such as text weights and the size of the resulting image, we can implement sliders for users to adjust.
3. Calling the model API. Most text2img tools have existing APIs for us to use. We can wrap the user's input and call the API for image generation.
4. Finally, once we get the image back from the model API, we can let the user do some post-processing and place the image for use. (A minimal code sketch of this pipeline is included below.)
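Here is a minimal Python sketch of steps 1-4, purely as an illustration: transcribe_voice() and place_in_scene() are hypothetical stand-ins for the Voice SDK and the Unity-side placement logic (which would really be written in C#), and the generation call assumes the 2022-era OpenAI image API as one possible backend for step 3.

```python
# Illustrative pipeline sketch only; transcribe_voice() and place_in_scene()
# are hypothetical placeholders, and the image call assumes the 2022-era
# OpenAI Python SDK as one possible text2img backend.
import openai

openai.api_key = "sk-..."

def transcribe_voice(audio_clip=None) -> str:
    # Steps 1-2 placeholder: a real app would use the Oculus Voice SDK
    # to turn the spoken command into text.
    return "a small virtual art gallery wall with paintings of nebulae"

def generate_image(prompt: str, size: str = "1024x1024") -> str:
    # Step 3: wrap the user's input and call the text2img API.
    response = openai.Image.create(prompt=prompt, n=1, size=size)
    return response["data"][0]["url"]

def place_in_scene(image_url: str) -> None:
    # Step 4 placeholder: download the image and map it onto a quad in the scene.
    print(f"Placing {image_url} on a quad in the VR scene")

place_in_scene(generate_image(transcribe_voice()))
```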

Having worked with VR/AR development in Unity for a while, I believe the general workflow above could very plausibly be implemented successfully. Obviously, I have made a lot of assumptions, and the workflow could differ from platform to platform or from application to application. However, it could serve as a foundation to explore the vast potential of AI-generated content in VR/MR.

My final presentation slide can be found here: https://docs.google.com/presentation/d/ ... sp=sharing
Attachments
Snip20221206_46.png
Last edited by jiarui_zhu on Tue Dec 06, 2022 9:31 pm, edited 2 times in total.

wqiu
Posts: 14
Joined: Sun Oct 04, 2020 12:15 pm

Re: wk10 - DALLE-2, Final Review Dec 1, 2022

Post by wqiu » Tue Dec 06, 2022 2:10 pm

Part #1: Comparison of conditional image-to-image translation between StyleGAN3 + p2s2p and Stable Diffusion

Presentation slides:
Presentation MAT 255 F22.pdf
(7.39 MiB) Downloaded 55 times

I used Stable Diffusion as a module in a larger project. This module is in charge of translating a raw pose image into a photorealistic image of a fencer in that pose.
Screenshot 2022-11-30 at 7.33.31 PM.png

This was previously achieved with a StyleGAN3 model trained on cropped fencer photos and a p2s2p style encoder trained on pose-photo pairs. Once the two models are trained, given an arbitrary pose, the p2s2p encoder can encode the pose image (a stick-figure image) into a latent style code, which is then fed into StyleGAN3 to generate a cropped fencer photo.
Screenshot 2022-12-06 at 1.41.44 PM.png
This method suffers from difficulty in training and from the limited expressiveness of the model. This is inherent to the GAN architecture, which cannot model complex images with much variability very well.

When the diffusion model was brought in to solve this problem, I used the text-guided image-to-image translation function provided by Stable Diffusion. With proper tweaking of the parameters and the text prompt, I can generate results in my desired style, even though I don't have a dataset of images in that style.
Screenshot 2022-12-06 at 1.41.49 PM.png
Currently the diffusion model still suffers from issues of consistency and speed. I hope to resolve them in the future.
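For reference, the text-guided image-to-image step described above can be sketched with Hugging Face's diffusers library roughly as follows; the checkpoint, prompt, strength, and guidance_scale values here are illustrative assumptions, not the exact settings used in the project.

```python
# Minimal img2img sketch with the diffusers library; checkpoint, prompt,
# and parameter values are illustrative, not the project's actual settings.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

pose = Image.open("stick_figure_pose.png").convert("RGB").resize((512, 512))

result = pipe(
    prompt="a fencer in a white fencing suit lunging, studio photo, plain background",
    image=pose,            # the raw pose image to be translated
    strength=0.75,         # how far the result may deviate from the input image
    guidance_scale=7.5,    # how strongly to follow the text prompt
).images[0]

result.save("fencer_from_pose.png")
```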

Let's compare the two methods in the following chart:
Screenshot 2022-12-06 at 1.41.38 PM.png
Image Quality: StyleGAN3 can produce very good image quality that feels near-indistinguishable from its training set of images, but artifacts may appear when the condition, in this case the input pose, is too different from the common ones. Stable Diffusion can create images of comparable quality and, additionally, performs better in extreme cases.

Limited Data: StyleGAN3 relies heavily on the dataset for training. My dataset of 4,000 pose-image pairs was barely enough to train it. What makes Stable Diffusion shine is that it requires no training dataset to use. It can also be fine-tuned with a few sample images to specialize the model at generating a certain type of photo.

Consistency: All photos generated by StyleGAN3 share a unified style that is consistent with the training dataset. In this case, the fencers in all generated images wear the same fencing suit. Stable Diffusion, however, performed worse on consistency: fencers wore different fencing suits in different images, even with the same text prompt and parameters. It is more challenging to maintain consistency with Stable Diffusion.

Style Control: For StyleGAN3, the only "style" control is the human pose. There is no control for color, composition, background, etc., because those styles are shared across the dataset and no alternative styles were present in it. GANs work better the more similar the dataset samples are to each other. Therefore, if I needed to generate images in another style, I would have to collect another dataset and train the model on it from scratch, which is time-consuming. With Stable Diffusion, however, I can choose the style of the photo simply by changing the text prompt, without retraining the model. This is because the diffusion model is able to model various image styles simultaneously, and the dataset used to train Stable Diffusion contains a huge range of the styles available on the internet.

Ease of Use: When using StyleGAN3 results for chronophotograph blending, I need an extra step of background removal, which can be done either manually (precise but slow) or automatically via semantic segmentation (fast but rough). It is also limited to translating the pose images one by one. Stable Diffusion, by contrast, can translate a photo containing two fencers in one step and can create images without a background directly. In this way, Stable Diffusion simplified my workflow.


Part #2: Answers to the prompt questions:

1. A look back at your whole exploration task. What did you expect the text2img tools to be like? What did you want to create with these tools at first, and how did your purpose/topic evolve? How did the tools promote this evolution?
- I expected the tool to be fairly simple and that I could describe my image as if I were talking to a human being. When I started to use the tool, I found that it sometimes could not understand my text prompt correctly, or that it understood my prompt in an unexpected way. I adjusted my descriptions to fit the software better. For example, it may not care about grammar, so you can write a very long sentence with some grammatical errors; you can also repeat and rephrase your description to strengthen that part in the final image. As I learned to engineer the prompt, I felt more in control of the software's outputs.


2. An analysis, from an HCI point of view, of how different features in text2img tools shaped your way of thinking and creating. For instance, how did your creative process differ between using DALL-E 2, Midjourney, and Stable Diffusion?
- Through trial and error, I have come to know the keywords that change the look of the final result, so I can use them properly. I began to care less about the text prompt's readability to human beings and more about its effectiveness for the software. The tool shaped my way of producing text prompts. I also have to describe things in more specific ways by adding a lot of detail to the text prompt.
- With text2image tools, I find a sense of achievement in producing something identical to what was in my head, whereas it is the opposite with past generative tools, where you are happier if the tool produces something surprising. The difference lies in how explicit the tools are: traditional generative tools are easy to control (you kind of know what you are doing), but text2img tools are hard to control (you don't know what you are doing).


4. A discussion, from the authorship point of view, of copyright and creation. How do you give your aesthetic to the tools instead of taking style out of the tools?
- I think the tool itself doesn't have particular styles or aesthetics. You can change the final look by tweaking the text prompt. One tool might perform better than another for certain styles. For example, I think DALL-E 2 is bad at generating faces, whereas Midjourney is better at it. With all three tools, most existing aesthetics/styles are already covered and can be produced fairly well.
- A potential way to give your aesthetics to the tools is to question the nature of the tool: synthetic high-fidelity photos. How do they challenge photography? What is the value of these photos in communication? How do we define the authorship of these photos? How do we reconcile high fidelity and authenticity? Generating photos that explore these topics extends the meaningfulness of the results from purely visual artifacts to a cultural investigation and philosophical reflection.

merttoka
Posts: 21
Joined: Wed Jan 11, 2017 10:42 am

Re: wk10 - DALLE-2, Final Review Dec 1, 2022

Post by merttoka » Tue Dec 06, 2022 3:21 pm

Last week, I attempted to use Dall-E 2 to generate visual content for the novel Invisible Cities by Italo Calvino. My original idea was to generate one or two images for each of the 55 imaginary cities described in the book. However, this proved to be an unrealistic goal. First, Calvino's text does not depict the overall form of a city. For example, the city of Valdrada is described as a shoreline city on a hill with many structures reflected in the body of water.
Traveler, arriving, sees two cities: one erect above the lake, and the other reflected, upside down. Nothing exists or happens in the one Valdrada that the other Valdrada does not repeat, because the city was so constructed that its every point would be reflected in its mirror, and the Valdrada down in the water contains not only all the flutings and juttings of the facades that rise above the lake but also the rooms' interiors with ceilings and floors, the perspective of the halls, the mirrors of the wardrobes.
To depict Valdrada, I tried the following prompts.

Prompt: An ancient city on the shores of a lake, night lights reflect on the water, with houses all verandas one above the other, and high streets whose railed parapets look out over the water.
image (1).png
Prompt: An ancient city on the shores of a lake sunset, with houses all verandas one above the other, and high streets whose railed parapets look out over the water.
image.png
Prompt: An ancient city on the shores of a lake civil twilight, with houses all verandas one above the other, and high streets whose railed parapets look out over the water.
image (2).png
Prompt: An ancient city on the shores of a lake night, with houses all verandas one above the other and high streets whose railed parapets look out over the water.
image (3).png
I was impressed by how well Dall-E created the specular reflections on the lake's surface. All of these images have a reflection that matches the surface, which was impossible to generate ten years ago. Even though Dall-E does an excellent job creating a city on the shoreline of a lake, none of these imagined cities feels like Calvino's fictitious Valdrada.
image (4).png
I tried another city in the book, Despina, which sits between the desert and the sea. The silhouette of the city resembles a ship to a camel driver approaching from the desert and, similarly, looks like a camel's back to a sailor coming in from the sea. I tried to capture this duality by using the outpainting tool of Dall-E:
DALL·E 2022-12-01 11.41.58 - a sailor watching left from a ship.png
The image turned out to be more stylized than I imagined, but it is easier to work with since it involves a collage of elements. I didn't want to work with realistic imagery in this example. To create this, I started with the middle skyscrapers and worked along the horizontal axis to make the desert and the sea portions by changing the prompts.

I believe the panoramic quality of this outpainting result is somewhat successful as it can depict the scenery from the novel to a certain degree. I also ran a version of these as individual images:

Prompt: the camel driver watching over a distant city at the horizon from the sand dunes, a shoreline city that looks like a ship with the pinnacle of skyscrapers, a radar antenna, white and red wind-socks, chimney smokes. Italo Calvino.
image (5).png
Prompt: the sailor watches a distant city from the ship, desert background, and shoreline city that looks like a camel with skyscrapers, radar antenna, white and red wind-socks, and chimney smokes. Italo Calvino.
image (6).png
Even though Dall-E's default style seems realistic, something in these prompts lends a cartoonish quality to the output. I decided to use only the name of the city to see what images I would receive:

Prompt: City of Despina
image (7).png
Here we see a realistic generic city landscape with tiny houses spread outwards. The Despina keyword seems to be useless here.

Prompt: Despina
image (8).png
Using just the name of the city produces oddly distorted faces of women. This was surprising to me initially since I was invested in Calvino's novel, but when I searched the internet to see what Despina is, I noticed a beauty contest winner named Despina.

Similarly, one of the small moons of Neptune is called Despina, so running the prompt with the moon should produce images of celestial objects.
Prompt: Despina (moon)
image (9).png
The results still do not refer to Neptune's moon but instead show a picture of a woman in front of the moon.
Prompt: Despina, Neptune's moon
image (10).png
Even in this case, where the prompt clearly states that the moon belongs to Neptune, we see an imagined female figure standing before a moon reflection.

This tangent might not be too important in the case of Despina, the city, the woman, and the moon, but in the bigger picture, it shows that we are not working with the concepts we ask these systems to generate. Instead, we witness the model's hallucinations on the content we have put online.


===========


Looking back at my explorations, I can tell that I spent most of my time trying to understand the limits and affordances of these algorithms. This mainly took the form of trying the same prompt with different parameters such as chaos, quality, guidance scale, etc. At this point, I was expecting more of a deterministic system where similar prompts always produce similar output. However, I soon noticed that this wasn't the case. The relationship between the output image and the prompt is not as straightforward as the semantic meaning of the prompt. Instead, it is contingent on the cultural implications that live on the internet.

I haven't explored the variation features too deeply in any system yet. The furthest evolution tree I have created is not very deep, resulting in shallow mutations in the offspring. Going forward, I would use variation generation to arrive at surprising forms whose "genes" evolve toward my interests.

Regarding interaction, I think the HCI aspect of these tools significantly impacts their adoption. The fact that Midjourney runs on Discord is a little cumbersome for us -- people who are used to desktop software -- yet it enables a whole different population to access these tools on their smartphones. The many Stable Diffusion-based web interfaces and the Dall-E website also reach a similar audience of people who are not comfortable using a terminal on their computers. Also, these models are taxing on computational resources; the requirements for GPU computing power and VRAM make widespread local adoption hard.

An exciting interface that stood out to me was the infinite canvas in Stable Diffusion (or Outpainting in Dall-E). This way of working has great potential since it allows local edits and lets us expand the image as much as we want. In addition, this method could shine by mixing and matching the results of Midjourney, Stable Diffusion, and Dall-E in the same environment.

The future of AICG is exciting and intimidating. During the time we were exploring these algorithms in this class, there were significant breakthroughs every week. We witnessed text2video, text2image editing, text23D, and many more in two months. This is exciting because it enables new ways of working within previously established disciplines and could potentially improve those fields altogether. For example, in my second Stable Diffusion post, I shared an example that applies a Stable Diffusion-generated texture to a ceramic vessel. It uses a seamless texture generated with Material Stable Diffusion to control the surface texture of my ceramic object. It is exciting to find these crossovers between image generation and clay printing.

jkilgore
Posts: 7
Joined: Tue Mar 29, 2022 3:34 pm

Re: wk10 - DALLE-2, Final Review Dec 1, 2022

Post by jkilgore » Tue Dec 06, 2022 11:13 pm

Empty Textbox
Jack Kilgore, December 2022

I was handed a textbox. Not sure by who. I could type in anything and receive a picture back. There was nothing I particularly wanted. I just typed in phrases I plucked off the top of my head. I typed in “Joe Biden fighting Mickey Mouse”. What I received were somewhat cohesive amalgamations of the pictorial associations of “Joe Biden” and “Mickey Mouse”. Two vague bodies entangled, not obviously fighting, sometimes cartoon, sometimes photographic. A haphazard mixture of pop culture stereotypes relating to “Joe Biden”, “Mickey Mouse”, and the two “fighting”. I typed in “a black metal concert outside in the winter”. Same thing. A stew of stereotypes. I typed in a lot of things. I felt nothing for the images I received.

At some point “first person shooter”, “Counter Strike”, and “Halo” fell out of my head; I typed them in. Same thing. Lowest-common denominator associations with the phrase “first person shooter”; but for whatever reason, these images struck me. Skewed perspectives, the essence of a hand and a gun in the bottom right corner, virtual environments full of crude player models in the distance, low polygon structures. The output I received told me to keep typing. I imposed places on my first person shooter prompt: “first person shooter in a McDonalds”, “first person shooter in a wide open field in winter”, “playing Counter Strike in the pits of hell”, “playing Halo in los angeles”. Some were kind of funny, but nothing hit me. I had no desire to iterate upon video games in fast food chains or in LA.

I typed in “first person shooter in a church”. I was given cavernous halls full of altars and priest soldiers, crumpled crosses covering walls, golden light thrown across the scene. I was elated. This is a cultural stereotype that resonates, not sure why, keep going. I typed in “angels playing counter strike in a church; first person shooter”. The machine filled the scene with distorted, winged 3d models covered in guns, helmets and limbs; pointing at each other, pointing at me. Frozen and running, covered in waterproof fabrics, wrinkled and muted, golden light everywhere. I iterated. I skated around this pocket of joined stereotypes, found its grooves, found its logic. I typed “angels playing counter strike in a church; first person shooter; hands first person”. Not sure why, but I do love hands. I also love the machine’s failure to understand hands. It simply saw a series of ‘n’ little stumps, curled around objects, folded into each other. The machine gave me more angels covered in guns, crosses, and limbs. But our angels began to hug each other, each angel a stump. They came in groups of threes, fours, and fives. They lost their faces, they lost their helmets; just clothed stumps, each becoming a finger. The fabric began to fall away from the stumped angels and was replaced with bark, and I was left with twisted sticks covered in dirt sitting in sunlight. Angels, soldiers, fingers, twigs--they’re all a part of the same continuum. The machine gave me the ability to see that. I feel good, the output looks good, enough for me to continue the process. I stayed in this pocket until I was bored. I don’t go on; I go about my day.
Vislab-MAT2_angels_playing_counter_strike_source_first_person_s_e14aac85-d914-4f7a-8033-7b6ae56109b4.png
Vislab-MAT2_hands_first_person_angels_playing_counter_strike_fi_eda0f84b-c354-4861-9d51-ca759c3cf216.png
Vislab-MAT2_hands_first_person_angels_playing_counter_strike_fi_1a82ee0c-1073-42ea-8c87-8f0f28c48ca3.png
A few days later I approached the machine. Again, I typed in phrases I plucked off the top of my head. But the machine affected me the other day. My mind gave me “first person shooter in a cave”. What I received did not strike me. The images were too similar to my last excursion. “It’s time to move on; it’s time to pivot; it’s time to get the machine to affect me again, like the other day, but with new output” I tell myself. I type in “children drawings; cave paintings”. I’m not sure why. What I receive are crude crayon drawings on stone full of scenes that tell me nothing, full of stick figures. The style strikes me, the machine has told me to iterate. I build a scene. I type “children drawings of god; cave paintings”. It’s not just style anymore. I see environments full of stick figure disciples binding with animals, watching crucifixion. A head with arms stretched out on both sides, held up by a line, or held by a cross. So many cyclops, giants, and elephants on stilts. I am filled with ideas of innocence, beautiful incoherent scenes that have the potential to grow and refine with time; I see myself from the other day. More ideas in my head, notions of feedback, still vague, no rationale. The only thing I’m sure of is this: “keep going”. I listened to the machine; I typed: “children drawings of god; cave paintings; first person shooter”. A bit more sure why this time. Again, I was given images filled with crosses and crucifixion, but now images of guns filled the wall, red obliterating a few disciples. The red felt violet. Yellow light superimposed on distorted crosses and figures. The yellow felt violent. I am elated. I suppose violence and god filtered through the lens of a child is low hanging fruit to stir a mind. Either way, I keep going until I get bored.
Vislab-MAT0_children_drawings_of_god_cave_paintings_super_8_c4746129-6e9d-452b-9ae0-65c47823f5a3.png
Vislab-MAT0_children_drawings_of_god_cave_paintings_super_8_fir_9555e106-a4ff-47e8-9cc6-c0a52e00a260.png
Vislab-MAT0_children_drawings_of_god_cave_paintings_super_8_fir_104e59bf-333c-4f24-a55a-60313813a095.png
On my last descent, I saw geometric relations between soldier angels, hands, and twigs. I was full of joy, I was compelled to keep going by some low-level dopamine hit to the head, like a sexual attraction. Basic and powerful. Today, I begin to have ideas, ideas deeper than aesthetic, deeper than geometries and texture. The machine, in its incessant output of crude drawings depicting biblical scenes and war games, told me to introspect. It told me to pay attention to why I am resonating, not just to keep pressing the button that makes me resonate. It tells me there is meaning beyond, past the body. I listened. I stopped typing and went on a walk.
mw2.PNG
I remember Call of Duty. That was a system I believed in. I convinced my parents to let me play Modern Warfare 2 on the Xbox 360. I filled myself with experiences of abstract violence; I loved war games. I loved ranking up, receiving camos for my weapons, prestiging, repeating the process. I wanted to get the final kill in a match, I wanted to quickscope, I wanted to do trick shots off of high buildings, I wanted the predator missile, then the chopper gunner, and then the nuke. This was a reality I could invest myself in. My free time was given to Activision for quite some time.
csgo.jpg
But then I found Counter Strike coming into teenagehood. A more refined game with a higher skill ceiling, a system that I could absolutely lose myself in. I couldn’t stop thinking about Counter Strike. I was obsessed with the system of Counter Strike, the ranks, the tactics, the community driven game modes, the gun skins, the professional world, the betting of gun skins, the online interactions with strangers. I spent the majority of time with my real world friends in the context of Counter Strike. In non-virtual life I was a ghost. I invested little time in finding meaning outside of the context of Counter Strike. All my energy was given to Valve. It gave me back structure, a place that encouraged iteration in the hopes of progress.
lord.jpeg
In parallel to these video game structures, I went to church every Sunday. I prayed every night before bed with my Dad. It was always pitch black, our voices breathily repeating “our father in heaven, hallowed be your name…”. We would then list out specific people to pray for: extended relatives, the sick and dying. At the time this was a habit put upon me by external forces. God, the cross, were symbols that made me feel nothing. I trudged through confirmation, apathetic. My mind was elsewhere: given over to Counter Strike.
I’m back from my walk. This text2image machine showed me aspects of my life that I have left unattended for quite some time. After iterations of its general stew of cultural stereotypes, it led me to the specific, the personal. Those previous resonances brought on by the machine’s output brought to the foreground aspects of myself I would like to think about more, to deal with on my own. The machine told me to stop typing, it told me to go on without it. I listened. I began to draw.
Screen Shot 2022-12-06 at 10.54.53 PM.png
I drew with crayon and ballpoint pen screenshots of Counter Strike games from when I was fourteen. The act of drawing, the images I received from myself stirred memories even more specific than before. Specific scenes filled my mind. I would spend hours on private servers memorizing the spray patterns of guns. Emptying clips into the wall, watching where bullet 1 lands, bullet 2, bullet 3, bullet 30. An upside down ‘L’ shape. I would practice reversing that shape as we shot, offsetting the recoil, collapsing the spray pattern into a point. I could hold down the left mouse button and hit anything I desired. I could kill 5 enemies without pause. Spray control. I would also spend large chunks of my time practicing team tactics. Smokes and flashbangs. Intelligent placement of smokes to obfuscate enemies lines of sight. Well timed flashbangs to temporarily blind enemies. I would finish my homework and play into the night. I remember Tindall and Joker, the people I would play with. People I’ve never met before in person, people that I spent more time with than anyone else in school. Nothing else mattered. I was entrenched in a system that I believed in. Encouraged by the machine that is Counter Strike to keep going. To improve, to rank up, I was filled with meaning. My drawings began to show signs of God, of Christianity. I began to have thoughts connecting the two. Counter Strike, at the time of obsession, was my belief system. Some higher set of structures placed upon me, by the vague entity Valve. My Christian upbringing overridden by war games. God did not provide a structure that enticed my twelve year old self, but online competitive video games did. Still not sure what that means. Not sure what that makes me right now. Now I go on everyday drawing imagery of Counter Strike. I am perpetually filled with vague thoughts of being raised online, new religion, nostalgia, and violence.
the hacker 4.jpg
the hacker 5.jpg
the hacker 6.jpg
the hacker 7.jpg
Text2image is a mirror; the process of using the technology reflects back aspects of yourself, filtered through the greater culture. Cave in, let the software lead you around on a leash. Let the process thrash you around. If you listen to what it says you will discover yourself.

Post Reply