This report consists of two parts: my DALL-E 2 exploration and my final report.
First, the DALL-E 2 exploration.
I followed the same pattern as my MidJourney and Stable Diffusion explorations in previous weeks. I am mainly interested in DALL-E 2's ability to generate high-quality photorealistic images as well as abstract images. At the same time, I want to see how the text prompt is understood and translated by the software.
As you may know by now, one of my favorite prompts is the tree house, since a tree house is a combination of both nature and culture. When people think of a tree house, they have certain expectations, but it also leaves a lot to the imagination.
Prompt: A photorealistic tree house on a mountain, a crooked path leading towards it
Here's a list of images I got:
At first glance, the resulting generation is not as photorealistic as the Stable Diffusion results from previous weeks. The images are more of a painting style rather than real photos. And in some of the images, the surfaces of the house and the stairs are a little blurry. They all lack the fine details that make an image photorealistic.
From the prompt understanding point of view, I feel like there is a misunderstanding. DALL-E 2 understands "tree house" more as a house with a tree next to it, not as a house built into a tree. I think at least for the tree house generation, both MidJourney and Stable Diffusion have a closer understanding of what "tree house" means.
However, one element that surprises me in the final generation is "the crooked path leading towards it". If you take a look at my previous generations with MidJourney and Stable Diffusion, you will find that the "crooked path" is never represented very well in the images. With DALL-E 2, however, I found the "crooked path" incorporated very naturally into the image.
Given that the crooked path is a very specific detail within the text prompt, I suspect that DALL-E 2 is better with long, descriptive text prompts.
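As a side note, the same generation can also be scripted instead of going through the web UI. Below is a minimal sketch using OpenAI's images endpoint (assuming the v0.x openai Python SDK and your own API key; details may differ between SDK versions):

```python
import openai  # pip install openai (v0.x SDK assumed)

openai.api_key = "sk-..."  # your own API key

# Request four candidates, matching what the DALL-E 2 web UI returns per prompt.
response = openai.Image.create(
    prompt="A photorealistic tree house on a mountain, a crooked path leading towards it",
    n=4,
    size="1024x1024",  # supported sizes: 256x256, 512x512, 1024x1024
)

for item in response["data"]:
    print(item["url"])  # each URL points to one generated image
```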
Then, I tried DALL-E 2's unique image edit feature. To use it, all I need to do is choose an image, put the part I want to edit inside the generation frame, erase some part of the image, and tell DALL-E 2 in text how I want that part edited.
For this one, I chose to set a fire in the tree house.
From the image, I can see a fire inside the window. Although the fire is far from real -- it lacks the smoke and the distortion caused by the heat -- I am surprised by how easy it is to edit just part of an image. I think this feature is particularly useful in a continuous image-development workflow -- you can keep editing until you are happy.
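The same edit can also be done programmatically through OpenAI's image edits endpoint. A minimal sketch (again assuming the v0.x SDK; the mask is a PNG in which the transparent region marks the erased area to regenerate, and the file names are illustrative):

```python
import openai  # v0.x SDK assumed

openai.api_key = "sk-..."

# The mask has the same dimensions as the image; transparent pixels mark
# the erased region that DALL-E 2 should fill in from the prompt.
response = openai.Image.create_edit(
    image=open("tree_house.png", "rb"),
    mask=open("tree_house_mask.png", "rb"),
    prompt="A tree house with a fire burning inside the window",
    n=1,
    size="1024x1024",
)

print(response["data"][0]["url"])  # URL of the edited image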
I also find that using the keyword "3D render" prompts DALL-E 2 to generate high-quality, detailed images.
Prompt: 3D render of a hairy husky sleeping on the grass
Although the intersection of the dog's fur and the grass does not look very real, I am very surprised by how good the fur looks in the image. I wonder whether adding the keyword "3D render" changes anything in the image generation process.
Lastly, I tried an abstract prompt.
Prompt: everlasting, hyperdetailed, time and space distorted by huge gravitational field in the universe
The abstract prompt generations look very similar to what I got from MidJourney and Stable Diffusion. It looks like DALL-E 2 is able to make sense of the abstract prompt and give a good representation of what it means in the resulting image. The three images I got also have different colors and themes.
----------------------------------------------------------------------------------------------------------
Below is my final report:
For my final report, I decided to do 1) an analysis of the HCI of different text2img tools and how their different interfaces shape the way I generate art, and 2) an exploration of the possibilities and potential of combining AI content generation with VR/MR.
Let's start with MidJourney.
The image below shows the general interface of MidJourney:
It is hosted in Discord, and the text prompt is entered via a chat box. Once the prompt is entered, it takes about 30 seconds to generate the images. By default, each generation results in 4 images, and you can easily create variations of an image or upscale it with the click of a button.
MidJourney also provides advanced parameters to customize the image generation. For example, you can specify stylize values, quality values, advanced text weights, etc. These parameters are appended after the text prompt as "--parameter_name value", as shown below.
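For instance, a prompt with parameters might look like the following (--stylize, --quality, and the :: text-weight syntax are real MidJourney flags, though exact value ranges depend on the model version):

```
/imagine prompt: tree house::2 mountain::1 --stylize 1000 --quality 2
```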
MidJourney has a community showcase, where you can find the trending and top images generated by MidJourney. You can see the image, the text prompt, and the user name of each generation. I find this feature particularly useful for beginners to learn different effective text prompts. It also helps me get to know the limits of the software: what is possible and what is not.
Stable Diffusion:
For Stable Diffusion, I mainly use Lexica as the generation engine. The image below shows the general interface of Lexica.
Lexica, unlike MidJourney, is hosted on a web page. People can use it as a web application without the trouble of getting a Discord account. The text prompt is entered in a text box, and there is a negative prompt field for anything you want to exclude from the resulting image.
Stable Diffusion (Lexica) also offers some advanced settings to customize the image generation. Users can change these settings by checking checkboxes or adjusting sliders.
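Lexica's own backend is not public, but its controls map directly onto the open-source Stable Diffusion release. Here is a minimal sketch with Hugging Face's diffusers library showing the negative prompt and two slider-style settings (the checkpoint name and values are just illustrative):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load an open-source Stable Diffusion checkpoint (illustrative choice).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a photorealistic tree house on a mountain",
    negative_prompt="blurry, low quality",  # what Lexica's negative prompt box does
    guidance_scale=7.5,       # the kind of value a slider would control
    num_inference_steps=50,   # another slider-style setting
).images[0]

image.save("tree_house.png")
```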
Users can also search through previously generated images with the search feature.
One feature unique to Stable Diffusion (Lexica) is "Explore this style", which allows you to see similar generations/text prompts by other users as well as their detailed parameters. I find it very useful for improving my own text prompts, as I can always learn something from other people's generations.
Material Stable Diffusion has a very similar interface to Lexica but exposes more parameters you can change.
DALL-E 2:
DALL-E 2 is also hosted as a web application. It has a very simple interface where you enter the text prompt in a text box and generate. However, unlike MidJourney and Stable Diffusion, I have not found a way to change advanced settings or customize the generation process. It also generally takes a little longer to generate the results.
What is unique to DALL-E 2 is that you can edit the image easily. I covered this in my DALL-E 2 exploration. Personally, I find this feature very useful for AI-generated content, as it offers a way to modify the result -- just like real artists can modify their art.
I think all the tools above have good user interfaces. I would imagine a typical 10-year-old child could learn to generate images with them in a short amount of time. And I am very glad that I do not need to "be a programmer" to use them.
Stable Diffusion (Lexica) and DALL-E 2 have similar interfaces: web applications with text-box prompt input and nice UIs for the advanced settings. Personally, I like them better than MidJourney because they separate the text prompt from the other settings; I only need to focus on one thing at a time. In addition, the slider and checkbox UIs are easier to use and more intuitive.
Personally, one big takeaway from using these text-to-image tools is that I have found another way to channel my creativity. I am not good at art -- I never went to art school, and I don't have enough time to sit down and make beautiful drawings either. However, these text-to-image tools offer me a chance to make something that looks good without professional knowledge or skill. All I need to do is think about what I want in my head and enter as detailed a description as possible in text. Imagination is the only limitation.
When I took the 3D printing class MAT238 last quarter, I read that some people hold different opinions about the products of digital fabrication. They consider 3D-printed products "inferior" because the creator usually doesn't have "what it takes" to create the final product; instead, they surrender to the machines and let the machines do all the work for them. I would assume similar arguments exist in the art world too. Soon there will be questions like "should we consider AI-generated content art" or "what does it mean to be a real artist". I am not here to answer those questions. However, as an end user of these tools, I am definitely benefiting, since they empower me to do something I would never have thought of before -- generating fun/good-looking/meaningful images.
---------------------------------------------------------------------------------------------------
In the end, I want to talk about the future of AI-generated content and explore the potential and possibilities of text-to-image in VR/MR settings.
The metaverse has been a very popular concept recently. A metaverse is a network of 3D virtual worlds focused on social connection, such as VR meetings and VR gatherings. With the help of a headset, you can have your own world.
The following link shows a video demo of part of my ongoing master research project:
https://youtu.be/LgRG3vUbqZk
In the project, I implemented an avatar you can have a natural conversation with. At its core, I use GPT-3, a generative AI language model, to generate the responses. I think text-to-image generation may have even bigger potential here. A few applications I can immediately think of: virtual art galleries, decorating your own virtual space, and virtual 3D applications like Monaverse. Generated images would allow for more excitement and imagination in the metaverse.
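For context, the avatar's response generation essentially boils down to one completion call. A minimal sketch with the v0.x openai SDK (the model name and prompt framing here are illustrative, not my exact project code):

```python
import openai  # v0.x SDK assumed

openai.api_key = "sk-..."

def avatar_reply(user_utterance: str) -> str:
    """Generate the avatar's next line from the user's transcribed speech."""
    response = openai.Completion.create(
        model="text-davinci-003",  # illustrative GPT-3 model choice
        prompt=(
            "You are a friendly avatar in a VR world.\n"
            f"User: {user_utterance}\n"
            "Avatar:"
        ),
        max_tokens=100,
        temperature=0.7,
    )
    return response["choices"][0]["text"].strip()

print(avatar_reply("Hi! What can we do in this world?"))
```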
The following image shows the general workflow for interacting with AI-generated content in VR/MR.
1. It starts with a voice command to activate the general interface. I think the most natural form of input in VR/MR is voice, and there are existing speech-to-text tools, such as the Voice SDK in Oculus, for us to use. We can also train a model to understand the voice input and trigger the appropriate functions.
2. Voice input of the text prompt, plus slider adjustments. The prompt can be entered via voice command with speech-to-text. For advanced settings such as text weights and the size of the resulting image, we can implement sliders for the users.
3. Calling the model API. Most text2img tools have existing APIs for us to use. We can wrap the user's input and call the API for the image generation (see the sketch after this list).
4. In the end, once we get the image back from the model API, we can ask the users to do some post-processing and place the image for use.
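To make the workflow concrete, here is a minimal Python sketch of steps 1-4 (the speech-to-text function is a hypothetical placeholder, and the image call uses OpenAI's v0.x images endpoint as just one possible backend):

```python
import openai  # v0.x SDK assumed; any text2img API would work here

openai.api_key = "sk-..."

def transcribe_speech(audio_bytes: bytes) -> str:
    """Hypothetical placeholder: a speech-to-text backend
    (e.g. a Voice-SDK-style service) would fill this role."""
    raise NotImplementedError

def generate_image(prompt: str, size: str = "1024x1024") -> str:
    """Step 3: wrap the user's input and call the model API."""
    response = openai.Image.create(prompt=prompt, n=1, size=size)
    return response["data"][0]["url"]

def vr_generation_flow(audio_bytes: bytes, size: str) -> str:
    # Steps 1-2: voice command -> text prompt (size comes from a slider).
    prompt = transcribe_speech(audio_bytes)
    # Step 3: call the model API.
    url = generate_image(prompt, size)
    # Step 4: hand the image URL back to the VR scene for placement.
    return url
```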
Having worked with VR/AR development in Unity for a while, I believe the general workflow above is very feasible to implement. Obviously, I made a lot of assumptions, and the workflow could differ from platform to platform or application to application. However, it could serve as a foundation for exploring the infinite potential of AI-generated content in VR/MR.
My final presentation slides can be found here:
https://docs.google.com/presentation/d/ ... sp=sharing