Media Arts and Technology

Posted: **Fri Oct 18, 2024 3:34 pm**

Project 5: Course Final Project

The course has concentrated on image through MidJourney and Stable Diffusion but for the final project any AI software can be used, and in any media, meaning image, sound or moving image.

As mentioned in my email, it is important to critically engage at multiple levels with how we are approaching the problematics of AI image creation. We are looking to see how AI works through interacting with the software, and we also want to gain insights in terms of how to gain greater control.

Our prompts represent a point-of-view, expressing a belief system. When skillful negotiators interact, they shift their wording based on how they want to affect the person they are engaged with to arrive at an intended outcome.

For anything meaningful we need to consider how we word the text prompts. The text prompts need to be iteratively reworked, with slight adaptations, modifying meaning until a result is attained that is different from the initial prompt. This requires extensive re-processing. I don't think that anything fewer than 20 tries will have an impact.

If all produced images fulfill the minimal standards of conventional interestingness, there cannot be innovation. The prompts should reconfigure how the image is to be assembled.

It's an open-ended final project, but you should be guided by these two approaches:
1. There should be a thematic or conceptual focus
2. Give an analytic description of the process, how things went, and your analysis of the results

Schedule:
Nov 19, Tues, Project 4 presentations

Nov 21, Thurs, Individual lab, and meetings (make sure to meet with me on either day once you have an idea of what the final project may be)
Nov 26, Research / lab (no formal class)

Nov 28, Thanksgiving

Presentation days:
December 3

December 5

Posted: **Tue Nov 26, 2024 10:35 pm**

This project explores how well Midjourney can identify and replicate prominent works by Black artists. Examining artists throughout time, I chose a famous piece, or at least one featured in an exhibition or article by popular art institutions, and asked Midjourney to recreate it.
Questions:

Given the artist, year and title, how well can AI recreate these technically and artistically challenging works?

Will the popular aesthetics of the time override the artists'?

Will the innate biases of the system be imposed over works from particular decades?

To what degree do I want AI to succeed at replicating the works of Black artists, who are already undervalued and exploited by these systems?

Workflow:

Locate a well-known work by a popular Black visual artist.

Ask Midjourney to recreate the image given the artist, date and title.

Referencing the original image, ask Midjourney to recreate the image given the artist, date and title.

Refresh the browser to reset the session.

No additional adjustment (other than aspect ratio to match the original) of parameters and no additional trials.
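
As a rough illustration of the workflow above, the two prompts for each test could be assembled as follows. This is only a sketch: the `Artwork` class and `build_prompts` helper are hypothetical names, and the "(referencing the original)" prefix simply mirrors the notation used in this post. How the reference image is actually attached in Midjourney is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artwork:
    # Hypothetical container for the three facts given to the model.
    artist: str
    year: str
    title: str


def build_prompts(work: Artwork, quote_title: bool = False) -> dict:
    """Return the text-only prompt and the labelled reference-image prompt."""
    title = f'"{work.title}"' if quote_title else work.title
    text = f"recreate {work.artist}'s {work.year} {title}"
    return {
        "text_only": text,
        # The prefix is a label only; it stands in for attaching the original image.
        "with_reference": f"(referencing the original) {text}",
    }


works = [
    Artwork("Romare Bearden", "1980", "Jazz II"),
    Artwork("Lorna Simpson", "1988", "Stereo Styles"),
]
for work in works:
    for label, prompt in build_prompts(work).items():
        print(label, "->", prompt)
```

Keeping the prompt template fixed like this makes it easier to see that differences between outputs come from the model and the reference image, not from wording changes.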
----------------
Test 1:

Original: Betye Saar’s “Black Girl’s Window,” 1969

Prompt 1: recreate Betye Saar’s 1969 "Black Girl’s Window"

I'm shocked by the elements from the original image that are successfully incorporated here. I feel the "Black girl's window" prompt is vague enough that I should have received a host of dissimilar images. However, all looked similar to this. Worn wood framing the subject, a collage of colorful images towards the top of the image, all transferred over. The general color scheme (blues, yellows and oranges) also transferred. The model was certainly influenced by the original image.

Note: quotations surrounded the phrase Black girl's window, so this text is featured in all of the images

Prompt 2: (referencing the original) Recreate Betye Saar’s 1969 "Black Girl’s Window"

These results produced a girl at the center of the window rather than at the bottom. This is likely because many of the "good" photos and artworks the model was trained on make the subject central to the piece. Additionally, this result was the only one that produced a girl with such a deep skin complexion, and the prominent eyes also carried over. I like this image, although it is nowhere near as interesting or compelling as the original.

----------------
Test 2:

Original: Romare Bearden's "Jazz II," 1980

Prompt 1: recreate Romare Bearden's 1980 Jazz II

Bearden is known for his collage work, the influence of which is shown here. The colors, however, are much brighter and more varied than any of his work. The figures are all silhouettes/shadows. This is a common feature of Black artworks throughout the 20th century.

Prompt 2: (referencing the original) recreate Romare Bearden's 1980 Jazz II

This image introduced the greyscale figures mixed with the drawn and cartoon-like components. However, none of the results captured the number of figures in the image, nor did they fill the space in quite the same way. This is also a more colorful image.
----------------
Test 3:

Original: Jean-Michel Basquiat's "untitled," 1982

Prompt 1: recreate Jean-Michel Basquiat's 1982 untitled

I originally thought that Basquiat's artworks would be too easy since they're so widely recognized. However, this image and the other outputs failed to come even close to the actual painting. It's clear that Basquiat himself was identified, as all of the results featured a realistic young Black man with hairstyles somewhat resembling his. It tried to capture his paintings in the background, yet the result itself is not a painting. Therefore, it clearly knows who Basquiat is and the general aesthetic of his paintings, yet it failed to recreate "untitled."

Prompt 2: (referencing the original) recreate Jean-Michel Basquiat's 1982 untitled

These results look more like a Basquiat style self-portrait, which resembles very few of his works. Again, it is interesting that it chose to represent Basquiat in the image, but it did generally succeed at capturing his aesthetic.
----------------
Test 4:

Original: Lorna Simpson's "Stereo Styles photographs," 1988

Prompt 1: recreate Lorna Simpson's 1988 Stereo Styles

All of the results featured greyscale images and realistic photographs, like the original. It seems like this particular work by Simpson was informing the output. However, each featured only one woman. There's also a focus on her hair, as in the original, although this particular hairstyle is nowhere to be found.

Prompt 2: (referencing the original) recreate Lorna Simpson's 1988 Stereo Styles

This image appears to replicate the same woman and hairstyle three times. I generally find it very interesting, particularly the strip at the center of the photograph. However, the most compelling part of the original image is the varied hairstyles, separated by distinct frames. AI captured none of this.
----------------
Test 5:

Original: Carrie Mae Weems' "From Here I Saw What Happened," 1995-96

Prompt 1: recreate Carrie Mae Weems' 1995-96 From Here I Saw What Happened

The results were all in greyscale, with one in bluescale. Almost all women were profile facing in the images, except for one facing away. These seem to be informed by the actual photograph. However, the original woman appears to be African, which is unclear in this image. I'm also unsure why she is in the clouds with a skyscraper in the background.

Prompt 2: (referencing the original) recreate Carrie Mae Weems' 1995-96 From Here I Saw What Happened

It looks like it tried to make sense of her hair accessory and substituted the piece for more hair. All of these results present the woman by or through a window. The posture in the original is more confident and defiant. Based on the rounded shoulders in all of these results, the phrase, "From here I saw what happened," is being interpreted as a meek statement rather than an accusatory one.
----------------
Test 6:

Original: Mark Bradford's "Johnny the Jaguar (detail)," 2023

Prompt 1: recreate Mark Bradford's 2023 Johnny the Jaguar (detail)

With no additional prompting, all of the results were paintings. Although, this one appears three-dimensional. The colors are also more vibrant than the more muted and red artwork. The original work undoubtedly influenced this.

Prompt 2: (referencing the original) recreate Mark Bradford's 2023 Johnny the Jaguar (detail)

This one improved on the color scheme. However, the jaguars in these images were much larger and more prominent than in the original. The jaguar in the original blends into the painting; it's almost unnoticeable at first glance. Most of the body is indistinguishable from the rest of the paint strokes. This is not the case for the AI replications.

Overall notes:

There was general success in terms of the model recognizing the artists and their popular artistic styles and aesthetics.

All of the artists were recognized as Black, which is evident from the consistent depiction of Black people in all of my results.

There were no distinguishable differences between the results of the various years. The decade seemed to have no direct impact on output.

The question of whether I would like AI to successfully be able to replicate the works of Black artists is an open one. Lack of representation is a prominent issue, but overpopulating the training datasets with Black art without fairly compensating or crediting the artists does not seem like a solid solution. Furthermore, nearly all of the more subtle and intricate details that make the original artworks so compelling were lost in the AI replications. I was secretly pleased by this because it means that, at least for now, AI cannot reproduce artworks to the level humans can. When placed beside each other, the human-made art shines.

Posted: **Mon Dec 02, 2024 6:43 pm**

I’m interested in garbage. Tabloid magazines, reality TV, hyper-processed foods, elevator muzak… the forms of culture which have the widest reach and receive almost no critical inquiry, simply because they are too base, too low, too worthless. I think that the pop-up advertisement and the sponsored news-style spam advertisement are some of the most interesting cultural artefacts we have. They’re the visual language of the libidinal, they speak only in the language of the id. Sex, money, hunger, power, acceptance… each of these ads targets one (or more) of our base desires and strips that desire down to its most naked, shameful, transparent manifestation. In a strange way, I find them beautiful.

These advertisements have been around for a long time, as long as I can remember. Even before the proliferation of machine learning on the internet, these images have held an algorithmic function. They combine text and image to optimize a mass numerical quantity of clicks, molding their content to whatever provides the most engagement. Now, plenty of these advertisements are further abstracted by also using AI to generate their images and captions (though photographs are still in use).

I began this project by, to quote the Allen Ginsberg poem I referenced in an earlier assignment, “shopping for images.” I foraged for screenshots of advertisements, mainly on the DailyMail’s FeMail site and the New York Post’s PageSix site. Charles also sent me a screenshot of an advertisement he’s been seeing a lot on his computer, using AI generated images of women with the caption “Find a Soulmate from Ukraine.”

I then used StableDiffusion to generate similar images, using variations of language like “tabloid website homepage trashy scam spam pop-up advertisment.”

My artistic goal for this project was to create a sort of behemoth of the id. So I opened a blank document in Photoshop and added both my organic advertisements and the generated images I had made. From there, I used Photoshop’s “Generative Expand” feature to turn the individual images into a sort of pattern. I then used that expanded image to create a third image, which (a bit ironically) felt the most visually ‘calm’ as the individual components had been subsumed into the pattern of the whole.

I wanted to introduce some text-based AI into the project, so I input my largest image into MaxAi.me, allowing me to input the same command into four chatbots simultaneously–though the file size was too big for two programs to read, leaving me with responses from Gemini and GPT. I asked both programs to “Please write me copy text to advertise and sell this product.” While Gemini assessed the image as a garbled mess of nonsense and provided copy as such, GPT gave copy for a dating service for Ukrainian women.

I manually imposed the text back onto the images. For one image, I used a more design-y approach and placed the text in a computer-y font inside a Web1.0 style pop-up. For another, I wanted to maintain the nonsense-chaos-noise of the image and simply pasted the text directly over the image.

In many and obvious ways, these images are deeply “digital.” (If they remind me of anything, in a lot of ways they look like old issues of TigerBeat and J-14...) But the internet is not some sort of separate entity from “real life”--it has both tangible (undersea cables, server farms in Ireland, carbon emissions) and intangible (epistemological, rhetorical, ontological) effects on our lives and our desires. As a final little capstone on the class, I used Photoshop to manually insert these images into spaces we would normally see advertising. I hope that these images appear to be both topographical and cellular, both macro and micro, examinations of libidinal desire and digital culture.

Posted: **Tue Dec 03, 2024 1:25 pm**

In this final project I delved into image generation using Stable Diffusion 1.5 and ControlNet models.

Stable Diffusion 1.5 is known for its ability to generate detailed and coherent visuals based on prompts.

ControlNet is a neural network structure that allows us to control diffusion models like Stable Diffusion by incorporating additional inputs, such as sketches, depth maps, or segmentation maps. This enables more precise manipulation of the generated images.

In this project I wanted to see how far I could go in customizing and directing the output of these models to produce images that are not only visually good but also reflective of specific artistic concepts. By using ControlNet, I aimed to overcome some limitations of traditional text-to-image generation, such as lack of control over specific elements in the scene.

OpenPose Model:
Function: Used to control the pose and movement of human figures.
Application: By inputting human skeleton images, we can generate images with specific poses, useful for character design, dance motion capture, and more.

A 3D render of a girl sitting comfortably on a chair in a glass made house, in the art style of The Sims 4, highly detailed, smooth textures, vibrant colors, soft lighting, cheerful atmosphere, digital illustration
Negative prompt: blurry, low quality, deformed, disfigured, extra limbs, text, watermark, out of frame, distorted hands

Steps: 20, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 7, Seed: 2159576166, Size: 710x1000, Model hash: cc6cb27103, Model: v1-5-pruned-emaonly, Denoising strength: 0.75, Token merging ratio: 0.2, ControlNet 0: "Module: openpose_full, Model: control_v11p_sd15_openpose [cab727d4], Weight: 1.0, Resize Mode: Crop and Resize, Processor Res: 512, Threshold A: 0.5, Threshold B: 0.5, Guidance Start: 0.0, Guidance End: 1.0, Pixel Perfect: False, Control Mode: Balanced", Noise multiplier: 1.05, Version: v1.9.3
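
For comparing runs, a settings line like the one above can be turned into a dictionary. The following is a minimal sketch, assuming the line always has the "Key: value, Key: value" shape shown in this post, with ControlNet settings wrapped in double quotes. `parse_settings` is a hypothetical helper, not part of any tool, and the sample line is a shortened copy of the settings above.

```python
import re

# Matches "Key: value" pairs separated by commas; a double-quoted value
# (such as the ControlNet block) is kept whole even though it contains commas.
PAIR = re.compile(r'\s*([^:,]+):\s*("(?:[^"\\]|\\.)*"|[^,]*)(?:,|$)')


def parse_settings(line: str) -> dict:
    """Parse a generation-settings line into a {key: value} dict of strings."""
    settings = {}
    for key, value in PAIR.findall(line):
        settings[key.strip()] = value.strip().strip('"')
    return settings


# Shortened copy of the settings line from this post.
line = (
    'Steps: 20, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 7, '
    'Seed: 2159576166, Size: 710x1000, '
    'ControlNet 0: "Module: openpose_full, Weight: 1.0, Control Mode: Balanced", '
    'Version: v1.9.3'
)

settings = parse_settings(line)
# The quoted ControlNet value has the same shape, so it can be parsed again.
controlnet = parse_settings(settings["ControlNet 0"])

print(settings["CFG scale"], settings["Size"])
print(controlnet["Module"], controlnet["Control Mode"])
```

With two such dictionaries, the differing keys between two generations (for example the seed, module, or control mode) can be listed directly.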

A 3D render of a girl sitting comfortably on a chair in a garden, in the art style of The Sims 4, highly detailed, smooth textures, vibrant colors, soft lighting, cheerful atmosphere, digital illustration

Canny Edge Detection Model:
Function: Captures the outlines and edges of images.
Application: By inputting edge maps, we can control the overall shape and structure of the image, retaining the details of the original sketch.

A 3D render of a girl sitting comfortably on a chair in a garden, in the art style of The Sims 4, highly detailed, smooth textures, vibrant colors, soft lighting, cheerful atmosphere, digital illustration
Negative prompt: blurry, low quality, deformed, disfigured, extra limbs, text, watermark, out of frame, distorted hands

Steps: 20, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 7, Seed: 3231923029, Size: 710x1000, Model hash: cc6cb27103, Model: v1-5-pruned-emaonly, Denoising strength: 0.75, Token merging ratio: 0.2, ControlNet 0: "Module: canny, Model: control_v11p_sd15_canny [d14c016b], Weight: 1.0, Resize Mode: Crop and Resize, Processor Res: 512, Threshold A: 100.0, Threshold B: 200.0, Guidance Start: 0.0, Guidance End: 1.0, Pixel Perfect: False, Control Mode: My prompt is more important", Noise multiplier: 1.05, Version: v1.9.3
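
The Threshold A (100) and Threshold B (200) values in the Canny settings above most likely act as the low and high thresholds of the edge detector; that mapping is my assumption. The sketch below shows only the double-threshold (hysteresis) step on a tiny grid of made-up gradient magnitudes. A full Canny detector also smooths the image, computes gradients, and thins edges before this step, and the function name and 8-neighbour connectivity are illustrative choices.

```python
from collections import deque


def hysteresis(magnitude, low=100.0, high=200.0):
    """Keep strong pixels (>= high) plus weak pixels (>= low) connected to them."""
    rows, cols = len(magnitude), len(magnitude[0])
    edges = [[False] * cols for _ in range(rows)]
    queue = deque()

    # Strong pixels are always edges and seed the search.
    for r in range(rows):
        for c in range(cols):
            if magnitude[r][c] >= high:
                edges[r][c] = True
                queue.append((r, c))

    # Grow outwards: a weak pixel survives only if it touches an accepted edge.
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (0 <= nr < rows and 0 <= nc < cols
                        and not edges[nr][nc]
                        and magnitude[nr][nc] >= low):
                    edges[nr][nc] = True
                    queue.append((nr, nc))
    return edges


# Made-up gradient magnitudes, not taken from any image in this post.
grid = [
    [250, 150, 50, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 150],
]
result = hysteresis(grid)
for row in result:
    print(["#" if edge else "." for edge in row])
```

Here the 150 next to the strong 250 pixel is kept, while the isolated 150 in the corner is discarded. Lowering both thresholds would keep more faint outlines in the edge map, which in turn constrains the generation more tightly.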

Different Control Mode:

CFG scale: 12

A 3D rendering of a Dunhuang mural with deep depth, volumetric lighting, realistic shadows, highly detailed textures, cinematic atmosphere, rendered in Unreal Engine, ultra high definition
Negative prompt: low quality, blurry, deformed, watermark, text

Steps: 20, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 7, Seed: 2609786599, Size: 960x473, Model hash: cc6cb27103, Model: v1-5-pruned-emaonly, Denoising strength: 0.75, Token merging ratio: 0.2, ControlNet 0: "Module: canny, Model: control_v11p_sd15_canny [d14c016b], Weight: 1.0, Resize Mode: Crop and Resize, Processor Res: 512, Threshold A: 100.0, Threshold B: 200.0, Guidance Start: 0.0, Guidance End: 1.0, Pixel Perfect: False, Control Mode: Balanced", Noise multiplier: 1.05, Version: v1.9.3
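
The CFG scale values in these settings (7 here, 12 in the comparison above) weight how strongly the prompt pulls the result away from the unconditioned prediction. A minimal sketch of the standard classifier-free guidance formula follows, using short lists of made-up numbers in place of real noise predictions. In a real pipeline these would be large tensors, and I am assuming the negative prompt takes the place of the unconditioned branch.

```python
def apply_cfg(uncond, cond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), element-wise."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Made-up stand-ins for the two noise predictions at one sampling step.
uncond = [0.0, 1.0]   # prediction without the prompt (or with the negative prompt)
cond = [1.0, 3.0]     # prediction with the prompt

for scale in (1.0, 7.0, 12.0):
    print(scale, apply_cfg(uncond, cond, scale))
```

At scale 1 the result is just the prompted prediction; at 7 and 12 the difference between the two predictions is amplified, which is why higher values tend to follow the prompt more literally but can look harsher.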

Generation without controlnet:

Depth Model:
Function: Adds depth and spatial perception to images.
Application: By inputting depth maps, we can transform flat images into images with three-dimensional effects, enhancing the dimensionality of the scene.

Original img:

Depth img:

Generation:
A mystical fairytale forest filled with glowing mushrooms and floating fireflies, tall ancient trees with twisted branches, mist rolling over the forest floor, soft ethereal lighting, enchanted atmosphere, ultra-detailed, high-resolution, fantasy art, volumetric lighting, 8K, Unreal Engine render

Steps: 20, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 12, Seed: 2759343917, Size: 1080x696, Model hash: cc6cb27103, Model: v1-5-pruned-emaonly, Token merging ratio: 0.2, ControlNet 0: "Module: depth_midas, Model: control_v11f1p_sd15_depth [cfd03158], Weight: 1.0, Resize Mode: Crop and Resize, Processor Res: 512, Threshold A: 0.5, Threshold B: 0.5, Guidance Start: 0.0, Guidance End: 1.0, Pixel Perfect: False, Control Mode: Balanced", Version: v1.9.3

A mystical fairytale forest filled with glowing mushrooms and floating fireflies, tall ancient trees with twisted branches, mist rolling over the forest floor, soft ethereal lighting, enchanted atmosphere, vibrant and dreamy rainbow colors, colorful flora and plants, magical light effects, ultra-detailed, high-resolution, fantasy art, volumetric lighting, 8K, Unreal Engine render
Negative prompt: low quality, blurry, deformed, disfigured, noisy, overexposed, underexposed, dull colors, mediocre color grading, muted tones, flat lighting, grayscale, monochrome, dark atmosphere, watermark, text, people, animals

Steps: 20, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 12, Seed: 1265101828, Size: 1080x696, Model hash: cc6cb27103, Model: v1-5-pruned-emaonly, Token merging ratio: 0.2, ControlNet 0: "Module: depth_midas, Model: control_v11f1p_sd15_depth [cfd03158], Weight: 1.0, Resize Mode: Crop and Resize, Processor Res: 512, Threshold A: 0.5, Threshold B: 0.5, Guidance Start: 0.0, Guidance End: 1.0, Pixel Perfect: False, Control Mode: Balanced", Version: v1.9.3

Comparison between ControlNet Canny model and Depth model:

Segmentation Model:
Function: Performs semantic segmentation, allowing different content or styles to be applied to different regions.
Application: Enables replacing or applying different artistic styles to regions like the sky, ground, buildings within an image, achieving scene replacement and style mixing.

Original img:

Segmentation img:

Generation prompts and settings:
A breathtaking fantasy landscape with a vibrant sunset sky filled with swirling nebula clouds and sparkling stars, the ground covered in lush, luminescent flora with hues of purple and blue, towering crystal formations replacing buildings, majestic floating islands in the background, rendered in a dreamy and colorful 3D style, ultra-detailed, high resolution, soft lighting, cinematic atmosphere, Octane Render, 8K
Negative prompt: low quality, blurry, dull colors, flat lighting, noise, artifacts, watermark, text, overexposed, underexposed

Steps: 20, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 12, Seed: 1265101828, Size: 696x1080, Model hash: cc6cb27103, Model: v1-5-pruned-emaonly, Token merging ratio: 0.2, ControlNet 0: "Module: seg_ofade20k, Model: control_v11p_sd15_seg [e1f51eb9], Weight: 1.0, Resize Mode: Crop and Resize, Processor Res: 512, Threshold A: 0.5, Threshold B: 0.5, Guidance Start: 0.0, Guidance End: 1.0, Pixel Perfect: False, Control Mode: Balanced", Version: v1.9.3

Generated img:

Posted: **Tue Dec 03, 2024 1:33 pm**

Continuing my previous post

Segmentation Model:

Original img:

Segmentation img:

Generation prompts:
A breathtaking fantasy landscape with a vibrant sunset sky filled with swirling nebula clouds and sparkling stars, a shimmering river with crystal-clear water flowing where the road used to be, the river reflects the colorful sky above, the ground covered in lush, luminescent flora, towering crystal formations replacing buildings, majestic floating islands in the background, rendered in a dreamy and rainbow color3D style, ultra-detailed, high resolution, soft lighting, cinematic atmosphere, volumetric lighting, Octane Render, 8K
Negative prompt: roads, vehicles, low quality, blurry, dull colors, flat lighting, noise, artifacts, watermark, text, overexposed, underexposed

Steps: 20, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 12, Seed: 1265101828, Size: 1080x696, Model hash: cc6cb27103, Model: v1-5-pruned-emaonly, Token merging ratio: 0.2, ControlNet 0: "Module: seg_ofade20k, Model: control_v11p_sd15_seg [e1f51eb9], Weight: 1.0, Resize Mode: Crop and Resize, Processor Res: 512, Threshold A: 0.5, Threshold B: 0.5, Guidance Start: 0.0, Guidance End: 1.0, Pixel Perfect: False, Control Mode: Balanced", Version: v1.9.3

Generated img:

Work UI:

Generated img:

Comparison between Depth model and Segmentation:

MultiControlNet

Prompt:
An enchanting fantasy realm where skyscrapers are transformed into gleaming crystal structures reflecting all the colors of the rainbow, magnificent rainbows stretch across a sky ablaze with vibrant hues, roads have become tranquil, shimmering water surfaces mirroring the kaleidoscopic sky, traffic replaced by ethereal floating plankton emitting a soft, multicolored glow, a girl standing gracefully on a giant, luminous lotus floating atop the iridescent water, surrounded by gentle ripples illuminated by prismatic light, warm and colorful ethereal lighting bathes the scene, enchanted atmosphere, vibrant and dreamy rainbow colors throughout, magical and radiant light effects, ultra-detailed, high-resolution, fantasy art, volumetric lighting, 8K, Unreal Engine render

Negative prompt: low quality, blurry, deformed, disfigured, noisy, overexposed, underexposed, muted colors, dull colors, grayscale, monochrome, dark atmosphere, watermark, text, traffic, animals

Steps: 20, Sampler: DPM++ 2M, Schedule type: Karras, CFG scale: 15, Seed: 1519522183, Size: 696x1080, Model hash: cc6cb27103, Model: v1-5-pruned-emaonly, Token merging ratio: 0.2, ControlNet 0: "Module: seg_ofade20k, Model: control_v11p_sd15_seg [e1f51eb9], Weight: 1.0, Resize Mode: Crop and Resize, Processor Res: 512, Threshold A: 0.5, Threshold B: 0.5, Guidance Start: 0.0, Guidance End: 1.0, Pixel Perfect: False, Control Mode: Balanced", ControlNet 1: "Module: openpose_full, Model: control_v11p_sd15_openpose [cab727d4], Weight: 1.0, Resize Mode: Crop and Resize, Processor Res: 512, Threshold A: 0.5, Threshold B: 0.5, Guidance Start: 0.0, Guidance End: 1.0, Pixel Perfect: False, Control Mode: Balanced", Version: v1.9.3
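
With two ControlNet units active (segmentation and OpenPose), my understanding is that each unit contributes a weighted correction that is added to the model's internal features, and only during the part of sampling between its Guidance Start and Guidance End. The sketch below is a simplified illustration under that assumption. The function names and the short lists standing in for feature residuals are hypothetical, and the exact way the web UI measures sampling progress may differ.

```python
def is_active(step, total_steps, start, end):
    """True if this step falls inside the unit's guidance window (as fractions)."""
    progress = step / total_steps
    return start <= progress <= end


def combine_residuals(units, step, total_steps):
    """Sum the weighted residuals of all units that are active at this step."""
    total = [0.0] * len(units[0]["residual"])
    for unit in units:
        if is_active(step, total_steps, unit["start"], unit["end"]):
            for i, value in enumerate(unit["residual"]):
                total[i] += unit["weight"] * value
    return total


# Made-up residuals; weights and guidance windows follow the settings above.
units = [
    {"name": "seg", "weight": 1.0, "start": 0.0, "end": 1.0, "residual": [0.5, -0.25]},
    {"name": "openpose", "weight": 1.0, "start": 0.0, "end": 1.0, "residual": [0.25, 0.25]},
]
both = combine_residuals(units, step=15, total_steps=20)

# Hypothetical variation: release the pose constraint halfway through sampling.
units[1]["end"] = 0.5
seg_only = combine_residuals(units, step=15, total_steps=20)

print(both, seg_only)
```

Lowering one unit's weight or shortening its window is one way to let the segmentation layout dominate while the pose only guides the early, coarse steps.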

Result:

Posted: **Tue Dec 03, 2024 8:15 pm**

Project 5 - Yuehao Gao
Scrutinizing AI's Understanding of Chinese and American Cultures through Generated Pictures and Music
MAT255, Fall 2024

For this course's final project, I will scrutinize generative AI models' understanding of the relationship between Chinese and American cultures. Specifically, the models will be asked to imagine pictures and music that portray these two cultures in individual cultural events and personal scenarios, as well as the relationship between the two cultural groups in three situations: harmonious exchange, a tense atmosphere, and both cultures uniting to fight for a larger goal. All the detailed scenarios, elements, and features of the generated pictures and music will be decided by the AI models. The resulting pictures and music will be examined to see if they accurately capture the core of the cultures, convey the atmosphere of the scene, or bring in any unexpected elements or features.

There will be three models utilized for this project:
(1). Chat GPT for imagining the prompts for generating the pictures and music
(2). Midjourney for the pictures
(3). SUNO AI for the music pieces

Specifically, the project starts with Chat GPT, which I asked to generate the picture and music prompts for each scenario separately. I entered: "Please imagine the prompts to AI pictures and music for the following scenes for Chinese and American cultures: during individual celebrations, individual cultural events, collaborative culture when having friendly relationships, cultures when having tense relationships, and collaborative cultures when collaborating to defend the intrude of aliens." After Chat GPT output the prompts for the pictures and music, I fed them separately into Midjourney and SUNO AI. If the generators' results are acceptable, they are collected and analyzed. However, if the results diverge significantly from what was expected, especially because key words are missing from the prompt, a revised prompt is entered to generate new results. The following is a workflow diagram:
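
As a rough sketch of the loop described above (generate, check, revise the prompt if needed), the process could be written like this. All function names are hypothetical stand-ins: in the actual project the generation happens in the Midjourney and SUNO AI interfaces and the acceptance check is my own judgment. The `max_rounds` limit is also an assumption added for the sketch.

```python
def run_scenario(prompt, generate, is_acceptable, revise, max_rounds=3):
    """Generate from a prompt, revising it until the result is acceptable."""
    history = []
    for _ in range(max_rounds):
        result = generate(prompt)
        history.append((prompt, result))
        if is_acceptable(result):
            break
        prompt = revise(prompt)
    return history


# Toy stand-ins that only simulate a generator and a human check.
def fake_music_generator(prompt):
    return "guzheng piece" if "Chinese" in prompt else "piano piece"


def sounds_right(result):
    return "guzheng" in result


def add_keyword(prompt):
    return "Chinese " + prompt


history = run_scenario(
    "Gentle traditional Guzheng music with soft flute melodies",
    fake_music_generator,
    sounds_right,
    add_keyword,
)
for used_prompt, result in history:
    print(result, "<-", used_prompt)
```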

That is to say, Chat GPT is the main model being tested, for imagining "what can be happening" in those 5 scenarios within Chinese culture, American culture, and the merged cultures respectively, while Midjourney and SUNO are tested more on their ability to illustrate the detailed visual or musical cultural elements. The following are screenshots of the user interfaces of Midjourney and SUNO AI:

Since music files cannot be attached to this forum, they will only be analyzed in text here and played during the presentation.

--- --- --- Scenario 1: Individual Celebrations --- --- ---
Prompt by Chat GPT:
Picture for Chinese Context: A one-year-old baby sitting on a red silk mat surrounded by symbolic items like a calligraphy brush, coins, and a jade bracelet, with family members wearing traditional Chinese attire watching in anticipation.
Music for Chinese Context: Gentle traditional Guzheng music with soft flute melodies, evoking a sense of heritage and warmth.
Picture for American Context: A vibrant Sweet Sixteen party with teenagers dancing under fairy lights, a towering pink cake, and balloons spelling '16' in the background.
Music for American Context: Upbeat pop music with energetic rhythms and a prominent electronic beat, reflecting youthful celebration.

Generally, it goes without saying that birthdays are important individual celebrations for families and friends within both Chinese and American cultures. GPT has expressed the vibe and atmosphere in a clear, precise, and vivid way. The following is how Midjourney imagined the scenarios based on these texts:

The overall vibe accurately conveys the joy of a baby's birthday in Chinese culture, with family members gathering cheerfully around, and of a 16-year-old girl's birthday in American culture, where she gathers her friends in a much livelier atmosphere. The pictures also accurately illustrate a cultural difference: most Chinese celebrations prioritize family over friends, while American youth culture may differ.

However, two pictures are particularly interesting: the second picture for "Chinese cultures" has a line of characters that looks like a "watermark". It appears that the samples Midjourney learned from, especially images of "Chinese babies", carried watermarks. At the same time, the first picture for "American cultures" has "116" instead of "16" on the cake, which further indicates that the picture model currently lacks mechanisms for rendering text precisely.

When SUNO AI generated the music for "Chinese individual celebrations", it was disappointing to find that both variations consisted only of light piano music: although the C major key and the soft texture of the notes do create a warm, family-like atmosphere, the music is not played on traditional instruments like the Guzheng as specified. Therefore, I manually modified the prompt by adding "Chinese" at the front, and the new variations illustrated the scene more precisely, with soft, warm, slow-tempo music that creates an even stronger feeling of "being born in an Eastern family".

While the two versions of "Neon Lights" start with soft, light piano melodies only, they gradually build to tense, dance-inducing, and exciting tones as they reach the chorus. In particular, SUNO AI adds strong dance beats as well as electronic MIDI notes to the music, boosting the party vibe of an American teenager's birthday celebration.

--- --- --- Scenario 2: Individual Cultural Contexts --- --- ---
Prompt by Chat GPT:
Picture for Chinese Context: A serene Chinese scholar in a silk robe practicing calligraphy by a lotus pond, with a tea set steaming gently beside them.
Music for Chinese Context: Minimalist erhu and pipa music with flowing, meditative tones, emphasizing focus and tradition.
Picture for American Context: A cowboy tying his boots and adjusting his hat near a barnyard, with a golden sunset casting long shadows over the open plains.
Music for American Context: Country music with acoustic guitars and harmonicas, evoking a sense of rugged individualism and tradition.

For both cultures, Chat GPT decided to create a figure that demonstrates solitude. While it placed much more emphasis on its understanding of Chinese culture as "poetic", with the characters dressed in ancient costumes, it understood the American version of "solitude" as the lonely cowboy. Both scenes are set in nature: the Chinese one involves mountains and bamboo forests, while the American one involves large farmland and deserts.

This time, Midjourney precisely captures the scene for both cultures, as evidenced by a high congruence for all four variations of pictures for both cultures:

SUNO AI also performs better in this scenario. To illustrate the scene of the Chinese poet being alone, it composed a piece titled "Quiet Heart" that consists of slow, quiet, and meditative melodies played on the Guzheng (although the MIDI rendering made it sound more like a harp). The entire piece is in G major, with most notes fitting the Chinese pentatonic scale, which matches the acoustic features of a real Guzheng. Still, the sudden key change in the middle of the piece would not be achievable on a real Guzheng, which is usually tuned to a single key for a whole piece.
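
To make the pentatonic observation concrete, here is a small sketch that derives the five notes for a piece in G major. I am assuming that the "Chinese pentatonic scale" here means the major pentatonic (degrees 1, 2, 3, 5 and 6 of the major scale); the helper names and the example melody are made up for illustration and are not transcribed from the SUNO piece.

```python
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MAJOR_STEPS = [2, 2, 1, 2, 2, 2, 1]  # whole and half steps of a major scale


def major_scale(root):
    """Return the seven note names of the major scale starting on root."""
    index = NOTES.index(root)
    scale = [root]
    for step in MAJOR_STEPS[:-1]:
        index = (index + step) % 12
        scale.append(NOTES[index])
    return scale


def major_pentatonic(root):
    """Keep degrees 1, 2, 3, 5 and 6, dropping the 4th and 7th."""
    scale = major_scale(root)
    return [scale[i] for i in (0, 1, 2, 4, 5)]


def notes_outside(melody, root):
    """List the melody notes that do not belong to the pentatonic scale."""
    allowed = set(major_pentatonic(root))
    return [note for note in melody if note not in allowed]


g_pentatonic = major_pentatonic("G")
# Made-up melody fragment; C and F# are the two notes G pentatonic leaves out.
outside = notes_outside(["G", "A", "B", "C", "D", "E", "F#"], "G")
print(g_pentatonic, outside)
```

A melody that stays inside these five notes can be played on strings tuned to one pentatonic key, while a mid-piece key change introduces notes outside the set.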

It is even more interesting to examine how SUNO composed the country music, with lyrics, for the cowboy scene. The use of major keys, drum sets, electric guitars, and bass guitars gives a generally uplifting and happy vibe. Although the lyrics do express the theme of a "lonely road home", the harmonic progressions create a significant contrast with it. It may be that American country music generally tends to pull people out of depressive moods and boost hope, which is a "hidden cultural context" that I did not expect.

--- --- --- Scenario 3: Collaborative Cultures in Friendly Relationships --- --- ---
Prompt by Chat GPT:
Picture for Chinese-American Fusion Context: A lively dinner table with dumplings, barbecue ribs, fortune cookies, and apple pie, surrounded by people laughing and chatting in a cozy dining room.
Music for Chinese-American Fusion Context: A fusion of traditional Chinese instruments and American jazz, creating a harmonious blend of melodies.

Let's not worry about whether fortune cookies are traditional Chinese food or not. What is obvious is that the prompt lacks any description of the ethnicity of the people. Therefore, I refined the image prompt by changing "people" to the more specific "Chinese and American people".

Even with the "--style raw" parameter, which asks the model to generate pictures in a real-life style, all the pictures that show at least some sense of "Chinese people sitting around with Americans" are drawn in cartoon styles. The only picture that is actually in a real-life style consists solely of Eastern figures, rather than a mix of people from different cultural groups. This strongly suggests that Midjourney lacks data for "joint celebration between Chinese and Americans" in its dataset. It is also possible that relevant pictures do exist in Midjourney's training data, but that labels like "joint", "fusion", "having fun together" and "cultural communication" were not attached to them, hindering the model from merging the data from the two sets into one.

The first iteration from SUNO AI for this scene is also somewhat limited. The music begins with traditional pentatonic Guzheng notes only, then gradually moves into a pop-styled English song with very little sense of jazz, in particular lacking semitones and eighth-note bass lines. Hence, I gave SUNO AI a new prompt: "A jazz music played by traditional Chinese instruments." This time, I was amazed by SUNO's combinational creativity. The piece "Shanghai Serenade" has a jazz feel in the background, while the main melodies are Guzheng and Dizi notes, mostly in pentatonic scales. This combination does create a happy and lively vibe for scenes of collaborative cultural communication.

--- --- --- Scenario 4: Cultures in Tense Relationships --- --- ---
Prompt by Chat GPT:
Picture for Chinese Context: Two Chinese generals in ornate armor standing on opposite sides of a negotiation table in a candle-lit war tent, with maps and calligraphy scrolls spread out.
Music for Chinese Context: Tense guzheng and taiko drum patterns, creating a sense of conflict and anticipation.
Picture for American Context: A modern corporate boardroom with sharp-suited executives locked in a heated debate, the skyline of a bustling city visible through the window.
Music for American Context: Low, pulsing electronic tones with sparse percussion, building a tense and serious atmosphere.

Tense relationships between two countries are sometimes inevitable when their interests conflict. Since AI models may decline to generate scenes of actual conflict, as they are discouraged from promoting content that may undermine harmony, it is reasonable to instead imagine scenes where individual political or military leaders meet to prepare for sudden, unexpected events. The following is how Midjourney illustrated these scenes:

The funniest thing is that all four iterations of "Chinese generals" show generals from hundreds of years ago. Although Google Images displays pictures of modern Chinese military generals, it is obvious that Midjourney is heavily "brainwashed" by television shows that depict ancient Chinese dynasties. While Chat GPT may have meant "a tent in a modern military field", Midjourney understood it as "a tent in an ancient battlefield", resulting in the antique-themed pictures on the left.

For the American scenario, all pictures are set in a high-tech glass room in a tall building. Common sense tells me that military meetings are usually held in secure rooms with no windows at all, probably in the Pentagon, so this also suggests that Midjourney has gathered a lot of data samples from movies and television shows: these works tend to depict American military and technology meetings in tall glass buildings, and have "brainwashed" Midjourney as well. One exception is the fourth variation, where no human characters are present anywhere on the canvas.

SUNO AI, meanwhile, successfully creates a tense atmosphere in the music for the Chinese scenario, but not for the American one. For the Chinese illustration, it creates a fast-tempo Guzheng melody in D minor whose intervals shrink little by little, giving the sense that a storm or danger is slowly approaching and hiding somewhere nearby. The fast percussion in the background reinforces this sense. For the American scenario, however, the tension is missing: the piano and electronic MIDI notes do convey that the people discussing military and political matters are elites, but the melodies gradually rise, and the "i-VI-III-V" chord progression conveys a sense of strength. This loop is better known for expressing strength in positive moods and hope, and is not suitable for this cultural scenario.

--- --- --- Scenario 5: Collaborative Cultures to Defend Against Alien Intrusion --- --- ---
Prompt by Chat GPT:
Picture Prompt: Chinese scientists in futuristic lab coats working with American NASA mission control scientists, gathered around a central screen showing an alien spacecraft approaching Earth.
Music Prompt: Majestic orchestral themes with brass, strings and Chinese instruments, underscored by electronic synthesizers for a sci-fi edge.

It is quite reasonable for GPT, or any human being, to imagine scientists from all countries gathering to study how to defend against an alien intrusion. This is how Midjourney imagined such a scene:

Similar to the problem in Scenario 3, the pictures lack "fusion" between Chinese and American characters. This could again be due to a lack of data, or a lack of functional labels that would drive the model to link Chinese scientists and American scientists in a collaborative scenario, working toward a shared goal.

SUNO AI faces a similar challenge when illustrating this scenario. It successfully incorporated electronic MIDI notes, arranged in fast descending or ascending F-minor triads, with a strong drum loop joining at the chorus, which boosts the sense of high tech and tension at the same time. However, it completely ignored Chinese musical elements such as Guzheng and pentatonic notes. While it is indeed hard to combine all these musical elements in one piece, SUNO did succeed at this in Scenario 3. One possible reason is that key cultural words like "Chinese" need to appear at the very front of the prompt, so that the NLP algorithm gives them more weight when creating the music.

------- Conclusion --------
In general, all the AI models performed well in imagining Chinese and American cultural scenarios. While Chat GPT performed well in understanding the basic cultures, it sometimes missed crucial words that highlight the significance of cultural elements, which led to inaccuracies in Midjourney and SUNO. Midjourney is strong at illustrating what it is given, but its imagination, based on data from movies and TV shows, together with its lack of "combinational creativity", can limit its performance on cross-cultural content. Finally, while SUNO AI performs well in most scenarios, it is highly sensitive to the order of the key words and to the number of elements in its input prompt. Still, despite these minor flaws, the models were highly helpful for generating the pictures and music pieces for this project, and they show astounding progress in deep-learning algorithms. I look forward to seeing their future development, and how they may boost positive inter-cultural understanding in the coming years.

Posted: **Wed Dec 04, 2024 6:12 pm**

For my final project, I want to explore the environmental story-telling capabilities and tendencies of the fine-tuned model I made. I will analyze how the model creates environmental context by studying background details and world-building elements. I will focus on how it documents atmosphere and mood, as well as note any recurring environmental patterns.

I'm interested in sketching settings for video game scenes I can use for sketch ups in my other work. I am drawn to the early video game aesthetic.

Through analyzing a series of generated images across diverse environmental scenarios, several key patterns emerge in the model's approach to world-building. The images consistently feature diffused, misty lighting and demonstrate strong capabilities in depicting atmospheric conditions and environmental boundaries. While the model excels at creating mood through lighting and texture, it sometimes struggles with scale interpretation and tends to produce scenes reminiscent of early 3D video game art. The generated environments often include unspecified elements not mentioned in prompts, such as background mountains or additional structures, and generally favor smaller-scale scenes unless explicitly directed otherwise. Notable patterns include a preference for temperate climate aesthetics, consistent shadow work, and a somewhat muted color palette.

`A forgotten lighthouse on a misty morning, waves crashing against rocky shores a TOK emoticon`

This looks like a typical lighthouse complete with a path leading down to the shore. It actually does look a little beaten and not shiny and futuristic. It has a bit of the simplified texture feel of early 3D art. It also does successfully convey a misty morning with the sun close to the horizon without the colors a sunset would have. It does successfully focus the sun as a light source through the clouds. The waves do look violent against the shore and there is even a sort of motion effect that you would get from long exposure photography. The lighthouse actually looks a bit too short to be accurate. There is no wildlife depicted, probably because it wasn't in the prompt.

`A forgotten lighthouse during a heat wave, calm waters below rocky shores a TOK emoticon`

This one actually has a great reflection and some cool retro rock texture. The bands of the lighthouse are even mostly consistent in the reflection.
It actually looks too calm for a normal sea but given the prompt I would say it did well. It captures the atmospheric conditions well in the photo. Even though it says heat wave, the setting didn't really change from a more temperate, wet climate -- although this location had pine trees and looks less hostile.

`An ancient temple being slowly reclaimed by jungle vines, morning light filtering through gaps a TOK emoticon`

This image captures the morning light filtering through the gaps really well, although it displays more of the copyright artifacts along the edges. It seems like a very generic video game temple in a jungle, and includes moss as well as a sort of cross between Mesoamerican staircase features on temples and the sort of angular stupas of Southern Asia. There seems to be flora similar to ferns in the foreground as well as vines wrapping around the trunk or temple pillar in the left foreground.

`A modern glass skyscraper weathering a sandstorm, desert plants taking root on its ledges a TOK emoticon`

This one successfully depicts not only a desert but also a sandstorm -- the horizon is obscured and it handles the atmospheric fog effect very well with the lighting. However, it is interesting that there are two buildings standing alone without any other remnants of a city -- such as human presence, paths, roads or any other navigation elements. It also has plants in the foreground but not taking root in the ledges of the skyscrapers. Also, the prompt specifies only one skyscraper but there are two.

`A river valley where grassland transitions into pine forest, wildlife gathering at the boundary a TOK emoticon`

This image handles the biome transitions and environmental boundaries of the prompt very well -- it even features different types of wildlife in the image from above. Note the animals in the water, which looks shallow enough to support standing. There are also a variety of colors of the animals -- brown which looks like bears or horses, and white which looks like sheep or cattle. There is variation in the river which suggests the reflection of the sky. There is also mist coming from between the mountains in the valley, which lends a sense of mood that is similar to the first lighthouse picture. There are also accurate and consistent shadows being cast, both of the hills and of the pine trees and even some of the animals, all pivoting towards the left of the photo.

`A volcanic hot spring meeting a snow-covered field, steam rising where they meet a TOK emoticon`

This picture is impressive with the way it accurately handles the environmental boundary specification of "steam rising where they meet." Interestingly, a mountain is added in the background despite there not being one in the prompt. This one also has interesting, soft, diffused light and consistent shadows as well. Notice the shadows on the smoke, the mountain, and the rocks in the foreground. One thing that is missing is that I would expect some more brilliant color like the pools in Yellowstone but so far I have found that the fine-tuned model does not produce a lot of color.

`An overgrown garden path leading to an abandoned greenhouse, gardening tools scattered nearby a TOK emoticon`

This is one of my favorite pictures because it really reminds me of an early video game, like a Barbie Adventures video game where this is a still and you have to look in the foreground for clues to a mystery you are trying to solve as well as finding missing objects. However, the objects in the image are pretty unusual. There is a guitar, a basketball, and some kind of hammer (which could also be a watering can). There is also some sort of pipe and bucket, and another structure to the right. I like that the image has a glowy, misty, diffused light look that many of them have. This one even features a stone path leading up to the greenhouse. Without the basketball, this could have been a bit more timeless, but the basketball makes it seem like it is a contemporary backyard. Also, it seems like the model took "greenhouse" literally and made a house that is green instead of a glass paned greenhouse.

`A remote mountain cave with ancient cave paintings, recent camping equipment visible a TOK emoticon`

This one is another favorite image of mine -- I think it layers the historical and contemporary human presence really well. The tent looks sort of accurate, like an explorer's burlap tent, and the gear in the front looks contemporary although it is a bit hard to tell what everything is. The cave paintings are also very funny, and feature a rust colored pigment. However, it isn't really clear what cultural style the cave paintings are going for. If anything, the iconography within the paintings is not ancient but contemporary; the largest one looks like a wood-burning stove you could buy today.

`A giant ancient gear half-buried in a meadow, small houses built into its crevices a TOK emoticon`

By this point I've started to notice how much these photos look like early video games. The background looks like it could be from a medieval game or a Civilization style game. In this image, you can see how the model struggled to understand semantically what I meant. I would have expected a gear the size of a mountain with houses dotted on it, not a sort of house emerging from a house-sized gear, complete with a cross on top. Scale doesn't seem to be vast in any of these images by default, despite semantic implication of a gear that can fit houses on it. This suggests to me that the model produces smaller worlds unless specified otherwise. It generally has a difficult time establishing scale.

`A floating market where boats navigate between the roots of massive bioluminescent trees a TOK emoticon`

For some reason, scale was a lot easier to generate here, maybe because the tree is large enough to fit a market underneath it. I feel like there are more depictions of gigantic world-trees in fantasy illustrations than there are giant gears in hills, at least that's what I've seen anecdotally. The model doesn't seem to make a decision about what time of day it is: the boat markets default to a night-market setting with candles lit everywhere, while the sky seems to have the same diffused-light, daytime mist look. However, the planet in the sky suggests a moon, although it could be a completely different planet altogether.

Posted: **Thu Dec 05, 2024 2:41 pm**

For my final project, I wanted to work on a time-based work that draws on my filmmaking background while incorporating learnings from my experiments with generative imaging tools this quarter. For an earlier project, I had tried using the platform Runway, but my efforts with it this time around were frustrated by the pricing/licensing of the product. Apparently I had exhausted all my one-time "free" credits during that project and could no longer use the platform without purchasing additional credits. I did some research on other generative AI platforms for which trials were available and, after testing out half a dozen products, finally settled on Stable Video (a video platform from the Stable Diffusion folks) and Kling.ai, which another student had recommended during their presentation. Finally, during editing I also added a soundtrack to the video, featuring various Indian percussion instruments that I downloaded from an online royalty-free music database.

Being a filmmaker and trained cinematographer, I decided to make a short compilation film based on a singular theme and character. Although I did not care for a narrative or plot line as such, the idea was to capture a somewhat mystical space like an enchanted forest that is often featured in mythical tales about Indian sadhus or ascetics who are (in)famous for leading solitary lives away from civilized society often indulging in occult practices and rituals. The film would essentially be an edited sequence of multiple different shots all centered around this core concept, emphasizing the cinematic elements of visual storytelling (shot types, framing, camera techniques, etc.). I started with a rough draft of shot descriptions with details of composition, camera angles, look and texture of image, etc. and kept refining them and playing around with the text prompts.

I was especially impressed with the cinematic quality of the shots generated by Kling.ai. What I found especially interesting was the ability to specify beginning and end frames using image upload. I was pretty blown away by the resulting videos, which had quite complicated camera movements and character motion, notwithstanding several noticeable and distracting visual artifacts in many cases. It also had options for Camera Movement, which was pretty impressive. I particularly toyed with zooming in, tilt angles and Roll. Coupling this with specific image motifs like "smoke wafting" or "incense smoke everywhere", I was quite satisfied with the resulting atmospherics of the visuals. The platform also seemed to handle character motions within the clips quite accurately. For instance, with the image upload option used to provide an Indian sadhu's starting image, I was amazed to see how the system simulated his motion within the frame through a combination of visual techniques such as scaling down of objects, adding eye movements, etc.

The videos generated by Stable Video in contrast seemed overly stylized and very AI-like for some reason. Although the resolution and quality of the images were arguably superior, I found it harder to control for desired camera movements and angles, and often had to settle for the closest-next-best-thing for a lot of the shots. What I did like about the Stable Video footage though was the control I was able to exercise over the look. For instance, most of my prompts included descriptions of lighting requirements as well as ambient light conditions ("late evening", "dusk", "strong backlight"), which I thought were rendered quite accurately. The backlighting of characters is a standard technique in special effects cinematography to ensure continuity between fast motion shots such as action sequences that often involve body doubles that may not share 100% facial similarities. I was especially interested in seeing whether I could create a consistent look, cinematographically speaking, to produce a visually coherent sequence of multiple clips. To this end, I used specific tags like "rainforest", "ancient trees" and "incense smoke" in all prompts to steer the model toward the type of atmospherics I wanted to achieve in the visuals. At a technical level, I am not sure if the model was playing off of previously generated visuals in the series to create a consistent "latent space" as it were, but the results were very much in the ballpark of the visual setting I had in mind.

Overall, I felt pleasantly surprised at the final result because it felt like I was able to produce a coherent enough visual look for the film by stringing together footage from different platforms, which is often what happens in motion pictures too (a cinematographer shoots on multiple film stocks and uses multiple digital LUTs to design the overall look of a film). I also felt a lot more comfortable getting the system to do what I wanted, meaning "prompt engineering" felt a lot less of a struggle than when I started out in the course.

Link to video posted on Vimeo: https://vimeo.com/1037879371/2f08bea5ae?share=copy

Posted: **Thu Dec 05, 2024 4:25 pm**

Final Project: "Food as architecture: Organic Urban Design"

For my final project I wanted to explore the intersection of AI and architecture, specifically how AI tools can reimagine architectural forms inspired by unconventional themes. I chose to explore "Food-inspired Architecture". The idea was to examine how organic, fluid forms derived from food aesthetics could inspire new architectural visions. I believe this approach challenges traditional architectural paradigms, inviting us to reconsider functionality and beauty in built environments.

FIRST SET OF IMAGES:

Prompt: "Architecture Mimicking food"

Early results captured the pasta texture but lacked the structural and architectural clarity needed to evoke buildings, so I tried adjusting the prompt to "Food-inspired architecture".
Food: Pasta
prompt: "Food-inspired architecture, organic, pasta, ultra-realistic, 8K quality"
chaos:30 ar 16:9, v 6.1 , stylize 30, weird 3000

prompt: "Food-inspired architecture, organic, pasta, ultra-realistic, 8K quality"
chaos:20 ar 16:9, v 6.1 , stylize 20, weird 3000

prompt: "Food-inspired architecture, organic, pasta, ultra-realistic, 8K quality"
chaos:10 ar 16:9, v 6.1 , stylize 0, weird 3000

Prompt: "Food-inspired architecture, CITY, organic, pasta, ultra-realistic, 8K quality"
chaos:10 ar 16:9, v 6.1 , stylize 0, weird 3000
REMIX : STRONG VARY: STRONG

SAMPLE IMAGES:

RESULTS:
Prompt: "Food-inspired architecture, CITY, organic, pasta, ultra-realistic, 8K quality, make the pasta look more like buildings"

The fine-tuning process demonstrates how slight prompt modifications and parameter adjustments in Midjourney allowed me to shape the AI output incrementally, aligning the visual results closer to my conceptual vision.
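The parameter sweep above can be written out as full prompt strings with a short sketch. It assumes the settings listed in the post correspond to Midjourney's `--chaos`, `--ar`, `--v`, `--stylize` and `--weird` parameters, typed after the prompt text. The helper `build_prompt` is a hypothetical convenience for keeping track of variations and is not part of Midjourney.

```python
def build_prompt(text, chaos, stylize, weird, ar="16:9", version="6.1"):
    """Assemble a prompt string followed by Midjourney-style parameters."""
    return (
        f"{text} --chaos {chaos} --ar {ar} --v {version} "
        f"--stylize {stylize} --weird {weird}"
    )


BASE = "Food-inspired architecture, organic, pasta, ultra-realistic, 8K quality"

# (chaos, stylize) pairs taken from the three pasta iterations in the post;
# weird stays at 3000 throughout.
SETTINGS = [(30, 30), (20, 20), (10, 0)]

prompts = [build_prompt(BASE, chaos, stylize, weird=3000)
           for chaos, stylize in SETTINGS]

for prompt in prompts:
    print(prompt)
```

Keeping the prompt text fixed and changing one or two values per step makes it easier to tell which adjustment caused a change in the output.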

Posted: **Mon Dec 09, 2024 2:58 pm**

For my presentation, I wanted to experiment and answer a few questions, specifically: can Stable Diffusion produce accurate and culturally resonant visual outputs? I approach this by examining its ability to capture the essence of pop culture references. In addition, I investigate how different levels of CFG (Classifier-Free Guidance) influence the model's capacity to reflect tone, aesthetics, and context in its outputs. To ground this exploration, I analyze visual interpretations of brat, an album by Charli XCX, alongside My World 2.0 by Justin Bieber, highlighting how the model's settings shape its representation of these distinct cultural artifacts.
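For readers unfamiliar with CFG, here is a minimal sketch of the commonly described classifier-free guidance step. At each denoising step the model makes one prediction without the prompt and one with it, and the guidance scale controls how far the result is pushed toward the prompted prediction. The function name `cfg_combine` and the numbers are toy assumptions for illustration, not values from Stable Diffusion.

```python
def cfg_combine(uncond, cond, scale):
    """Combine unconditional and prompt-conditioned predictions.

    result = uncond + scale * (cond - uncond), applied element by element.
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Toy predictions (assumed values, not real model outputs).
uncond_pred = [0.0, 1.0]
cond_pred = [1.0, 3.0]

print(cfg_combine(uncond_pred, cond_pred, 0.0))  # ignores the prompt
print(cfg_combine(uncond_pred, cond_pred, 1.0))  # plain prompted prediction
print(cfg_combine(uncond_pred, cond_pred, 2.0))  # pushed past the prompted one
```

In this formulation a higher scale amplifies the difference the prompt makes, which is one way to think about why high CFG values can produce more literal but harsher images.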
