Report 6: Vision & Machine Learning

glegrady
Posts: 203
Joined: Wed Sep 22, 2010 12:26 pm

Report 6: Vision & Machine Learning

Post by glegrady » Mon Oct 05, 2020 2:00 pm

MAT594GL Techniques, History & Aesthetics of the Computational Photographic Image
https://www.mat.ucsb.edu/~g.legrady/aca ... f594b.html

Please provide a response to any of the material covered in this week's two presentations by clicking on "Post Reply". Consider this to be a journal to be viewed by class members. The idea is to share thoughts, other information through links, anything that may be of interest to you and the topic at hand.


The report for this topic is due by November 24, 2020, but each of your submissions can be updated throughout the length of the course.
George Legrady
legrady@mat.ucsb.edu

k_parker
Posts: 9
Joined: Sun Oct 04, 2020 11:54 am

Re: Report 6: Vision & Machine Learning

Post by k_parker » Tue Dec 01, 2020 9:03 am

This week I was drawn to the readings: Inceptionism: Going Deeper into Neural Networks, Alexander Mordvintsev (2015); The Vision Machine, Paul Virilio (1994); and Excavating AI, Kate Crawford and Trevor Paglen (2019). Together the articles provide a chronology of process: capturing reality; creating a duplicity of existence (the virtual and the lived reality); CNNs decoding that virtual reality; CNNs determining what they are seeing and labeling the virtual reality based on flawed and biased datasets; and finally the element of "deep dream," where CNN neurons can find whatever they are trying to look for even when it is not there.

I first reviewed Mordvintsev's paper discussing "deep dreams" and CNNs' ability to find, fabricate, and/or decipher images in clouds. This is undoubtedly complicated by Crawford and Paglen's work revealing the problematic biases of the datasets that are used to train and evaluate AI. "Images are remarkably slippery things, laden with multiple potential meanings, irresolvable questions, and contradictions. Entire subfields of philosophy, art history, and media theory are dedicated to teasing out all the nuances of the unstable relationship between images and meanings" (Crawford and Paglen).

However, I find something really lovely about CNNs fabricating images based on random noise or something non-specific like cloud formations. Though perhaps my bias is that I am drawn to a uniquely flawed and human-like AI rather than what AI has the potential to become. A neuron is specifically designed to set out and look for a particular object/feature (though based on a deeply flawed dataset), and in each example provided in the text and in lecture, that neuron can be manipulated to produce what it was looking for. When not presented with the information it was after, the neuron produced features that were not there.
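A rough sketch of how that "finding what it looks for" works in code, assuming PyTorch and a pretrained torchvision VGG16; the layer index, step size, and step count are arbitrary illustrative choices, not the original Deep Dream implementation:

# Sketch of Deep Dream-style activation maximization (assumes torch/torchvision).
# Starting from noise, the input is nudged so a chosen layer's activations grow,
# so the network produces the features it was trained to look for even when absent.
import torch
import torchvision.models as models

model = models.vgg16(pretrained=True).features.eval()
layer_index = 20                                       # arbitrary mid-level layer
img = torch.rand(1, 3, 224, 224, requires_grad=True)   # random-noise starting image

def activation_at(x, idx):
    for i, layer in enumerate(model):
        x = layer(x)
        if i == idx:
            return x

for step in range(50):
    loss = activation_at(img, layer_index).norm()      # how strongly the layer responds
    loss.backward()
    with torch.no_grad():
        img += 0.05 * img.grad / (img.grad.abs().mean() + 1e-8)   # amplify what it "sees"
        img.grad.zero_()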

I believe this ties in nicely to the argument Virilio makes about the duality of reality with live capture: "paradoxical logic emerges when the real-time image dominates the thing represented, real time subsequently prevailing over real space, virtuality dominating actuality and turning the very concept of reality on its head" (Virilio).

The fabrication of images in deep-dream neuron isolation does not alter/decipher reality but rather a second, virtual reality. And while deciphering this second reality looking for a specific image, the neuron, in each example shown, produces a multitude of those images. It seems to me that AI with CNNs must exist and operate within this duplicity.
“philosophical question of the splitting of viewpoint, the sharing of perception of the environment between the animate (the living subject) and the inanimate (the object, the seeing machine)”(Virilio).
“where TRUE and FALSE are no longer relevant. The actual and the virtual have gradually taken their place”(Virilio).

This is of course further complicated by Crawford and Paglen's discussion of training sets using performed rather than actual emotions: "These, of course, are all 'performed' expressions—not relating to any interior state, but acted out in a laboratory setting" (Crawford and Paglen).
This action further reinforces the "second" reality of the virtual. Virtual humans exist as something that can be determined and reduced, inside a strict structural classification, to nouns. It is my opinion that a potential solution also lies in duplicity: a mixed categorization of images, so that an image is labeled and exists as a multitude of things. (I am thinking of Weihao's demonstration, where an image is determined to be one thing but with percentages indicating it could be something else.) This certainly does not solve the hidden biases in synthetic perception, but it could add uncertainty, which, in my opinion, is much closer to lived experience. However, this might not be practical in application.
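A small sketch of what that "multitude with percentages" looks like in practice, assuming PyTorch/torchvision and a pretrained ResNet; "example.jpg" is a placeholder path:

# Sketch of "an image labeled as a multitude of things": a classifier's softmax
# output assigns a probability to every class, not just the single top label.
# Assumes torch/torchvision; the image path is a placeholder.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.resnet50(pretrained=True).eval()
preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    probs = torch.softmax(model(img), dim=1)[0]

top5 = torch.topk(probs, 5)
for p, idx in zip(top5.values, top5.indices):
    print(f"class {idx.item()}: {p.item():.1%}")    # e.g. 62% one label, 20% another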

When thinking about the duality of computational perception, I am immediately drawn to the work of Kasimir Malevich and his manifesto. The manifesto attempts to go below the threshold of symbolism, below representation altogether, as if only when void of naturalistic representation is artistic production then defined as "absolute creation", in the context of Malevich's From Cubism and Futurism to Suprematism: The New Realism in Painting (1916). The notions of naturalistic singularity and totality in preexisting representation are questioned and denounced in favor of a multiplicity that can account for the compression of time and space into a singular plane, which finds its relative match in the compression of reality shown in computational perception. I believe Masood also mentioned the connection from fascist art movements to our discussion of computational aesthetics, where computational aesthetics wished to remove the image from context.

With Suprematism, it is the animation (or the conception of it as such) of the picture plane itself that is key to Malevich: "a painted surface is a real, living form. Intuitive feeling is now becoming conscious, no longer is it subconscious. Or even, rather, the other way round - it was always conscious, only the artist was unable to interpret its demands". Later in the piece, Malevich reiterates, "the square is not a subconscious form. It is the creation of intuitive reason". I believe this conceptually translates to the active plasticity and duality of computational perception.

ehrenzeller
Posts: 8
Joined: Thu Oct 22, 2020 7:10 pm

Re: Report 6: Vision & Machine Learning

Post by ehrenzeller » Wed Dec 09, 2020 9:04 am

A few things stood out to me this week. First, I appreciated the discussion of Baldessari's use of the human gaze in his subjects, directing the attention of the viewer around his photographs. This is often something I strive to accomplish in my collage work, using the elements of photos I excise to bounce the viewer between various sections.

Weihao's CNN presentation did a great job explaining something I had been apprehensive about learning, for fear of getting lost in the technical jargon, and it certainly left me with more of an understanding of feature recognition and of training machines to recognize similarities in datasets.

Lastly, I found "Excavating AI: The Politics of Images in Machine Learning Training Sets" by Kate Crawford and Trevor Paglen to be a bit troubling when it came to labeling human datasets. Though the 1980s WordNet database may be to blame for its inclusion of labels like "Pseudohermaphrodite", "Switch Hitter", "Cocksucker", "Pervert", "Call Girl", "Drug Addict", "Closet Queen", "Convict", etc., I don't see how using these terms to describe photo data can possibly lead to anything positive.

In fact, including these terms in machine learning will only teach machines to propagate the narrow-minded, abrasive culture of exclusion that humanity is finally beginning to move past. Through continued use, programmers are able to pin the blame on non-living entities for coming up with these results. I don't see why the computers can't just come up with the groupings based on feature recognition, then survey a diverse sample of humans to assign titles, which could then be grouped to create a new taxonomic language for describing the photos. Language evolves as much as culture does. This is especially concerning when it comes to which features we focus on. (Is it more important that the man is Black? Short? Wealthy? A scientist?) How the computer recognizes a "Corinthian" today (another WordNet category) will undoubtedly be different decades from now.
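For what it's worth, the "group first, name later" workflow I'm imagining could be roughly sketched like this, assuming torchvision for features and scikit-learn for clustering; the image paths and cluster count are placeholders:

# Sketch of the proposed workflow: group images by visual features alone, and
# leave the naming of each group to a survey of people rather than WordNet.
# Assumes torch/torchvision and scikit-learn; image paths are placeholders.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.cluster import KMeans

backbone = models.resnet18(pretrained=True)
backbone.fc = torch.nn.Identity()            # keep the 512-d feature vector, drop the classifier
backbone.eval()

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
paths = ["photo_001.jpg", "photo_002.jpg", "photo_003.jpg"]   # placeholder image files

with torch.no_grad():
    feats = torch.stack([backbone(preprocess(Image.open(p).convert("RGB")).unsqueeze(0))[0]
                         for p in paths])

groups = KMeans(n_clusters=2, n_init=10).fit_predict(feats.numpy())
# 'groups' holds unlabeled cluster ids; the titles would come later, from people.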

merttoka
Posts: 21
Joined: Wed Jan 11, 2017 10:42 am

Re: Report 6: Vision & Machine Learning

Post by merttoka » Fri Dec 11, 2020 5:34 am

This week, I appreciated the discussion about human vision and machine learning methods. Interestingly, I recently came across an article that improves the robustness of CNN-based object recognition by mimicking the visual cortex of the human brain.

[Attachment: ss.jpg]
The authors mention that current object recognition models are prone to tiny perturbations in the input image. These perturbations can be so small that the human eye cannot differentiate them. In the above image, the first image of the cat is correctly identified as a cat. Yet, after modifying the input image slightly, the same network recognizes it as a sleeping bag. This demonstrates the vulnerability of such models.
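The article doesn't include code, but the kind of tiny perturbation it describes can be sketched with the classic fast gradient sign method, assuming PyTorch/torchvision and a pretrained classifier (random noise stands in for a real photo here):

# Sketch of the "tiny perturbation" idea (fast gradient sign method).
# An imperceptible nudge in the direction that increases the loss can flip the
# predicted label. Assumes torch/torchvision; the image is a random stand-in.
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet50(pretrained=True).eval()
image = torch.rand(1, 3, 224, 224, requires_grad=True)   # stand-in for a cat photo
true_label = torch.tensor([281])                          # ImageNet index for "tabby cat"

loss = F.cross_entropy(model(image), true_label)
loss.backward()

epsilon = 0.003                                           # far below what the eye notices
adversarial = (image + epsilon * image.grad.sign()).clamp(0, 1)

with torch.no_grad():
    print(model(image).argmax().item(), model(adversarial).argmax().item())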
"It’s thought that V1 detects local edges or contours of objects, and textures, and does some type of segmentation of the images at a very small scale. Then that information is later used to identify the shape and texture of objects downstream."
[Attachment: Screenshot 2020-12-11 051926.jpg]
To address this, their proposed network, VOneNet, prepends another layer to the traditional CNN architecture that mimics the first stage of visual processing in the cortex (V1). This stage is well suited for pattern recognition, and it highlights important features that guide attention and gaze shifting. Even though I don't fully understand its implementation, it looks like they divide the pattern recognition into separate parts that model simple and complex cells independently.
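I'm not reproducing the authors' VOneNet code here, but the general idea of bolting a fixed, V1-inspired filter bank onto the front of an ordinary CNN might be sketched like this, assuming PyTorch/torchvision; the Gabor parameters and the 1x1 adapter are my own simplifications:

# Rough sketch of the idea (not the authors' code): a fixed Gabor filter bank,
# standing in for V1 simple cells, is prepended to an ordinary CNN so the first
# stage of processing is hard-wired rather than learned. Assumes torch/torchvision.
import math
import torch
import torch.nn as nn
import torchvision.models as models

def gabor_kernel(size=11, theta=0.0, freq=0.2, sigma=3.0):
    # One oriented Gabor filter, a common stand-in for a V1 simple cell.
    ax = torch.arange(size).float() - size // 2
    y, x = torch.meshgrid(ax, ax, indexing="ij")
    xr = x * math.cos(theta) + y * math.sin(theta)
    return torch.exp(-(x**2 + y**2) / (2 * sigma**2)) * torch.cos(2 * math.pi * freq * xr)

# A small bank of 8 orientations; the real model uses hundreds of filters with
# varied frequencies and phases, plus simple/complex cell nonlinearities.
bank = torch.stack([gabor_kernel(theta=i * math.pi / 8) for i in range(8)]).unsqueeze(1)

v1_front = nn.Conv2d(3, 8, kernel_size=11, padding=5, bias=False)
with torch.no_grad():
    v1_front.weight.copy_(bank.repeat(1, 3, 1, 1) / 3)    # same filter on each color channel
v1_front.weight.requires_grad_(False)                      # fixed front end, not trained

adapter = nn.Conv2d(8, 3, kernel_size=1)                   # map back to 3 channels for the backbone
model = nn.Sequential(v1_front, nn.ReLU(), adapter, models.resnet18(pretrained=False))
print(model(torch.rand(1, 3, 224, 224)).shape)             # torch.Size([1, 1000])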
"We can use this as a tool for novel neuroscientific discoveries, and also continue developing this model to improve its performance under this challenging task."

zhangweidilydia
Posts: 12
Joined: Fri Jan 19, 2018 11:09 am

Re: Report 6: Vision & Machine Learning

Post by zhangweidilydia » Wed Dec 16, 2020 2:43 am

Two artworks I recently resonated with -
LAUREN by Lauren Lee McCarthy
https://lauren-mccarthy.com/LAUREN
"I attempt to become a human version of Amazon Alexa, a smart home intelligence for people in their own homes. The performance lasts up to a week. It begins with an installation of a series of custom designed networked smart devices (including cameras, microphones, switches, door locks, faucets, and other electronic devices). I then remotely watch over the person 24/7 and control all aspects of their home. I aim to be better than an AI because I can understand them as a person and anticipate their needs. The relationship that emerges falls in the ambiguous space between human-machine and human-human."
https://immerse.news/feeling-at-home-be ... 47561e7f04

This work reminds me of Xu Bing's Dragonfly Eyes:
Few images come closer to reality than those recorded by surveillance cameras. In China, a country with strict film censorship, an estimated 200 million such cameras have been installed to capture life unfiltered; mundane daily activities are mixed with dramatic events beyond the realm of imagination. Visual artist Xu Bing’s first feature film stitches together surveillance footage collected from the Internet to create a fictional tale about a young woman traversing life in modern China. The result is a provocative tale as mundane, surreal, and outlandish as reality itself. Known for works that consistently disrupt our understanding of what we see—from Book from the Sky, an installation of books and scrolls with printed “fake” Chinese characters, to Phoenix, giant phoenix sculptures made of salvaged materials—Xu persistently explores the relationship between vision and meaning.
Link: https://www.moma.org/calendar/film/5009

These two works are connected in an interesting way, and they make me think about the question: who is behind the eyes of the machine?

wqiu
Posts: 14
Joined: Sun Oct 04, 2020 12:15 pm

Re: Report 6: Vision & Machine Learning

Post by wqiu » Wed Dec 16, 2020 2:27 pm

I have been reading David Marr's book, Vision. It surprised me how similar the principles of neural networks are to human perception.

In the book, Marr discusses how information is extracted from visual signals: starting with seeing the image, followed by a primal sketch, a 2.5-D sketch, and eventually a 3-D model representation. Compare this to how a CNN recognizes an image: pixel values are processed through layers of neurons, the lower layers extract low-level features, and the complexity of the extracted features increases as the layers extend. Eventually, the information describing the image scene is extracted.
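A small sketch of that layered read-out, assuming PyTorch/torchvision and a pretrained VGG16; mapping particular layers onto Marr's stages is only a loose analogy, and the chosen indices are arbitrary:

# Sketch of the parallel: probing activations at increasing depth of a pretrained
# CNN, loosely analogous to Marr's primal sketch -> 2.5-D sketch -> 3-D model.
# Assumes torch/torchvision; the probed layer indices are arbitrary choices.
import torch
import torchvision.models as models

features = models.vgg16(pretrained=True).features.eval()
probe_layers = {4: "early (edges, local contrast)",
                16: "middle (textures, object parts)",
                29: "late (object-like configurations)"}

x = torch.rand(1, 3, 224, 224)        # stand-in for an input photograph
with torch.no_grad():
    for i, layer in enumerate(features):
        x = layer(x)
        if i in probe_layers:
            print(f"layer {i:2d} {probe_layers[i]:<32} -> {tuple(x.shape)}")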
[Attachment: david marr.jpeg]
Another parallel is in the chapter on the Purpose of Vision, where Marr compares the vision systems of different species.
Many types of jumping spider use vision to tell the difference between a potential meal and a potential mate. One type, for example, has a curious retina formed of two diagonal strips arranged in a V. If it detects a red V on the back of an object lying in front of it, the spider has found a mate. Otherwise, maybe a meal. The frog, as we have seen, detects bugs with its retina; and the rabbit retina is full of special gadgets, including what is apparently a hawk detector, since it responds well to the pattern made by a preying hawk hovering overhead.
This paragraph made me think of the theory of evolution: animal species' vision systems are trained by nature over evolutionary history. How similar this is to neural network training! When training neural networks, we iteratively feed examples with labels to the network and correct it when it behaves wrongly. After iterations of training, many neurons, or groups of neurons, which are similar to the "gadgets" of those animals' vision systems, are formed naturally to accomplish specialized recognition tasks.
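In code, that feed-and-correct loop is just the standard supervised training step; a minimal sketch assuming PyTorch, with random noise standing in for real examples and labels:

# Minimal sketch of the "feed examples, correct when wrong" loop described above.
# Assumes torch; the data here is random noise with made-up labels, purely illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    images = torch.rand(32, 1, 28, 28)        # stand-in batch of examples
    labels = torch.randint(0, 10, (32,))      # stand-in labels
    loss = loss_fn(model(images), labels)     # how wrongly did it behave?
    optimizer.zero_grad()
    loss.backward()                           # the correction signal
    optimizer.step()                          # nudge the "gadgets" into shape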
The usefulness of a representation depends upon how well suited it is to the purpose for which it is used
With the help of different "gadgets", images are represented differently in different species. This concept is powerful, because we can then construct new neural networks from the perspective of representation rather than recognition. Especially when a network is to be used for creating art, it should yield an image representation distinct from what is yielded by a network trained for pure computer-vision purposes. On the other hand, artists can seek to understand not only the functionality, or the initial purpose, of an existing neural network, but also the representation yielded by the network, and repurpose it for artistic creation. This representation is indirectly embedded in the "black box" of neuron activations, waiting for people to discover.
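One concrete example of such repurposing (my example, not Marr's) is how neural style transfer treats the Gram matrix of an intermediate layer as a texture/"style" representation, detached from what the network was trained to recognize; a rough sketch assuming PyTorch/torchvision:

# Sketch of repurposing a recognition network's hidden representation for art:
# the Gram matrix of an intermediate layer describes texture/"style", independent
# of what the network was trained to name. Assumes torch/torchvision; layer choice is arbitrary.
import torch
import torchvision.models as models

features = models.vgg16(pretrained=True).features.eval()

def gram_at_layer(image, layer_index=10):
    x = image
    for i, layer in enumerate(features):
        x = layer(x)
        if i == layer_index:
            c = x.shape[1]
            flat = x.view(c, -1)
            return flat @ flat.t() / flat.shape[1]   # channel-by-channel correlations

with torch.no_grad():
    g = gram_at_layer(torch.rand(1, 3, 224, 224))    # stand-in image
print(g.shape)                                        # a representation of texture, not identity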

yichenli
Posts: 14
Joined: Mon Apr 16, 2018 10:23 am

Re: Report 6: Vision & Machine Learning

Post by yichenli » Wed Dec 16, 2020 5:59 pm

I watched a video about autostereograms and their difference from stereograms.
https://www.youtube.com/watch?v=v8O8Em_RPNg

What I also found interesting is how the number of times an object is repeated across the image can be manipulated to convey depth, and the fact that some of the early autostereograms were hand-drafted, which means that in this example, there was only a rectangle:
[Attachment: Screen Shot 2020-12-16 at 5.48.27 PM.png]
Afterwards, engineers came up with algorithms for generating more complicated scenes:
[Attachment: Screen Shot 2020-12-16 at 5.51.11 PM.png]
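The basic single-image random-dot version of that algorithm is simple enough to sketch; here is a rough, simplified take assuming numpy/matplotlib, with a flat rectangle as the hidden depth map:

# Simplified sketch of a random-dot autostereogram: a strip of random dots is
# repeated across each row, and the repetition period is shortened where the
# hidden depth map is closer, so the repeats themselves encode depth.
# Assumes numpy and matplotlib.
import numpy as np
import matplotlib.pyplot as plt

h, w, period = 256, 512, 64
depth = np.zeros((h, w))
depth[64:192, 192:320] = 20            # hidden rectangle floating above the background

rng = np.random.default_rng(0)
img = rng.random((h, w))
for y in range(h):
    for x in range(period, w):
        shift = int(depth[y, x])       # nearer points repeat at a shorter period
        img[y, x] = img[y, x - period + shift]

plt.imshow(img, cmap="gray"); plt.axis("off"); plt.show()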
The fact that depth can exist without particular content, yet still be detected by people, is really interesting. It drew me to look at other tools that convey attributes rather than the content of vision, such as this project called Seeing With Sound, a "bifocal" soundscape for the blind:
https://www.youtube.com/watch?v=CQ4RPR3ETPY
[Attachment: Screen Shot 2020-12-16 at 6.00.30 PM.png]
The software seems to "scan" a scene from the center out to the left and right, sonifying brightness data rather than layers of depth. This reminds me of fax machines, which are very different from how sighted people perceive things. It makes sense, because if the scene were scanned in layers by depth, the objects' locations would be hard to encode and the sounds would not be very informative to the user. But I am still curious whether there is research on approaches that do not scan images that way.
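A rough sketch of that column-by-column "scan and sonify brightness" idea (not the actual Seeing With Sound implementation), assuming numpy and the standard-library wave module; vertical position maps to pitch and brightness to loudness:

# Sketch of a left-to-right brightness sonification: each image column becomes a
# moment in time, vertical position maps to pitch, and brightness maps to loudness.
# Assumes numpy; the "scene" is random noise standing in for a real image.
import numpy as np
import wave

rng = np.random.default_rng(1)
image = rng.random((32, 64))                   # stand-in grayscale scene: 32 rows x 64 columns
rate, col_dur = 22050, 0.03                    # 30 ms of sound per image column
t = np.arange(int(rate * col_dur)) / rate
freqs = np.linspace(200, 2000, image.shape[0])[::-1]    # top rows -> higher pitch

signal = np.concatenate([
    sum(b * np.sin(2 * np.pi * f * t) for b, f in zip(col, freqs))
    for col in image.T                         # scan the columns left to right
])
signal = (signal / np.abs(signal).max() * 32767).astype(np.int16)

with wave.open("scan.wav", "wb") as f:
    f.setnchannels(1); f.setsampwidth(2); f.setframerate(rate)
    f.writeframes(signal.tobytes())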
