English Abstract
In recent years, particularly since the introduction of GPT-4, numerous vision-language models have been developed, demonstrating significant progress. These models are expected to understand both text and images. However, the datasets used to train them often focus on straightforward interpretations of images. Some images require a deeper level of human interpretation because they can convey multiple meanings. One such category consists of images that create illusions of other images within themselves. At first glance, we can easily comprehend what such an image explicitly shows, but to understand the illusion within it, we need to view it differently. For example, consider an image that initially appears to show trees in a forest but, upon closer inspection, reveals the illusion of a rabbit formed by the arrangement and colors of its elements.
Human perception of images with both an explicit and a hidden, illusory meaning is relatively intuitive: one simply needs to adjust one's focus to detect the illusion. This ability, however, remains underexplored in vision-language models. In this problem, the model receives an image as input and is expected to identify the illusion within it. Current state-of-the-art models struggle to understand such images. To address this, we created a dataset and improved the models' understanding of these images using two methods: fine-tuning the models on these images, and applying a simple filter inspired by the effect observed in human vision when squinting. For fine-tuning, we collected a dataset with four subsets, each drawn from a different domain. Among these subsets, IllusionMNIST, IllusionFashionMNIST, and IllusionAnimals are used for classification, each containing 11 classes. To address cases beyond classification, we also built the IllusionChar subset, an OCR task in which each sample consists of 3 to 5 randomly arranged characters.
The statistics of the subsets are as follows: IllusionMNIST contains 3,960 training and 1,219 test samples; IllusionFashionMNIST contains 3,300 training and 1,267 test samples; IllusionAnimals contains 3,300 training and 1,100 test samples; and IllusionChar comprises 9,900 training and 3,300 test samples.
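As an illustration of how a vision-language model can be evaluated on these classification subsets, the following is a minimal zero-shot sketch using a CLIP-style model; the checkpoint name, the label phrasing, and the image path are illustrative assumptions rather than the exact setup used in this work.

```python
# Zero-shot classification sketch for an 11-class illusion subset.
# Checkpoint, label phrasing, and image path are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical 11-label set for a subset such as IllusionMNIST.
labels = [f"an image of the digit {d}" for d in range(10)] + ["an image with no hidden digit"]

image = Image.open("illusion_sample.png").convert("RGB")  # hypothetical file
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-label similarity, shape (1, 11)
print(labels[logits.softmax(dim=1).argmax().item()])
```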
Visual puzzles are another category that requires a broader perspective to grasp their meaning. For instance, consider an image showing an arc of colors. At first glance, it may appear to show only the arc, but with some thought, one can deduce that the intended concept is "rainbow." These images likewise require looking beyond the surface and reasoning about the linguistic aspect.
The goal of this research is to create or collect datasets for each of the cases mentioned above and to evaluate several state-of-the-art models on them. The experiments in this study reveal a significant gap between human understanding and model performance. For illusion images, we propose a simple yet effective method that applies a low-pass filter to the image. Experiments demonstrate the effectiveness of this approach in detecting illusions, achieving an F1 score of 94.23 and surpassing the human performance of 91.5. This method incurs minimal computational cost and does not require retraining the model. In the best case, we achieved an F1 score of 94.35, again surpassing human performance.
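To make the squinting-inspired pre-processing concrete, the sketch below applies a Gaussian blur as the low-pass filter before inference; the choice of a Gaussian kernel, the radius value, and the file name are assumptions, as the exact filter parameters are not specified here.

```python
# Squinting-inspired pre-processing sketch: a Gaussian blur acts as the
# low-pass filter. Kernel choice and radius are assumptions; the file
# name is hypothetical. The blurred image is passed to the model
# unchanged, so no retraining is required.
from PIL import Image, ImageFilter

def low_pass(path: str, radius: float = 2.0) -> Image.Image:
    """Suppress high-frequency detail so the hidden image dominates."""
    img = Image.open(path).convert("RGB")
    return img.filter(ImageFilter.GaussianBlur(radius=radius))

blurred = low_pass("illusion_sample.png")  # hypothetical file
# `blurred` replaces the original image in the model's input pipeline.
```

Because the filter only transforms the input, it composes with any vision-language model without touching its weights.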
To evaluate vision-language models on visual puzzles, we present a dataset of 253 manually collected and annotated samples, on which we evaluated the performance of various state-of-the-art models. These experiments indicate a large performance disparity between open-source and closed-source models, with the highest F1 score of 84.19 achieved by the GPT-4o model.
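For reference, the reported F1 scores can be computed from model predictions and gold annotations as sketched below; the macro averaging mode and the example labels are assumptions, since the averaging scheme is not stated here.

```python
# F1 computation sketch; averaging mode and labels are assumptions.
from sklearn.metrics import f1_score

gold = ["rainbow", "hourglass", "rainbow", "eclipse"]  # hypothetical concepts
pred = ["rainbow", "rainbow", "rainbow", "eclipse"]    # hypothetical outputs
print(f1_score(gold, pred, average="macro") * 100)  # reported on a 0-100 scale
```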
The importance of addressing these issues is twofold: firstly, it enhances the models’ understanding and reasoning capabilities, and secondly, illusions can be leveraged in steganography.