Mahdi Khoursha

Title

Template matching by prompt tuning of open-vocabulary object detection models

Degree level

Master's

Field of study

Computer Engineering

Year of study

1401

Defense date

1404/07/30

Supervisor

MohammadReza Mohammadi

Advisor

Faculty

Computer Engineering

Abstract

In recent years, the rapid growth of vision foundation models has transformed how classical computer vision problems such as template matching are solved. Template matching plays a central role in tasks such as detection and tracking, but traditional methods based on geometric and pixel-wise comparison struggle with challenges such as scale changes and poor generalization. This research presents a new framework based on foundation models that, instead of pixel-level matching, uses semantic representations to align the template image and the search image conceptually. To this end, a framework is introduced for the first time that makes an optimization-based solution to the template matching problem possible. Using MSDNet, a vision foundation model, the framework proposes the regions of the search image in which the template is most likely to be present. The model extracts these regions from the search image according to the prompt that the user supplies through the template image. Because the proposal model can return several regions, a re-ranking module is also used to compare the template with the proposed regions semantically and conceptually and to return the best region as the final result. Within this framework, the main challenge of template matching is proposing optimal regions; to improve the results, MSDNet, which is built on SAM, can be fine-tuned with a limited number of training samples by training its prompt branch. In this research, with only 84 training pairs, the AUC improved over the base model by more than 10% on the BBC dataset and by about 5% on the KTM dataset. Overall, this research shows that combining the rich representations of foundation models with targeted fine-tuning can provide an effective solution to template matching and support the development of more generalizable frameworks for similar computer vision tasks.

Data entry date

1405/02/13

Title in English

Template matching by prompt tuning of open vocabulary object detection models

Availability date

4/21/2026 12:00:00 AM

Data entered by (student)

Mahdi Khoursha

English abstract

In recent years, the remarkable growth of vision foundation models has fundamentally transformed the way classical computer vision problems, such as template matching, are addressed. Template matching, which plays a central role in tasks including detection and tracking, faces challenges in traditional approaches based on geometric and pixel-wise comparisons, particularly with respect to scale variation and limited generalization capability. In this study, a novel framework based on vision foundation models is proposed that, instead of relying on pixel-level matching, leverages semantic representations to achieve conceptual alignment between the template image and the search image. To this end, for the first time, a framework is introduced that enables an optimization-based solution to the template matching problem. Specifically, by employing the MSDNet network, which is a vision foundation model, the framework proposes regions in the search image where the likelihood of the template's presence is maximized. Based on a prompt provided by the user through the template image, the model extracts the corresponding regions from the search image. Given that the proposal model can generate multiple candidate regions, a re-ranking module is incorporated to perform semantic and conceptual comparisons between the template and the proposed regions, ultimately returning the most relevant region as the final result. Within this framework, the main challenge lies in identifying optimal region proposals for template matching. To further improve performance, the MSDNet model, which is built upon the SAM model, can be fine-tuned with a limited number of training samples by training only its prompt branch. As demonstrated in this research, using only 84 pairs of training data leads to improvements of more than 10% in terms of AUC on the BBC dataset and approximately 5% on the KTM dataset compared to the base model.
Overall, this study shows that combining the rich representations of foundation models with targeted fine-tuning can provide an effective solution to the template matching problem and facilitate the development of more generalizable frameworks for related computer vision tasks.

Persian keywords

prompt tuning, open-vocabulary object detection, vision foundation models, template matching

English keywords

prompt tuning, open vocabulary object detection, vision foundation models, template matching

Author

Mahdi Khoursha

Supervisor

MohammadReza Mohammadi

Link to this record

https://dl.iust.ac.ir/dl/search/default.aspx?Term=34772&Field=0&DTC=6