Abstract
Cancer is one of the leading causes of death worldwide, and early, accurate diagnosis is the main factor in reducing mortality from it. The large volume of cancer patient data collected in recent years can therefore be used to draw conclusions about the heredity of cancer and, based on the derived rules, to provide preventive and practical recommendations. Golestan Province has long been known as a high-risk area for cancer, especially gastrointestinal cancers, including liver, small intestine, large intestine, esophagus, stomach, and colorectal cancers. Early detection is a key element of cancer control. Given the high incidence of cancer and the role of a cancer biobank in designing practical studies, the Golestan Cancer Biobank project was designed to provide a platform for designing and implementing applied research projects in the field of cancer. The Golestan Cohort Project, which started in 2005, is one of the rich and valuable research infrastructures in Golestan Province. It is being conducted on 50,000 residents of the eastern region of the province, and its main objective is to identify the risk factors for upper gastrointestinal cancers. In this thesis, an expert first determined the variables to be used in the models and algorithms. The data were then prepared for entry into the algorithms, categorized, and labeled. Next, missing values were replaced, and the data were harmonized and merged. The gastrointestinal cancer data, comprising 733 cases, were then separated. To improve the results, the variables were grouped thematically, and dimensionality reduction was performed on each variable group using clustering. This research introduces an extended algorithm called RApriori, which eliminates many redundant rules and rationalizes the resulting patterns through the definition of a posterior set.
The RApriori, FP-growth, ECLAT, and K-modes clustering algorithms were implemented on the prepared data and compared in terms of output usefulness, interpretability, and execution time. The RApriori, K-modes clustering, and FP-growth algorithms showed satisfactory performance. Overall, analysis of the outputs, combined with the expert's opinion, yielded some useful rules on the relationship between esophageal and colon cancer.
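The idea of discarding uninteresting rules can be illustrated with a small sketch. This is not the RApriori algorithm of this thesis, whose details are not given in the abstract. It is a plain Apriori-style search in Python over hypothetical toy records, in which a rule is kept only if its consequent belongs to a user-defined set. Treating the "posterior set" as such a set of allowed consequents is an assumption, and all names, items, and thresholds below are illustrative.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search; returns {itemset: support}."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    while current:
        # Keep the candidates whose relative support reaches the threshold.
        level = {}
        for cand in current:
            support = sum(1 for t in transactions if cand <= t) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        keys = list(level)
        k = len(keys[0]) + 1 if keys else 0
        candidates = {a | b for a, b in combinations(keys, 2) if len(a | b) == k}
        # Prune step: every k-item subset of a candidate must be frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


def rules_with_consequent(frequent, consequents, min_confidence):
    """Build rules antecedent -> target, only for targets in `consequents`."""
    rules = []
    for itemset, support in frequent.items():
        for target in itemset & consequents:
            antecedent = itemset - {target}
            if not antecedent:
                continue  # single-item sets give no rule
            # The antecedent is frequent too (downward closure), so it is in the dict.
            confidence = support / frequent[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, target, support, confidence))
    return rules


# Hypothetical toy records: "a", "b", "c" stand for factors, "x", "y" for outcomes.
transactions = [
    frozenset({"a", "b", "x"}),
    frozenset({"a", "x"}),
    frozenset({"b", "x"}),
    frozenset({"a", "b"}),
    frozenset({"c", "y"}),
    frozenset({"a", "c", "y"}),
]

frequent = frequent_itemsets(transactions, min_support=0.3)
rules = rules_with_consequent(frequent, consequents={"x", "y"}, min_confidence=0.6)

for antecedent, target, support, confidence in rules:
    print(sorted(antecedent), "->", target, round(support, 2), round(confidence, 2))
```

Because only outcomes are allowed as consequents, rules that point from an outcome back to a factor (for example "x" implying "b") are never generated, which reduces the number of rules an expert has to review.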