Reza Alipour

Title

Using the Hadoop Distributed File System to Process and Analyze Big Data on the Country's Household Expenditure and Income

Degree level

Master's

Field of study

Computer Engineering

Year of study

1399

Defense date

1401/06/30

Supervisor

Dr. Reza Entezari-Maleki

Faculty

Computer

Abstract

Big data is one of the important resources of today's world, and the various analyses performed on it yield valuable information and knowledge. Over the past two decades the volume of such data has been growing and continues to increase. The Hadoop framework, written in the Java programming language, is one of the most widely used tools for distributing and processing big data. Hadoop is a suitable tool that makes it possible to process large data sets using clustering and facilitates the management of semi-structured and unstructured data. In Iran, as in other countries, household data are collected every year as part of the country's official statistics. These data contain valuable information, but the results are published only at the national and provincial levels, and so far no results have been extracted at the county level. The aim of this research is to use the Hadoop framework to distribute and process household data at the level of a province's counties; the extracted information is then used for analysis. Based on the proposed model, the data of the country's 31 provinces were grouped into 4 clusters, and 4 virtual machine servers with 4 nodes were allocated for the deployment. The raw data were converted from SQL to CSV, loaded into HDFS files, and processed with map/reduce operations. In this way, and in line with the aims of this research, the intended outputs, such as a household's average communication expenditure and household access indicators such as Internet access, were extracted at the level of the counties of province 01, and comparisons were presented. The same information and indicators can evidently be extracted and analyzed more broadly, at the level of the counties of other provinces and even at the rural level. Based on the results of this research, it is suggested that, by using the Hadoop distributed file system, household data, which are currently collected centrally, offline, and with a delay, can be prepared faster than before. With timely outputs and information, faster and better analyses than before can be carried out.
It is also suggested that, by using the Hadoop distributed system, the annual household information extracted at the county level can be linked with the country's population census information, filling the statistical gap and completing the household access indicators.

Date of data entry

1401/08/15

Title in English

Hadoop Distributed File System Application in Analysis of Households Income and Expenditure Big Data

Date of availability

9/21/2023 12:00:00 AM

Student who entered the data

Reza Alipour

Name: Reza Alipour
Author: Reza Alipour

Abstract in English

Big data is one of the most important resources in today's world, and the various analyses performed on it yield valuable information and knowledge. Over the last two decades, the volume of this data has been growing steadily. The Hadoop framework, written in the Java programming language, is one of the most widely used tools for distributing and processing big data. Hadoop is a suitable tool that allows large data sets to be processed using clustering and facilitates the management of semi-structured and unstructured data. In Iran, as in other countries, household data are collected every year as part of official statistics. These data contain valuable information, but the results are published only at the national and provincial levels, and so far no results have been extracted at the county level. The purpose of this study is to use the Hadoop framework to distribute and process household data at the level of the counties of a province; the extracted information is then used for analysis. Based on the proposed model, the data of the country's 31 provinces were grouped into 4 clusters, and 4 virtual machine servers with 4 nodes were allocated for the deployment. The raw data were converted from SQL to CSV and loaded into HDFS files, and then Map/Reduce operations were performed. In line with the objectives of this research, outputs such as the average communication expenditure of a household and household access indicators such as Internet access were extracted at the level of the counties of province 01, and comparisons were presented. The same information and indicators can evidently be extracted and analyzed more broadly, at the level of the counties of other provinces and even at the village level.
Based on the results of this research, it is suggested that, by using the Hadoop distributed file system, household data, which are currently collected centrally, offline, and with a delay, can be prepared faster than in the past. With timely outputs and information, faster and better analyses can be performed than in the past. It is also suggested that, by using the Hadoop distributed system, the annual household information extracted at the county level can be linked with the country's population census information, filling the statistical gap and completing the household access indicators.
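
The map/reduce averaging step mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: it is plain Python that runs in memory rather than on Hadoop, and the column names (`county_code`, `household_id`, `communication_expenditure`) and the sample values are hypothetical, chosen only to show how a per-county mean could be built from a mapper and a reducer.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV layout and made-up values; the real survey schema is not
# given in the abstract.
SAMPLE_CSV = """county_code,household_id,communication_expenditure
0101,1,120
0101,2,80
0102,3,200
"""


def map_record(row):
    """Map step: emit (county_code, (expenditure, 1)) for one household row."""
    return row["county_code"], (float(row["communication_expenditure"]), 1)


def reduce_county(pairs):
    """Reduce step: sum expenditures and counts, then divide to get the mean."""
    total = sum(expenditure for expenditure, _ in pairs)
    count = sum(n for _, n in pairs)
    return total / count


def average_by_county(csv_text):
    """Run map, group by key, and reduce over CSV text held in memory."""
    grouped = defaultdict(list)  # stands in for Hadoop's shuffle/sort phase
    for row in csv.DictReader(io.StringIO(csv_text)):
        key, value = map_record(row)
        grouped[key].append(value)
    return {county: reduce_county(values) for county, values in grouped.items()}


if __name__ == "__main__":
    print(average_by_county(SAMPLE_CSV))
```

Emitting a (sum, count) pair rather than a bare value keeps the reduce step associative, so partial results from different nodes could be combined before the final division.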

Persian keywords

Hadoop framework , distributed file system , MapReduce , big data , household data

English keywords

Hadoop framework , Hadoop Distributed File System , MapReduce , Big Data , Household Data

Author

Reza Alipour

Supervisor

Dr Reza Entezari

Link to this document

https://dl.iust.ac.ir/dl/search/default.aspx?Term=27282&Field=0&DTC=6