Abstract:
In today's data-driven landscape, enterprises face significant challenges in managing and
processing massive amounts of data to derive meaningful insights and support informed decision-making.
Data preparation, a critical process that converts raw data into a usable format, plays a pivotal
role in the data pipeline and significantly impacts downstream data analysis and modeling.
However, traditional data preparation methods may struggle to keep up with the increasing
volumes and complexity of data, leading to scalability issues, inefficiencies, delays, and
suboptimal performance in the data pipeline. This thesis presents a comprehensive scalability
optimization study that analyzes and optimizes the data preparation process in enterprise-grade
data pipelines. The study begins by analyzing common components of data pipelines
and identifying the limitations and bottlenecks that hinder scalability. It then examines
existing data preparation methods and tools, as well as cutting-edge technologies
such as Apache NiFi, Apache Atlas, and Apache Spark, for addressing
scalability challenges. The research draws on the literature, industry practices, and
state-of-the-art technologies to propose practical strategies and recommendations for
designing a scalable data strategy in an enterprise setting, and it offers actionable
guidance for enhancing the performance of data pipelines in enterprise-grade
data environments. The thesis concludes with a summary of key findings, limitations,
and future research directions, emphasizing the need for a well-designed data preparation
pipeline that incorporates scalable data ingestion, efficient data transformation, and
intelligent data storage strategies to ensure reliable and efficient data processing in
enterprises dealing with large volumes of data.