Abstract:
This thesis explores the impact of single-threaded thinking on the performance and efficiency of Apache Spark applications. Many developers transitioning from local computing models struggle to fully embrace Spark's distributed nature, leading to inefficient data partitioning, improper resource management, excessive shuffling, and overall performance loss. This research identifies these challenges, quantifies their impact, and proposes best practices to optimize Spark's distributed computing capabilities. The study also introduces cognitive frameworks to help developers shift from sequential execution models to a parallel, distributed mindset. A mixed-methods approach is used to analyze and benchmark common inefficiencies, including repeated computation caused by misunderstanding lazy evaluation, failure to broadcast small datasets in joins, and improper handling of data skew and partition sizing. Experiments measure execution time, memory usage, shuffle spill volumes, and CPU utilization before and after optimization.
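To make two of these pitfalls concrete, the following minimal Scala sketch shows recomputation under lazy evaluation and its caching fix, followed by a broadcast join of a small lookup table; the paths, column names, and threshold are hypothetical illustrations, not taken from the benchmarked workloads:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object LazyEvalPitfalls {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("lazy-eval-sketch").getOrCreate()

    // Hypothetical inputs standing in for the benchmarked transaction logs.
    val transactions = spark.read.parquet("/data/transactions")
    val currencies   = spark.read.parquet("/data/currencies") // small lookup table

    // Pitfall 1: Spark evaluates lazily, so each action below re-runs the
    // entire filter against the source data.
    val large = transactions.filter("amount > 1000")
    large.count() // first full pass
    large.count() // second full pass: the filter is recomputed

    // Fix: cache the reused intermediate result.
    val cached = large.cache()
    cached.count() // materializes the cache
    cached.count() // served from memory; no recomputation

    // Pitfall 2: a plain join may shuffle both tables. Broadcasting the
    // small lookup table keeps the large table's partitions in place.
    val joined = cached.join(broadcast(currencies), Seq("currency"))
    joined.count()

    spark.stop()
  }
}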
Datasets ranging from 10 million to 400 million records, including real financial transaction logs, are employed to ensure realistic testing at scale.
Findings show that applying distributed-first principles, such as custom partitioning, pre-shuffle aggregation, and optimized caching, can improve execution time by up to 45%, reduce shuffle volume by 60%, and enhance memory efficiency by 30%. The study contributes actionable guidelines for Spark developers and highlights the need for educational resources to aid this paradigm shift.
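A minimal Scala sketch of these three principles follows; the input path, key structure, and partition count are illustrative assumptions rather than the thesis's actual benchmark configuration:

import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

object DistributedFirstSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("distributed-first-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical (account, amount) pairs standing in for transaction logs.
    val pairs = sc.textFile("/data/transactions.csv")
      .map(_.split(","))
      .map(fields => (fields(0), fields(1).toDouble))

    // Custom partitioning: co-locate records by key up front so that
    // subsequent key-based steps reuse the layout instead of reshuffling.
    val partitioned = pairs.partitionBy(new HashPartitioner(200))

    // Pre-shuffle aggregation: reduceByKey combines values on each executor
    // before any data moves, unlike groupByKey, which ships every record.
    val totals = partitioned.reduceByKey(_ + _)

    // Optimized caching: persist only the small, already-aggregated result
    // that later actions reuse.
    totals.cache()
    totals.count()                                 // materializes the cache
    totals.sortBy(-_._2).take(10).foreach(println) // reuses cached totals

    spark.stop()
  }
}

Because the pairs are already hash-partitioned by key, reduceByKey in this sketch aggregates without triggering a further shuffle.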
By addressing both technical and cognitive barriers, this research supports more efficient use of Spark in large-scale data processing environments.
More broadly, the study underscores the human factors, both cognitive and technical, that impede the proper use of distributed systems. By interpreting the obstacles developers encounter in the transition from local to distributed computing, it aims to provide a clearer understanding of why the value of distributed data processing frameworks such as Apache Spark is often difficult to realize.