Abstract:
This thesis explores the impact of single-threaded thinking on the performance and efficiency of Apache Spark applications. Many developers transitioning from local computing models struggle to fully embrace Spark's distributed nature, leading to inefficient data partitioning, improper resource management, excessive shuffling, and overall performance loss. This research identifies these challenges, quantifies their impact, and proposes best practices to optimize Spark's distributed computing capabilities. The study also introduces cognitive frameworks to help developers shift from sequential execution models to a parallel, distributed mindset. A mixed-methods approach is used to analyze and benchmark common inefficiencies, including repeated computation caused by misunderstanding lazy evaluation, failure to broadcast small datasets in joins, and improper handling of data skew and partition sizing. Experiments measure execution time, memory usage, shuffle spill volumes, and CPU utilization before and after optimization.
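To make two of these pitfalls concrete, the following minimal Scala sketch shows recomputation under lazy evaluation and its caching fix, followed by a broadcast join of a small lookup table; the paths, column names, and threshold are hypothetical illustrations, not taken from the benchmarked workloads:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object LazyEvalPitfalls {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("lazy-eval-sketch").getOrCreate()

    // Hypothetical inputs standing in for the benchmarked transaction logs.
    val transactions = spark.read.parquet("/data/transactions")
    val currencies   = spark.read.parquet("/data/currencies") // small lookup table

    // Pitfall 1: Spark evaluates lazily, so each action below re-runs the
    // entire filter against the source data.
    val large = transactions.filter("amount > 1000")
    large.count() // first full pass
    large.count() // second full pass: the filter is recomputed

    // Fix: cache the reused intermediate result.
    val cached = large.cache()
    cached.count() // materializes the cache
    cached.count() // served from memory; no recomputation

    // Pitfall 2: a plain join may shuffle both tables. Broadcasting the
    // small lookup table keeps the large table's partitions in place.
    val joined = cached.join(broadcast(currencies), Seq("currency"))
    joined.count()

    spark.stop()
  }
}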
Datasets ranging from 10 million to 400 million records, including real financial transaction logs, are employed to ensure realistic testing at scale.
Findings show that applying distributed-first principles, such as custom partitioning, pre-shuffle aggregation, and optimized caching, can improve execution time by up to 45%, reduce shuffle volume by 60%, and enhance memory efficiency by 30%. The study contributes actionable guidelines for Spark developers and highlights the need for educational resources to aid this paradigm shift.
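A minimal Scala sketch of these three principles follows; the input path, key structure, and partition count are illustrative assumptions rather than the thesis's actual benchmark configuration:

import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

object DistributedFirstSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("distributed-first-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical (account, amount) pairs standing in for transaction logs.
    val pairs = sc.textFile("/data/transactions.csv")
      .map(_.split(","))
      .map(fields => (fields(0), fields(1).toDouble))

    // Custom partitioning: co-locate records by key up front so that
    // subsequent key-based steps reuse the layout instead of reshuffling.
    val partitioned = pairs.partitionBy(new HashPartitioner(200))

    // Pre-shuffle aggregation: reduceByKey combines values on each executor
    // before any data moves, unlike groupByKey, which ships every record.
    val totals = partitioned.reduceByKey(_ + _)

    // Optimized caching: persist only the small, already-aggregated result
    // that later actions reuse.
    totals.cache()
    totals.count()                                 // materializes the cache
    totals.sortBy(-_._2).take(10).foreach(println) // reuses cached totals

    spark.stop()
  }
}

Because the pairs are already hash-partitioned by key, reduceByKey in this sketch aggregates without triggering a further shuffle.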
By addressing both technical and cognitive barriers, this research supports more efficient use of Spark in large-scale data processing environments.
More broadly, the study underscores the human factors, both cognitive and technical, that impede the proper use of distributed systems. By interpreting the obstacles developers encounter in the transition from local to distributed computing, it aims to provide a clearer understanding of why the value of distributed data processing frameworks such as Apache Spark is often difficult to realize.