ADA Library Digital Repository

From Local to Distributed: Avoiding Single-Threaded Thinking in Apache Spark Development

dc.contributor.author Ismayilova, Kamila
dc.date.accessioned 2025-10-27T07:14:25Z
dc.date.available 2025-10-27T07:14:25Z
dc.date.issued 2025-04
dc.identifier.uri http://hdl.handle.net/20.500.12181/1502
dc.description.abstract This thesis explores how single-threaded thinking affects the performance and efficiency of Apache Spark applications. Many developers moving from local computing models struggle to fully embrace Spark’s distributed nature, which leads to inefficient data partitioning, improper resource management, excessive shuffling, and overall performance loss. This research identifies these challenges, quantifies their impact, and proposes best practices for making full use of Spark’s distributed computing capabilities. The study also introduces cognitive frameworks to help developers shift from sequential execution models to a parallel, distributed mindset. A mixed-methods approach is used to analyze and benchmark common inefficiencies, including repeated computation caused by misunderstanding lazy evaluation, failure to broadcast small datasets in joins, and improper handling of data skew and partition sizing. Experiments measure execution time, memory usage, shuffle spill volumes, and CPU utilization before and after optimization. Datasets ranging from 10 million to 400 million records, including real financial transaction logs, are used to ensure realistic testing at scale. Findings show that applying distributed-first principles (such as custom partitioning, pre-shuffle aggregation, and optimized caching) can improve execution time by up to 45%, reduce shuffle volume by 60%, and improve memory efficiency by 30%. The study contributes actionable guidelines for Spark developers and highlights the need for educational resources to support this paradigm shift. By addressing both technical and cognitive barriers, this research supports more efficient use of Spark in large-scale data processing environments.
This research also underscores the human factors that impede the proper use of distributed systems, examining both cognitive and technical barriers. By interpreting the obstacles developers encounter in the transition from local to distributed computing, it aims to give a better understanding of the problems that arise when trying to realize the value of distributed data processing frameworks such as Apache Spark. en_US
dc.language.iso en en_US
dc.publisher ADA University en_US
dc.rights Attribution-NonCommercial-NoDerivs 3.0 United States *
dc.rights.uri http://creativecommons.org/licenses/by-nc-nd/3.0/us/ *
dc.subject Data partitioning (Computer science) en_US
dc.subject Cognitive psychology -- Computer programming en_US
dc.subject Software engineering -- Best practices en_US
dc.subject Computer systems -- Performance en_US
dc.subject Parallel processing (Electronic computers) en_US
dc.title From Local to Distributed: Avoiding Single-Threaded Thinking in Apache Spark Development en_US
dc.type Thesis en_US
dcterms.accessRights Absolute Embargo (No access without the author's permission)


Files in this item

There are no files associated with this item.
