From Local to Distributed: Avoiding Single-Threaded Thinking in Apache Spark Development

Ismayilova, Kamila

dc.contributor.author	Ismayilova, Kamila
dc.date.accessioned	2025-10-27T07:14:25Z
dc.date.available	2025-10-27T07:14:25Z
dc.date.issued	2025-04
dc.identifier.uri	http://hdl.handle.net/20.500.12181/1502
dc.description.abstract	This thesis explores the impact of single-threaded thinking on the performance and efficiency of Apache Spark applications. Many developers transitioning from local computing models struggle to fully embrace Spark’s distributed nature, which leads to inefficient data partitioning, improper resource management, excessive shuffling, and overall performance loss. This research identifies these challenges, quantifies their impact, and proposes best practices for making full use of Spark’s distributed computing capabilities. The study also introduces cognitive frameworks to help developers shift from sequential execution models to a parallel, distributed mindset. A mixed-methods approach is used to analyze and benchmark common inefficiencies, including repeated computation caused by misunderstanding lazy evaluation, failure to broadcast small datasets in joins, and improper handling of data skew and partition sizing. Experiments measure execution time, memory usage, shuffle spill volumes, and CPU utilization before and after optimization. Datasets ranging from 10 million to 400 million records, including real financial transaction logs, are used to ensure realistic testing at scale. Findings show that applying distributed-first principles, such as custom partitioning, pre-shuffle aggregation, and optimized caching, can improve execution time by up to 45%, reduce shuffle volume by 60%, and enhance memory efficiency by 30%. The study contributes actionable optimization guidelines for Spark developers and highlights the need for educational resources to aid this paradigm shift. By addressing both technical and cognitive barriers, this research supports more efficient use of Spark in large-scale data processing environments.
This research also underscores the human factors that impede the proper use of distributed systems, examining both cognitive and technical barriers. By interpreting the obstacles that developers encounter in the transition from local to distributed computing, it aims to provide a better understanding of the problems that arise when attempting to realize the value of distributed data processing frameworks such as Apache Spark.	en_US
dc.language.iso	en	en_US
dc.publisher	ADA University	en_US
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 United States	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/us/	*
dc.subject	Data partitioning (Computer science)	en_US
dc.subject	Cognitive psychology -- Computer programming	en_US
dc.subject	Software engineering -- Best practices	en_US
dc.subject	Computer systems -- Performance	en_US
dc.subject	Parallel processing (Electronic computers)	en_US
dc.title	From Local to Distributed: Avoiding Single-Threaded Thinking in Apache Spark Development	en_US
dc.type	Thesis	en_US
dcterms.accessRights	Absolute Embargo (No access without the author's permission)

Files in this item

Files	Size	Format	View
There are no files associated with this item.

This item appears in the following Collection(s)

School of Information Technologies and Engineering

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 United States
