| dc.contributor.author | Ismayilova, Kamila | |
| dc.date.accessioned | 2025-10-27T07:14:25Z | |
| dc.date.available | 2025-10-27T07:14:25Z | |
| dc.date.issued | 2025-04 | |
| dc.identifier.uri | http://hdl.handle.net/20.500.12181/1502 | |
| dc.description.abstract | This thesis explores the impact of single-threaded thinking on the performance and efficiency of Apache Spark applications. Many developers transitioning from local computing models struggle to fully embrace Spark’s distributed nature, which leads to inefficient data partitioning, improper resource management, excessive shuffling, and overall performance loss. This research identifies these challenges, quantifies their impact, and proposes best practices to exploit Spark’s distributed computing capabilities. The study also introduces cognitive frameworks to help developers shift from sequential execution models to a parallel, distributed mindset. A mixed-methods approach is used to analyze and benchmark common inefficiencies, including repeated computation caused by misunderstanding lazy evaluation, failure to broadcast small datasets in joins, and improper handling of data skew and partition sizing. Experiments measure execution time, memory usage, shuffle spill volume, and CPU utilization before and after optimization. Datasets ranging from 10 million to 400 million records, including real financial transaction logs, are used to ensure realistic testing at scale. Findings show that applying distributed-first principles, such as custom partitioning, pre-shuffle aggregation, and optimized caching, can improve execution time by up to 45%, reduce shuffle volume by 60%, and improve memory efficiency by 30%. The study contributes actionable optimization guidelines for Spark developers, proposes a cognitive framework for shifting from sequential to distributed thinking, and highlights the need for educational resources to support this paradigm shift. By examining both the technical and human barriers encountered in the transition from local to distributed computing, this research clarifies why developers often fail to realize the full value of distributed data processing frameworks such as Apache Spark, and supports more efficient use of Spark in large-scale data processing environments. | en_US |
| dc.language.iso | en | en_US |
| dc.publisher | ADA University | en_US |
| dc.rights | Attribution-NonCommercial-NoDerivs 3.0 United States | * |
| dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/3.0/us/ | * |
| dc.subject | Data partitioning (Computer science) | en_US |
| dc.subject | Cognitive psychology -- Computer programming | en_US |
| dc.subject | Software engineering -- Best practices | en_US |
| dc.subject | Computer systems -- Performance | en_US |
| dc.subject | Parallel processing (Electronic computers) | en_US |
| dc.title | From Local to Distributed: Avoiding Single-Threaded Thinking in Apache Spark Development | en_US |
| dc.type | Thesis | en_US |
| dcterms.accessRights | Absolute Embargo (No access without the author's permission) | |
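
The abstract above refers to broadcast joins, caching under lazy evaluation, and pre-shuffle aggregation. The sketch below is a minimal PySpark illustration of two of these techniques; it is not the thesis's benchmark code, and the dataset paths and column names (transactions, merchant_dim, merchant_id, amount, date) are hypothetical.

```python
# Minimal PySpark sketch, assuming hypothetical parquet inputs: a large fact
# table (transactions) and a small lookup table (merchant_dim).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("distributed-first-sketch").getOrCreate()

transactions = spark.read.parquet("/data/transactions")   # large table
merchant_dim = spark.read.parquet("/data/merchant_dim")   # small table

# Broadcasting the small side lets the join run map-side, so the large table
# is not shuffled across the cluster.
enriched = transactions.join(F.broadcast(merchant_dim), on="merchant_id")

# Cache before reuse: because Spark evaluates transformations lazily, each
# action below would otherwise re-read the inputs and recompute the join.
enriched.cache()

# groupBy/agg on the DataFrame API performs partial (per-partition)
# aggregation before the shuffle, reducing shuffle volume.
daily_totals = (enriched
                .groupBy("merchant_id", "date")
                .agg(F.sum("amount").alias("total")))

print(enriched.count())   # first action materializes the cache
daily_totals.show(10)     # second action reuses the cached join result
```

Without the explicit broadcast and cache, Spark would shuffle both join inputs and recompute the join for every action, which is the kind of single-threaded, recompute-on-demand habit the thesis argues against.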
| Files | Size | Format | View |
|---|---|---|---|

There are no files associated with this item.
The following license files are associated with this item:

- Attribution-NonCommercial-NoDerivs 3.0 United States (http://creativecommons.org/licenses/by-nc-nd/3.0/us/)