High Performance Spark: Price and Purchase

High Performance Spark

Best Practices for Scaling and Optimizing Apache Spark

Holden Karau, Adi Polak & Rachel Warren

Paperback: 412 pages

Publisher: O'Reilly

Edition: 2

Language: English

Year: 2026

ISBN: 9781098145859

Print options and prices:

Hardcover: 1,102,000 Toman

Softcover: 972,000 Toman

Clear plastic cover with spiral binding: 992,000 Toman

Text quality: Publisher's original

Trim size: B5

Page color: Colored text and boxes

Support available on holidays

Nationwide shipping

#Spark

#Apache_Spark

#Data

#SQL

#Kubernetes

#GPU

#PySpark

#data_engineers

#software_engineers

#data_scientists

#Scala

Description

Apache Spark is amazing when everything clicks. But if you haven't seen the performance improvements you expected or still don't feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau, Adi Polak, and Rachel Warren walk you through the secrets of the Spark code base and demonstrate performance optimizations that will help your data pipelines run faster, scale to larger datasets, and avoid costly antipatterns.

Ideal for data engineers, software engineers, data scientists, and system administrators, the second edition of High Performance Spark presents new use cases, code examples, and best practices for Spark 4.x and beyond. This book gives you a fresh perspective on this continually evolving framework and shows you how to work around bumps on your Spark and PySpark journey.

With this book, you'll learn how to:

Accelerate your ML workflows with integrations including PyTorch
Handle key skew and take advantage of Spark's new dynamic partitioning
Make your code reliable with scalable testing and validation techniques
Make Spark high performance
Deploy Spark on Kubernetes and similar environments
Take advantage of GPU acceleration with RAPIDS and resource profiles
Get your Spark jobs to run faster
Use Spark to productionize exploratory data science projects
Handle even larger datasets with Spark
Gain faster insights by reducing pipeline running times

Table of Contents

Chapter 1. Introduction to High Performance Spark

Chapter 2. How Spark Works

Chapter 3. Upgrading Spark

Chapter 4. What's New in Spark 4.2 Since 2.4

Chapter 5. DataFrames, Datasets, and Spark SQL

Chapter 6. Joins (SQL and Core)

Chapter 7. Effective Transformations

Chapter 8. Working with Key/Value Data

Chapter 9. Going Beyond Scala

Chapter 10. Spark: A Thoughtful Step into Generative AI

Chapter 11. Testing, Validation, and Side-By-Side Runs

Chapter 12. Spark Components and Packages

Appendix A. Just Enough Iceberg and Friends

Appendix B. Spark Connect

Appendix C. When Not to Use Spark

Appendix D. Advanced Task Scheduling: Gangs and Resource Profiles

Appendix E. Spark Streaming

Appendix F. The Spark Web UI: Debugging and Optimizing Your Jobs

From the Preface

We wrote this book for data engineers, data scientists, and ML practitioners who are looking to get the most out of Spark. If you’ve been working with Spark and invested in Spark but your experience so far has been mired by memory errors and mysterious, intermittent failures, this book is for you. If you have been using Spark for some exploratory work or experimenting with it on the side but have not felt confident enough to put it into production, this book may help. If you are enthusiastic about Spark but have not seen the performance improvements from it that you expected, we hope this book can help. This book is intended for those who have some working knowledge of Spark and may be difficult to understand for those with little or no experience with Spark or distributed computing.

We expect this text will be most useful to those who care about optimizing repeated queries in production, rather than to those who are primarily doing exploratory work. While writing highly performant queries is perhaps more important to the data engineer, writing those queries with Spark, in contrast to other frameworks, requires a good knowledge of the data, which is usually more intuitive to the data scientist. Thus, it may be more useful to a data engineer who may be less experienced with thinking critically about the statistical nature, distribution, and layout of data when considering performance. We hope that this book will help data engineers think more critically about their data as they put pipelines into production. Similarly for data scientists we hope to provide more understanding of how Spark works so they can use their knowledge of the data for high performance queries. We want to help our readers ask questions such as “How is my data distributed?” “Is it skewed?” “What is the range of values in a column?” and “How do we expect a given value to group?” and then apply the answers to those questions to the logic of their Spark queries.

However, even for data scientists using Spark mostly for exploratory purposes, this book should cultivate some important intuition about writing performant Spark queries, so that as the scale of the exploratory analysis inevitably grows, you may have a better shot of getting something to run the first time. We hope to guide data scientists, even those who are already comfortable thinking about data in a distributed way, to think critically about how their programs are evaluated, empowering them to explore their data more fully and more quickly and to communicate effectively with anyone helping them put their algorithms into production.

Regardless of your job title, it is likely that the amount of data with which you are working is growing quickly. Your original solutions may need to be scaled, and your old techniques for solving new problems may need to be updated. We hope this book will help you leverage Apache Spark to tackle new problems more easily and old problems more efficiently.
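The questions the preface raises ("How is my data distributed?", "Is it skewed?", "What is the range of values in a column?") can be illustrated with a small sketch. This is not taken from the book: it uses plain Python on a tiny hypothetical sample rather than Spark itself, and the helper names `key_distribution`, `skew_ratio`, and `value_range` are invented for illustration. On a real Spark DataFrame the same counts would typically come from something like `df.groupBy("key").count()`.

```python
from collections import Counter


def key_distribution(records, key_index=0):
    """Count how many records share each key."""
    return Counter(record[key_index] for record in records)


def skew_ratio(counts):
    """Largest group size divided by the mean group size.

    A value near 1.0 means keys are evenly spread; a large value
    means one key dominates (a hint that a join or grouping on this
    key may be lopsided).
    """
    if not counts:
        return 0.0
    mean_size = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean_size


def value_range(records, column_index):
    """Return (min, max) of one column."""
    values = [record[column_index] for record in records]
    return min(values), max(values)


# Hypothetical sample: (user_id, purchase_amount) pairs with one "hot" key.
sample = [("u1", 10)] * 8 + [("u2", 25), ("u3", 40)]

counts = key_distribution(sample)
print("most common key:", counts.most_common(1))
print("skew ratio:", skew_ratio(counts))
print("amount range:", value_range(sample, 1))
```

A skew ratio well above 1 on a sample like this is the kind of answer that, per the preface, should feed back into how a query is written, although the thresholds that matter in practice depend on the workload.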

About the Authors

Holden Karau is a queer transgender Canadian, Apache Spark committer, Apache Software Foundation member, and an active open source contributor. As a software engineer, she's worked on a variety of distributed computing, search, AI, and classification problems at Apple, Netflix, Google, IBM, Alpine, Databricks, Foursquare, and Amazon. She graduated from the University of Waterloo with a bachelor of mathematics in computer science. In addition to her big data work, she cofounded Fight Health Insurance to help patients appeal health insurance denials. Outside of software, she enjoys playing with fire, welding, riding motorcycles, eating poutine, and dancing.

For most of her professional life, Adi Polak has worked with data and machine learning. As a data practitioner, she developed algorithms to solve real-world problems using machine learning techniques. As an engineer, she led work that brought her hands-on machine learning experience into the products and services of various Fortune 500 companies, building on cutting-edge and emerging technologies. Adi has been working with and contributing to the Apache Spark community since 2013 and has taught Spark to thousands of students over the years. She is an official Databricks ambassador, the author of the successful book Scaling Machine Learning with Spark, and a respected worldwide presenter.

Rachel Warren is a data scientist and software engineer at Alpine Data Labs, where she uses Spark to address real-world data processing challenges. She has experience working as an analyst in both industry and academia. She graduated with a degree in computer science from Wesleyan University in Connecticut.


Similar books

Getting Started with DuckDB (Data): 946,000 Toman

Data-Driven Modeling (Data): 789,000 Toman

Data-Driven HR (Artificial intelligence): 678,000 Toman

Managing and Visualizing Your BIM Data (Data): 1,049,000 Toman

DuckDB: Up and Running (Data): 806,000 Toman

Cost-Effective Data Pipelines (Data): 770,000 Toman

Information Modeling and Relational Databases (Data): 2,283,000 Toman

Designing Cloud Data Platforms (Cloud): 861,000 Toman

Data Labeling in Machine Learning with Python (Data): 977,000 Toman

Learning Apache Drill (Data): 849,000 Toman
