Ajay Gupta
Jun 29

Cluster computing goes local with Spark Connect

Gone are the days when data/ML engineers had to repeatedly package their data processing logic into a Spark app and submit it to the cluster in order to test, customize, and optimize that logic. Spark Connect can now power the local computing environment across all platforms with seamless access to Spark’s cluster computing engine. — Cluster computing on Spark is accessible mainly either through a Spark shell launched on a node that has access to the cluster, or by packaging the desired data processing logic in a Spark app and submitting the latter to the cluster manager via the spark-submit command, but the submission again has…
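To make the idea concrete, here is a minimal sketch (not taken from the article) of a Scala client attaching to a remote Spark Connect endpoint. It assumes Spark 3.4+ with the spark-connect-client-jvm dependency on the classpath; the host and port are placeholders.

    import org.apache.spark.sql.SparkSession

    object ConnectDemo {
      def main(args: Array[String]): Unit = {
        // Attach to a remote Spark Connect server instead of starting a
        // local driver; "sc://spark-cluster-host:15002" is a placeholder
        // (15002 is the server's default port).
        val spark = SparkSession.builder()
          .remote("sc://spark-cluster-host:15002")
          .getOrCreate()

        // The regular DataFrame API works unchanged; planning and execution
        // happen on the remote cluster, not on this machine.
        spark.range(0, 1000).selectExpr("sum(id) as total").show()

        spark.stop()
      }
    }

The same pattern is available from Python via SparkSession.builder.remote(...), which is what turns local notebooks and IDEs into first-class clients of the cluster.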

Programming

3 min read

Jun 23

Z Order Optimization for Generic Multi Dimensional predicates

Data-skipping logic is an integral part of the advanced table formats that store huge data sets. Read this blog to understand how Z-order optimization helps these formats achieve it effectively when predicate patterns are not known in advance. — Data skipping at source is very important for executing queries against large data tables based on advanced table formats, such as Delta and Iceberg. …
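As a rough illustration of the underlying idea, bit interleaving into a Morton value, and not of Delta's or Iceberg's actual implementation, consider this Scala sketch:

    object ZOrderSketch {
      // Interleave the lower 16 bits of x and y into a Morton (Z-order) value.
      def zValue(x: Int, y: Int): Long = {
        var z = 0L
        var i = 0
        while (i < 16) {
          z |= ((x >> i) & 1L) << (2 * i)     // x supplies the even bit positions
          z |= ((y >> i) & 1L) << (2 * i + 1) // y supplies the odd bit positions
          i += 1
        }
        z
      }

      def main(args: Array[String]): Unit = {
        // Sorting by zValue clusters rows that are close in BOTH dimensions,
        // so per-file min/max statistics can skip files for predicates on
        // either column.
        val points = Seq((3, 5), (3, 6), (200, 5), (201, 6))
        points.sortBy { case (x, y) => zValue(x, y) }.foreach(println)
      }
    }

Delta Lake exposes this technique through OPTIMIZE ... ZORDER BY (col1, col2); the sketch above only shows why such an ordering skips data well for multi-dimensional predicates.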

Data Science

4 min read

Published in Towards Data Science · Jun 17, 2022

Five Tips to Fasten Skewed Joins in Apache Spark

Skewed joins lead to stragglers in a Spark job, bringing down the overall efficiency of the job. Here are five tips to address skewed joins in different situations. — Joins are one of the most fundamental transformations in a typical data processing routine. A join operator makes it possible to correlate, enrich, and filter across two input datasets. The two input datasets are generally classified as a left dataset and a right dataset based on their placing with…
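One widely used remedy, key salting, is sketched below in Scala; the salt count of 8 and the toy data are illustrative, and the article's five tips may go beyond this single technique.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object SaltedJoinSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("salted-join").getOrCreate()
        import spark.implicits._

        val salts = 8 // number of salt buckets; tune to the observed skew

        // Skewed left dataset: many rows share the same join key.
        val facts = Seq.fill(10000)(("hot", 1)).toDF("key", "v")
        // Smaller right dataset.
        val dims = Seq(("hot", "x"), ("cold", "y")).toDF("key", "d")

        // Add a random salt to the skewed side...
        val saltedFacts = facts.withColumn("salt", (rand() * salts).cast("int"))
        // ...and replicate the other side once per salt value.
        val saltedDims = dims.withColumn("salt", explode(array((0 until salts).map(lit): _*)))

        // The join key becomes (key, salt), spreading the hot key over
        // 'salts' tasks instead of one straggler task.
        saltedFacts.join(saltedDims, Seq("key", "salt")).groupBy("key").count().show()

        spark.stop()
      }
    }

Since Spark 3.0, adaptive query execution can also split skewed partitions automatically (spark.sql.adaptive.skewJoin.enabled), which removes the need for manual salting in many cases.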

Data Science

9 min read

Published in Towards Data Science · Jun 6, 2022

Coalescing Vs. Dynamic Coalescing in Apache Spark

Use of coalesce in Spark applications is set to increase with the default enablement of ‘dynamic coalescing’ in Spark 3.0. You no longer need to adjust shuffle partition counts manually, nor will you feel restricted by the ‘spark.sql.shuffle.partitions’ value. Read this story to know more. — The importance of partitioning: the right set of partitions is like a holy grail for the optimum execution of a Spark application. A Spark application achieves optimum efficiency when each of its constituent stages executes optimally. This in turn implies that each stage should run on an optimum number of partitions, which…
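A minimal sketch of the contrast, assuming Spark 3.x; the configuration keys are the standard AQE ones:

    import org.apache.spark.sql.SparkSession

    object CoalesceSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("dynamic-coalesce")
          // Adaptive Query Execution merges small shuffle partitions at runtime.
          .config("spark.sql.adaptive.enabled", "true")
          .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
          .getOrCreate()

        val df = spark.range(0, 1000000).selectExpr("id % 100 as k", "id as v")

        // With AQE on, the post-shuffle partition count of this aggregation
        // is decided at runtime from actual data sizes rather than taken
        // verbatim from spark.sql.shuffle.partitions.
        val agg = df.groupBy("k").count()
        println(agg.rdd.getNumPartitions)

        // Manual alternative: coalesce() reduces partitions without a shuffle.
        println(agg.coalesce(4).rdd.getNumPartitions)

        spark.stop()
      }
    }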

Apache Spark

7 min read

May 31, 2022

Linearizability And/Vs Serializability in Distributed Databases

Linearizability and serializability together constitute the gold standard for consistency in distributed databases. Against this gold standard, one can evaluate the consistency promises of various distributed database solutions. — Before we look at linearizability and serializability together, let us first understand the two concepts in isolation. Linearizability: a data storage system exhibits linearizability when, for a set of operations performed on a single data object, two conditions are satisfied. The first condition states that the execution is atomic for each…
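As a small JVM illustration of the single-object, atomic-operation part of that definition (serializability, which concerns multi-object transactions, is deliberately not captured here):

    import java.util.concurrent.atomic.AtomicLong

    object LinearizableRegister {
      def main(args: Array[String]): Unit = {
        // One shared object whose operations take effect atomically, in a
        // total order consistent with real time: the essence of
        // linearizability on a single data object.
        val counter = new AtomicLong(0)

        val threads = (1 to 4).map { _ =>
          new Thread(() => {
            for (_ <- 1 to 1000) {
              // compareAndSet retries until the read-modify-write appears
              // to occur at one instant between invocation and response.
              var done = false
              while (!done) {
                val cur = counter.get()
                done = counter.compareAndSet(cur, cur + 1)
              }
            }
          })
        }
        threads.foreach(_.start())
        threads.foreach(_.join())

        println(counter.get()) // always 4000: no increment is ever lost
      }
    }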

Database

6 min read

May 15, 2022

A Tale of Two Transformations: Map & MapPartitions in Apache Spark

Spark provides two very important transformations, viz., Map and MapPartitions, for developers to accomplish certain custom data processing scenarios. However, due to the overlap in their capabilities, there always remains some ambiguity among developers about choosing rightly between Map and MapPartitions… — Map and MapPartitions both fall in the category of narrow transformations, as there is a one-to-one mapping between output and input partitions when either is executed. Both transformations also allow developers to code custom data processing logic on typed records. However, from a functionality perspective:
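A minimal sketch of the practical difference, using a made-up Heavy resource to stand in for an expensive initialization such as a database connection or a parser:

    import org.apache.spark.sql.SparkSession

    object MapVsMapPartitions {
      // Stand-in for an expensive, reusable resource.
      class Heavy { def transform(s: String): String = s.toUpperCase }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("map-vs-mappartitions").getOrCreate()
        val rdd = spark.sparkContext.parallelize(Seq("a", "b", "c", "d"), 2)

        // map: the function runs once per record, so Heavy is constructed
        // for every single element.
        val perRecord = rdd.map(s => new Heavy().transform(s))

        // mapPartitions: the function runs once per partition, so Heavy is
        // constructed once per partition and reused across its records.
        val perPartition = rdd.mapPartitions { iter =>
          val heavy = new Heavy()
          iter.map(heavy.transform)
        }

        println(perRecord.collect().mkString(","))
        println(perPartition.collect().mkString(","))
        spark.stop()
      }
    }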

Apache Spark

4 min read

Published in The Startup · Nov 22, 2020

Troubleshooting Stragglers in Your Spark Application

Stragglers in your Spark application affect overall application performance and waste premium resources. — Stragglers are detrimental to the overall performance of Spark applications and lead to resource wastage on the underlying cluster. It is therefore important to identify potential stragglers in your Spark job, find the root cause behind them, and apply the required fixes or preventive measures.
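One standard mitigation, though not necessarily the article's approach, is Spark's speculative execution, which re-launches suspiciously slow tasks on other executors. The thresholds below are Spark's documented defaults, shown explicitly:

    import org.apache.spark.sql.SparkSession

    object SpeculationSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("speculation")
          // Re-launch tasks that run far slower than their stage's median.
          .config("spark.speculation", "true")
          // A task is a straggler candidate once it is 1.5x slower than the
          // median task of its stage...
          .config("spark.speculation.multiplier", "1.5")
          // ...and only after 75% of the stage's tasks have finished.
          .config("spark.speculation.quantile", "0.75")
          .getOrCreate()

        spark.range(0, 1000000).selectExpr("sum(id)").show()
        spark.stop()
      }
    }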

Technology

6 min read

Published in Towards Data Science · Nov 2, 2020

Four Ways to Filter a Spark Dataset Against a Collection of Data Values

Filtering a Spark Dataset against a collection of data values is commonly encountered in many data analytics flows. This story explains four different ways to achieve it. — Let us assume there is a very large Dataset ‘A’ with the following schema:

root
 |-- empId: Integer
 |-- sal: Integer
 |-- name: String
 |-- address: String
 |-- dept: Integer

The Dataset ‘A’ needs to be filtered against a set of employee IDs (empIds), ‘B’…
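Two of the commonly used approaches, in a minimal Scala sketch over a toy version of Dataset ‘A’ (the article itself covers four):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object FilterAgainstCollection {
      case class Emp(empId: Int, sal: Int, name: String, address: String, dept: Int)

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("filter-collection").getOrCreate()
        import spark.implicits._

        val a = Seq(
          Emp(1, 100, "n1", "a1", 10),
          Emp(2, 200, "n2", "a2", 20),
          Emp(3, 300, "n3", "a3", 10)
        ).toDS()
        val b = Seq(1, 3) // the collection of empIds to keep

        // Way 1: isin(), fine while B comfortably fits on the driver.
        a.filter(col("empId").isin(b: _*)).show()

        // Way 2: a broadcast left-semi join, which scales to a larger B by
        // shipping it once to every executor.
        a.join(broadcast(b.toDF("empId")), Seq("empId"), "left_semi").show()

        spark.stop()
      }
    }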

Data Science

5 min read

Published in Towards Data Science · Oct 22, 2020

Demystifying Joins in Apache Spark

This story is exclusively dedicated to the Join operation in Apache Spark, giving you an overall perspective of the foundation on which Spark’s join technology is built. — Join operations are often used in a typical data analytics flow in order to correlate two data sets. Apache Spark, being a unified analytics engine, has also provided a solid foundation for executing a wide variety of join scenarios.
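A minimal sketch of an equi join, plus a hint that nudges Spark toward a broadcast hash join; the data and column names are illustrative:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    object JoinSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("joins").getOrCreate()
        import spark.implicits._

        val orders = Seq((1, "o1"), (2, "o2"), (2, "o3")).toDF("custId", "order")
        val customers = Seq((1, "alice"), (2, "bob")).toDF("custId", "name")

        // Equi inner join: Spark picks the physical strategy (broadcast hash,
        // shuffle hash, or sort-merge) from table sizes and configuration.
        orders.join(customers, Seq("custId"), "inner").show()

        // Hinting the small side forces a broadcast hash join and avoids
        // shuffling the large side.
        orders.join(broadcast(customers), Seq("custId"), "inner").explain()

        spark.stop()
      }
    }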

Technology

9 min read

Oct 4, 2020

Guide to Spark Partitioning

Partitioning is one of the basic building blocks on which the Apache Spark framework is built. Just by setting the right partitioning across its various stages, a lot of Spark programs can be optimized right away. — I encountered Apache Spark around 4 years back, and since then, I have been architecting Spark applications meant for executing complex data processing flows on multiple massively sized data sets.
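A minimal sketch of the two most common knobs, the shuffle partition count and explicit repartitioning; the numbers are illustrative:

    import org.apache.spark.sql.SparkSession

    object PartitioningSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("partitioning")
          // Static partition count used after shuffles (unless AQE overrides it).
          .config("spark.sql.shuffle.partitions", "64")
          .getOrCreate()

        val df = spark.range(0, 1000000).selectExpr("id % 1000 as k", "id as v")
        println(df.rdd.getNumPartitions) // partitioning inherited from the source

        // repartition() shuffles into the requested count and key layout; use
        // it to fix skew or to match downstream parallelism.
        val byKey = df.repartition(32, df("k"))
        println(byKey.rdd.getNumPartitions) // 32

        spark.stop()
      }
    }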

Technology

2 min read

Ajay Gupta

675 Followers

Leading Data Engineering Initiatives @ Jio, Apache Spark Specialist, Author, LinkedIn: https://www.linkedin.com/in/ajaywlan/

