Skewed data

The first part presents the concept of skewed data. The next 3 parts show how to deal with skewed data in Hive, Apache Spark and GCP.

Skewed data

A good concept helping to understand data skew is the Pareto principle. It's also known as the 80/20 rule and states that 80% of the effects come from 20% of the causes. In the data context, it means that 80% of the data is produced by 20% of the producers. Do you see where the problem is? Imagine that you're making a JOIN. If 80% of the joined data is about the same keys, you will end up with unbalanced partitions and, therefore, these unbalanced partitions will take more time to execute. A great real-world example of skewed data is the Power Law that I shortly described in the Graphs and data processing post.

To illustrate it, let's take social media influencers, who often have hundreds of thousands or millions of followers. On the opposite side, you have other users who have at most dozens of thousands of followers. Now, if you make a group by key operation, you move all the followers of the influential people to the same partition and you end up with skew, marked in red in the following image. To put it short, skewed data occurs when most of the dataset's rows are located on a small number of partitions.
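To make the distribution concrete, here is a minimal sketch in Scala assuming a local Spark installation; the "followers" dataset, its 80/20 split and the partition count are invented purely for illustration. It groups the rows by user and prints how many rows each partition holds.

```scala
import org.apache.spark.sql.SparkSession

object SkewIllustration extends App {
  val spark = SparkSession.builder()
    .appName("skew-illustration")
    .master("local[4]")
    .getOrCreate()
  val sc = spark.sparkContext

  // Synthetic "follower" events: a single influencer produces 80% of the rows,
  // 99 regular users share the remaining 20% (all numbers are made up).
  val followers = sc.parallelize(
    (1 to 80000).map(_ => ("influencer_1", 1)) ++
    (1 to 20000).map(i => (s"user_${i % 99}", 1))
  )

  // A group by key sends every row of a given user to the same partition...
  val grouped = followers.groupByKey(4)

  // ...so the partition that receives influencer_1 holds far more rows than the others.
  grouped
    .mapPartitionsWithIndex { case (index, groups) =>
      Iterator((index, groups.map(_._2.size).sum))
    }
    .collect()
    .foreach { case (index, rows) => println(s"partition $index -> $rows rows") }

  spark.stop()
}
```

Run locally, one of the 4 partitions should report roughly 80 000 rows while the others stay around a few thousand each: that single oversized partition is the straggler task.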

Since skewed data is not a new concept in data engineering, let's analyze the different solutions proposed by data frameworks and the community.

Hive is one of the first Open Source solutions with built-in skewed data management. It protects against skew for 2 operations, joins and group by, each with different configuration entries. The implementation for both operations is similar because Hive simply creates an extra MapReduce job for the skewed data. For the group by operation, the map output is randomly distributed to the reducers in order to avoid skew and the final reduce step aggregates the partial results. The logic behind skewed join management uses the same principle: Hive determines whether a group of joined rows is skewed and, if it's the case, writes those rows to HDFS in order to launch an additional MapReduce operation for them. The results of this operation are later included in the final output. The following images summarize both approaches.
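The configuration entries behind this mechanism include hive.groupby.skewindata for the group by case and hive.optimize.skewjoin together with hive.skewjoin.key for joins. The snippet below is only a sketch gathering them in one place, written in Scala to keep a single language across the examples; the non-default values are arbitrary, and on a real cluster they would go into hive-site.xml or be issued as SET commands in a Hive session.

```scala
object HiveSkewSettings extends App {
  // Hive switches related to skew handling.
  val skewSettings = Map(
    // Group by: a first job spreads the map output randomly over the reducers
    // to pre-aggregate, a second job computes the final aggregation per key.
    "hive.groupby.skewindata" -> "true",
    // Joins: detect skewed keys at runtime and handle their rows in an extra job.
    "hive.optimize.skewjoin" -> "true",
    // Number of rows for a join key above which the key is considered skewed.
    "hive.skewjoin.key" -> "100000"
  )

  // Printed as SET commands purely for illustration.
  skewSettings.foreach { case (key, value) => println(s"SET $key=$value;") }
}
```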


Custom solution - Apache Spark

After Hive, let's focus now on Apache Spark. The framework natively supports a kind of Hive's skew join hints but, as of this writing, they're only available on the Databricks platform. So framework users working on other platforms will need other solutions.
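One commonly used workaround on vanilla Apache Spark, shown here only as an illustration and not necessarily the solution developed later in this post, is key salting: append a random suffix to the join key on the skewed side, duplicate every row of the smaller side once per possible suffix, join on the salted key and drop the salt afterwards. The DataFrames, column names and the salt factor in the sketch below are all assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SaltedJoin extends App {
  val spark = SparkSession.builder()
    .appName("salted-join")
    .master("local[4]")
    .getOrCreate()
  import spark.implicits._

  val saltBuckets = 8 // arbitrary: into how many pieces every key is scattered

  // Hypothetical inputs: a big, skewed fact table and a small dimension table.
  val follows = Seq.tabulate(100000) { i =>
    (if (i % 10 < 8) "influencer_1" else s"user_${i % 97}", i)
  }.toDF("user_id", "follower_id")
  val users = ("influencer_1" +: (0 until 97).map(i => s"user_$i"))
    .map(id => (id, s"name_$id")).toDF("user_id", "name")

  // 1) Salt the skewed side: each key is spread over `saltBuckets` sub-keys.
  val saltedFollows = follows
    .withColumn("salt", (rand() * saltBuckets).cast("int"))
    .withColumn("salted_user_id", concat($"user_id", lit("#"), $"salt".cast("string")))

  // 2) Explode the small side: one copy of every row per possible salt value.
  val saltedUsers = users
    .withColumn("salt", explode(array((0 until saltBuckets).map(lit): _*)))
    .withColumn("salted_user_id", concat($"user_id", lit("#"), $"salt".cast("string")))

  // 3) Join on the salted key: the rows of the hot key are now processed
  //    by up to `saltBuckets` tasks instead of a single one.
  val joined = saltedFollows.alias("f")
    .join(saltedUsers.alias("u"), "salted_user_id")
    .select($"f.user_id", $"follower_id", $"name")

  joined.show(5)
  spark.stop()
}
```

The price of this trick is the duplication of the small side (saltBuckets copies of every row), which is why the factor should stay small and, ideally, only the truly hot keys should be salted.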
