What is Spark?
Spark is a distributed computing system used within Foundry to run data transformations at scale. It was originally created by a team of researchers at UC Berkeley and was subsequently donated to the Apache Software Foundation in 2013. Foundry allows you to run SQL, Python, Java, and Mesa transformations (Mesa is a proprietary Java-based DSL) on large amounts of data, using Spark as the foundational computation layer.
How does Spark work?
Spark relies on distributing a job across many computers at once to process data, which also lets many jobs from different users and projects run quickly in parallel. The computation model generalizes MapReduce: work is split into tasks that run simultaneously on different machines, and the partial results are then combined. These computers are divided into a driver, which plans and coordinates the job, and executors, which perform the actual work.
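The driver/executor split can be illustrated without Spark itself. Below is a minimal sketch in plain Python (not the Spark API) of the MapReduce pattern described above: the "driver" code partitions the data, a pool of parallel workers plays the role of executors over their partitions, and the driver then combines the partial results.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_phase(chunk):
    # Each "executor" computes a partial result over its own partition.
    return sum(chunk)

data = list(range(1, 101))
# The "driver" splits the data into four partitions (round-robin striding).
chunks = [data[i::4] for i in range(4)]
# Map: run the partitions in parallel on a pool of workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(map_phase, chunks))
# Reduce: the driver combines the partial results into the final answer.
total = reduce(lambda a, b: a + b, partials)
print(total)  # prints 5050
```

In real Spark the partitions live on different machines and the driver only ships code and collects results, but the map-then-reduce shape is the same.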
What are Spark profiles?
The resources a job receives are controlled by Spark profiles. When a job needs more resources, increase one profile level at a time: for example, move from EXECUTOR_MEMORY_SMALL to EXECUTOR_MEMORY_MEDIUM, then run the job again before adjusting anything else. This helps prevent incurring unnecessary costs by over-allocating resources to your job.

Jobs start with the SMALL defaults: EXECUTOR_CORES_SMALL, EXECUTOR_MEMORY_SMALL, DRIVER_CORES_SMALL, DRIVER_MEMORY_SMALL, and NUM_EXECUTORS_2.

If your job is failing for lack of memory, bump the memory profile from SMALL to MEDIUM. This should help if you are processing large amounts of data.
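In a Foundry Python transform, profiles are requested with the `configure` decorator from the transforms API. The snippet below is an illustrative sketch only: the dataset paths and transform logic are placeholder examples, and it assumes a repository where the requested profile has been imported and permitted.

```python
from transforms.api import configure, transform_df, Input, Output

# Request a larger executor memory profile than the SMALL default.
# The profile must be enabled for this repository by an administrator.
@configure(profile=["EXECUTOR_MEMORY_MEDIUM"])
@transform_df(
    Output("/Project/datasets/cleaned"),    # placeholder path
    source=Input("/Project/datasets/raw"),  # placeholder path
)
def compute(source):
    # Placeholder workload: deduplicate a large input dataset.
    return source.dropDuplicates()
```

If the job still fails after this single change, revert or adjust one profile at a time rather than stacking several increases at once.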
Before moving from MEDIUM to LARGE, consult an expert for help, and consider simplifying your transform if possible, as described in the troubleshooting guide.

As a recommended permissioning scheme: NUM_EXECUTORS_32 and EXECUTOR_MEMORY_LARGE (and above) should be available only upon request and approval of that request. Anything above EXECUTOR_CORES_SMALL should be heavily controlled, because raising cores is a "stealth" way to increase computing power and it is preferable to funnel users to NUM_EXECUTORS profiles in almost all cases. Mid-size profiles (above EXECUTOR_CORES_SMALL and EXECUTOR_MEMORY_MEDIUM) should be approved by an administrator. Block off EXECUTOR_CORES_EXTRA_SMALL and EXECUTOR_MEMORY_LARGE. If a user is asking for these, it usually indicates either subpar optimization or a critical workflow.