Code Workbook allows users to use both SparkR and native R. SparkR provides a distributed data frame implementation that supports operations like selection, filtering, and aggregation on large datasets. While native R may be more familiar, we recommend first using SparkR to filter large datasets down to a manageable size before operating on them with native R.
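For example, the sketch below filters a large dataset with SparkR and then uses SparkR::collect to bring the much smaller result into a native R data.frame (the dataframe and column names here are illustrative):
# Filter the distributed dataframe down to a small subset with SparkR
filtered <- SparkR::filter(df, df$year == 2023)
# collect() pulls the remaining rows to the driver as a native R data.frame;
# only do this once the data comfortably fits in memory
local_df <- SparkR::collect(filtered)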
Common SparkR operations
Read the full API documentation ↗ for SparkR to see all possible operations. Below, we outline syntax for common operations.
Filtering
Filter expressions can be a SQL-like WHERE clause passed as a string.
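For example, a minimal sketch (the column names are illustrative):
# Keep only rows that satisfy a SQL-like WHERE clause
df <- SparkR::filter(df, "age > 21 AND country = 'US'")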
Creating derived columns
Use SparkR::withColumn to add a new column derived from existing columns.
# Add a new column that is the sum of two existing columns
df <- SparkR::withColumn(df, 'col1_plus_col2', df$col1 + df$col2)
# Add a new column that multiplies an existing column by a constant
df <- SparkR::withColumn(df, 'col1_times_60', df$col1 * 60)
Aggregations
Use SparkR::groupBy and SparkR::agg to compute aggregates. Calling SparkR::groupBy creates a grouped-data object; pass that object into SparkR::agg to get an aggregated dataframe.
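For example, a minimal sketch (the column names are illustrative):
# Group rows by department
grouped <- SparkR::groupBy(df, 'department')
# Compute per-group aggregates; the named arguments become the output column names
df_agg <- SparkR::agg(grouped, avg_salary = SparkR::avg(df$salary), n_employees = SparkR::count(df$salary))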