There are several ways to produce aggregations in PySpark. We recommend the following syntax as the most reliable.
```python
from pyspark.sql import functions as F

aggregated_df = df.groupBy('state').agg(
    F.max('city_population').alias('largest_city_in_state')
)
```
Input:

| state | city | city_population |
|---|---|---|
| WA | Bellevue | 100000 |
| WA | Seattle | 700000 |
| PA | Philadelphia | 1500000 |
| PA | Pittsburgh | 300000 |
Output:

| state | largest_city_in_state |
|---|---|
| PA | 1500000 |
| WA | 700000 |
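For reference, here is a minimal sketch that recreates the input table above and runs the recommended aggregation, assuming an active `SparkSession` named `spark`:

```python
from pyspark.sql import functions as F

# Hypothetical setup: rebuild the sample input table from this example.
df = spark.createDataFrame(
    [
        ('WA', 'Bellevue', 100000),
        ('WA', 'Seattle', 700000),
        ('PA', 'Philadelphia', 1500000),
        ('PA', 'Pittsburgh', 300000),
    ],
    ['state', 'city', 'city_population'],
)

# Largest city population per state, as in the output table above.
aggregated_df = df.groupBy('state').agg(
    F.max('city_population').alias('largest_city_in_state')
)
```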
The .groupBy() method on a DataFrame takes an arbitrary number of columns to group by. Alternatively, pass no columns to aggregate across the whole DataFrame.
```python
aggregated_df = df.groupBy('state', 'county').agg(
    F.max('city_population').alias('largest_city_in_state_county')
)

aggregated_df = df.groupBy().agg(
    F.max('city_population').alias('largest_city_overall')
)
```
The .agg() method on a grouped DataFrame takes an arbitrary number of aggregation functions.
```python
aggregated_df = df.groupBy('state').agg(
    F.max('city_population').alias('largest_city_in_state'),
    F.avg('city_population').alias('average_population_in_state')
)
```
By default, aggregations produce columns of the form `aggregation_name(target_column)`. However, column names in Foundry cannot contain parentheses or other non-alphanumeric characters, so alias each aggregation to a specific name instead.
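For example, a short sketch of the difference (the default name `max(city_population)` is what Spark generates when no alias is given):

```python
# Without an alias the result column is named 'max(city_population)',
# which Foundry rejects because of the parentheses.
unaliased_df = df.groupBy('state').agg(F.max('city_population'))

# With an alias the result column has a clean, alphanumeric name.
aliased_df = df.groupBy('state').agg(
    F.max('city_population').alias('largest_city_in_state')
)
```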
Pivot tables in PySpark work very similarly to ordinary grouped aggregations.
```python
pivoted_df = df.groupBy('equipment').pivot('sensor').mean('value')
```
Input:

| equipment | sensor | value |
|---|---|---|
| A | temperature | 60 |
| A | temperature | 40 |
| B | speed | 6 |
| A | speed | 3 |
Output:

| equipment | temperature | speed |
|---|---|---|
| A | 50 | 3 |
| B | null | 6 |
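When the pivot values are known in advance, you can pass them explicitly as the second argument to `.pivot()`; this sketch assumes the `temperature` and `speed` values from the tables above. Supplying the values lets Spark skip the extra pass it would otherwise run to discover the distinct values, and it fixes the output columns even when a value is absent from the data:

```python
# Explicit pivot values: avoids a job to compute distinct sensor values
# and guarantees both columns appear in the output.
pivoted_df = (
    df.groupBy('equipment')
    .pivot('sensor', ['temperature', 'speed'])
    .agg(F.mean('value'))
)
```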
Learn more about PySpark aggregate functions in the Spark documentation. Available aggregate functions include:

- `avg(column)` / `mean(column)`
- `collect_list(column)`
- `collect_set(column)`
- `count(column)`
- `corr(x, y)`
- `covar_pop(col1, col2)`
- `covar_samp(col1, col2)`
- `countDistinct(column, *cols)`
- `first(column, ignorenulls=False)`
- `grouping(column)`
- `grouping_id(*cols)`
- `kurtosis(column)`
- `last(column, ignorenulls=False)`
- `max(column)`
- `min(column)`
- `skewness(column)`
- `stddev(column)`
- `stddev_pop(column)`
- `stddev_samp(column)`
- `sum(column)`
- `sumDistinct(column)`
- `var_pop(column)`
- `var_samp(column)`
- `variance(column)`
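As a sketch of how several of these compose, using the state/city columns from the first example, any number of them can be combined in a single `.agg()` call:

```python
# One pass over the grouped data computes all four aggregations.
summary_df = df.groupBy('state').agg(
    F.count('city').alias('city_count'),
    F.avg('city_population').alias('average_population'),
    F.min('city_population').alias('smallest_city_population'),
    F.collect_set('city').alias('cities'),
)
```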