Databases | Practical class 9
Using Aggregate Functions
1. Introduction to Aggregate Functions
In database management, we often need to perform calculations not just on individual rows, but across multiple rows to get summary information. For instance, finding the total number of employees, the average salary in a department, or the maximum price of a product.
Aggregate functions are designed precisely for this purpose. They take a set of input values (typically from a column across multiple rows) and return a single summary value.
In SQL, and specifically PostgreSQL, the most common aggregate functions are:
COUNT()
: Counts the number of rows or values.SUM()
: Calculates the sum of values in a column.AVG()
: Computes the average of values in a column.MIN()
: Finds the minimum value in a column.MAX()
: Finds the maximum value in a column.
These functions are essential tools for data analysis and reporting directly within the database.
2. Setting Up the Sample Data
For our examples, we will use a simplified employees
table. If you don’t have this table, please execute the following SQL commands in the pgAdmin 4 Query Tool to create and populate it.
Using pgAdmin 4 Query Tool:
- In the pgAdmin 4 browser tree, navigate to your database.
- Right-click on the database name (e.g.,
postgres
) or thepublic
schema. - Select “Query Tool”.
- Copy and paste the SQL code below into the Query Editor panel.
- Execute the code (click the “Execute/Refresh” button - often a lightning bolt icon, or press F5).
-- Drop table if it exists to start fresh
DROP TABLE IF EXISTS employees;
DROP TABLE IF EXISTS departments;
-- Create Departments Table
CREATE TABLE departments (
department_id SERIAL PRIMARY KEY,
department_name VARCHAR(100) NOT NULL UNIQUE
);
-- Create Employees Table
CREATE TABLE employees (
employee_id SERIAL PRIMARY KEY,
first_name VARCHAR(50) NOT NULL,
last_name VARCHAR(50) NOT NULL,
hire_date DATE NOT NULL,
job_title VARCHAR(100),
salary NUMERIC(10, 2),
department_id INT,
FOREIGN KEY (department_id) REFERENCES departments(department_id)
);
-- Insert Sample Data into Departments
INSERT INTO departments (department_name) VALUES
('Human Resources'),
('Engineering'),
('Sales'),
('Marketing');
-- Insert Sample Data into Employees
INSERT INTO employees (first_name, last_name, hire_date, job_title, salary, department_id) VALUES
('Alice', 'Smith', '2020-03-15', 'Software Engineer', 75000.00, 2),
('Bob', 'Johnson', '2019-08-01', 'Senior Software Engineer', 95000.00, 2),
('Charlie', 'Williams', '2021-01-10', 'HR Manager', 80000.00, 1),
('Diana', 'Brown', '2022-05-20', 'Sales Representative', 60000.00, 3),
('Ethan', 'Jones', '2021-11-01', 'Marketing Specialist', 65000.00, 4),
('Fiona', 'Garcia', '2020-07-12', 'Software Engineer', 72000.00, 2),
('George', 'Miller', '2022-02-28', 'Sales Manager', 85000.00, 3),
('Hannah', 'Davis', '2023-01-05', 'Junior Developer', 60000.00, 2),
('Ian', 'Rodriguez', '2021-06-15', NULL, 55000.00, 3), -- NULL job title
('Julia', 'Martinez', '2019-12-01', 'HR Assistant', 48000.00, 1),
('Kevin', 'Hernandez', '2023-03-01', 'Intern', 30000.00, NULL); -- NULL department_id
Verify the data by running SELECT * FROM employees;
and SELECT * FROM departments;
in the Query Tool.
3. Core Aggregate Functions Explained
Let’s explore each function with examples using our employees
table. Execute these queries in the pgAdmin 4 Query Tool.
3.1. COUNT()
The COUNT()
function counts the number of rows or values.
-
COUNT(*)
: Counts the total number of rows in the result set (including rows with NULL values).-- Count all employees SELECT COUNT(*) AS total_employees FROM employees;
Result: This will return the total number of rows inserted into the
employees
table (which is 11 in our sample data). -
COUNT(column_name)
: Counts the number of rows where the specifiedcolumn_name
is NOT NULL.-- Count employees with a known job title SELECT COUNT(job_title) AS employees_with_job_title FROM employees;
Result: This counts rows where
job_title
is not NULL (10 in our sample data, excluding Ian Rodriguez).-- Count employees assigned to a department SELECT COUNT(department_id) AS employees_in_departments FROM employees;
Result: This counts rows where
department_id
is not NULL (10 in our sample data, excluding Kevin Hernandez). -
COUNT(DISTINCT column_name)
: Counts the number of unique, non-NULL values in the specified column.-- Count the number of distinct job titles SELECT COUNT(DISTINCT job_title) AS distinct_job_titles FROM employees;
Result: This returns the count of unique job titles present (e.g., ‘Software Engineer’, ‘Senior Software Engineer’, ‘HR Manager’, etc., excluding NULL). Check the data to verify the count (should be 8).
3.2. SUM()
The SUM()
function calculates the sum of all non-NULL values in a numeric column. It returns NULL if there are no non-NULL values.
-- Calculate the total salary expenditure
SELECT SUM(salary) AS total_salary_expenditure
FROM employees;
Result: This adds up all the values in the salary
column for all employees.
-- Calculate the total salary for the Engineering department (department_id = 2)
-- Note: This uses WHERE, covered in P5.
SELECT SUM(salary) AS engineering_salary_total
FROM employees
WHERE department_id = 2;
Result: This sums the salaries only for employees whose department_id
is 2.
3.3. AVG()
The AVG()
function calculates the average (mean) of all non-NULL values in a numeric column. It ignores NULL values in its calculation.
-- Calculate the average salary of all employees
SELECT AVG(salary) AS average_salary
FROM employees;
Result: This returns the average of all non-NULL salary values. PostgreSQL typically returns a numeric
or double precision
type for accuracy.
-- Calculate the average salary for 'Software Engineer' job title
SELECT AVG(salary) AS avg_salary_software_engineer
FROM employees
WHERE job_title = 'Software Engineer';
Result: This calculates the average salary specifically for employees with the job title ‘Software Engineer’.
3.4. MIN()
The MIN()
function finds the minimum (lowest) value in a column among non-NULL values. It works with numeric, string, and date/time types.
-- Find the lowest salary among all employees
SELECT MIN(salary) AS lowest_salary
FROM employees;
Result: Returns the minimum value from the salary
column.
-- Find the earliest hire date
SELECT MIN(hire_date) AS earliest_hire_date
FROM employees;
Result: Returns the earliest date from the hire_date
column.
3.5. MAX()
The MAX()
function finds the maximum (highest) value in a column among non-NULL values. It also works with numeric, string, and date/time types.
-- Find the highest salary among all employees
SELECT MAX(salary) AS highest_salary
FROM employees;
Result: Returns the maximum value from the salary
column.
-- Find the latest hire date
SELECT MAX(hire_date) AS latest_hire_date
FROM employees;
Result: Returns the most recent date from the hire_date
column.
4. Using Aggregate Functions with GROUP BY
The true power of aggregate functions is often realized when used in conjunction with the GROUP BY
clause (covered in P6). GROUP BY
divides the rows of a table into groups based on the values in one or more columns. Aggregate functions then operate independently on each of these groups.
Rule Reminder: When using GROUP BY
, any column in the SELECT
list that is not an aggregate function must be included in the GROUP BY
clause. This is because the database needs to know how to present the non-aggregated values for each group (they must be the same within the group).
Examples:
-- Count the number of employees in each department
SELECT
department_id,
COUNT(*) AS num_employees -- Counts all employees within each group
FROM employees
WHERE department_id IS NOT NULL -- Optional: Exclude employees with no department
GROUP BY department_id
ORDER BY department_id; -- Optional: Sort the results
Result: Shows each department_id
and the count of employees belonging to it.
-- Calculate the average salary for each job title
SELECT
job_title,
AVG(salary) AS average_salary_for_job,
COUNT(*) AS number_of_employees_with_job
FROM employees
WHERE job_title IS NOT NULL -- Exclude employees with NULL job title
GROUP BY job_title
ORDER BY average_salary_for_job DESC; -- Optional: Sort by average salary
Result: Lists each non-NULL job_title
along with the average salary and employee count for that title.
-- Find the minimum and maximum salary within each department
SELECT
department_id,
MIN(salary) AS min_salary_in_dept,
MAX(salary) AS max_salary_in_dept
FROM employees
WHERE department_id IS NOT NULL
GROUP BY department_id
ORDER BY department_id;
Result: Shows each department_id
with the lowest and highest salary found within that department.
5. Filtering Groups with HAVING
Recall from P6 that the HAVING
clause is used to filter the results of a GROUP BY
query based on the outcome of aggregate functions. While WHERE
filters rows before they are grouped and aggregated, HAVING
filters the groups themselves after aggregation.
Examples:
-- Find departments with more than 2 employees
SELECT
department_id,
COUNT(*) AS num_employees
FROM employees
WHERE department_id IS NOT NULL
GROUP BY department_id
HAVING COUNT(*) > 2 -- Filter groups based on the aggregate COUNT result
ORDER BY department_id;
Result: Only lists department_id
s where the employee count for that group is greater than 2.
-- Find job titles where the average salary is greater than $70,000
SELECT
job_title,
AVG(salary) AS average_salary_for_job
FROM employees
WHERE job_title IS NOT NULL AND salary IS NOT NULL
GROUP BY job_title
HAVING AVG(salary) > 70000 -- Filter groups based on the average salary calculation
ORDER BY average_salary_for_job DESC;
Result: Only lists job titles whose average salary exceeds $70,000.
6. Important Considerations
- NULL Handling:
COUNT(*)
counts all rows.COUNT(column_name)
andCOUNT(DISTINCT column_name)
ignore NULLs in the specified column.SUM()
,AVG()
,MIN()
,MAX()
all ignore NULL values in their calculations. Be mindful ofAVG()
especially, as ignoring NULLs can affect the result compared to treating them as zero (if that were desired, you would need to handle NULLs explicitly, e.g., usingCOALESCE(salary, 0)
- thoughCOALESCE
might be considered part of P9 standard functions).
- Data Types:
SUM()
andAVG()
operate on numeric types (INTEGER
,NUMERIC
,REAL
,DOUBLE PRECISION
, etc.).MIN()
andMAX()
work on any data type that can be ordered (numbers, strings, dates, etc.).COUNT()
works with any data type as it primarily counts rows or non-NULL occurrences.
GROUP BY
Clause Requirement: As reiterated earlier, non-aggregated columns in theSELECT
list must appear in theGROUP BY
clause.