Databases Practical class 9

Using Aggregate Functions

1. Introduction to Aggregate Functions

In database management, we often need to perform calculations not just on individual rows, but across multiple rows to get summary information. For instance, finding the total number of employees, the average salary in a department, or the maximum price of a product.

Aggregate functions are designed precisely for this purpose. They take a set of input values (typically from a column across multiple rows) and return a single summary value.

In SQL, and specifically PostgreSQL, the most common aggregate functions are:

COUNT(): Counts the number of rows or values.
SUM(): Calculates the sum of values in a column.
AVG(): Computes the average of values in a column.
MIN(): Finds the minimum value in a column.
MAX(): Finds the maximum value in a column.

These functions are essential tools for data analysis and reporting directly within the database.

2. Setting Up the Sample Data

For our examples, we will use a simplified employees table. If you don’t have this table, please execute the following SQL commands in the pgAdmin 4 Query Tool to create and populate it.

Using pgAdmin 4 Query Tool:

In the pgAdmin 4 browser tree, navigate to your database.
Right-click on the database name (e.g., postgres) or the public schema.
Select “Query Tool”.
Copy and paste the SQL code below into the Query Editor panel.
Execute the code (click the “Execute/Refresh” button - often a lightning bolt icon, or press F5).

-- Drop table if it exists to start fresh
DROP TABLE IF EXISTS employees;
DROP TABLE IF EXISTS departments;

-- Create Departments Table
CREATE TABLE departments (
    department_id SERIAL PRIMARY KEY,
    department_name VARCHAR(100) NOT NULL UNIQUE
);

-- Create Employees Table
CREATE TABLE employees (
    employee_id SERIAL PRIMARY KEY,
    first_name VARCHAR(50) NOT NULL,
    last_name VARCHAR(50) NOT NULL,
    hire_date DATE NOT NULL,
    job_title VARCHAR(100),
    salary NUMERIC(10, 2),
    department_id INT,
    FOREIGN KEY (department_id) REFERENCES departments(department_id)
);

-- Insert Sample Data into Departments
INSERT INTO departments (department_name) VALUES
('Human Resources'),
('Engineering'),
('Sales'),
('Marketing');

-- Insert Sample Data into Employees
INSERT INTO employees (first_name, last_name, hire_date, job_title, salary, department_id) VALUES
('Alice', 'Smith', '2020-03-15', 'Software Engineer', 75000.00, 2),
('Bob', 'Johnson', '2019-08-01', 'Senior Software Engineer', 95000.00, 2),
('Charlie', 'Williams', '2021-01-10', 'HR Manager', 80000.00, 1),
('Diana', 'Brown', '2022-05-20', 'Sales Representative', 60000.00, 3),
('Ethan', 'Jones', '2021-11-01', 'Marketing Specialist', 65000.00, 4),
('Fiona', 'Garcia', '2020-07-12', 'Software Engineer', 72000.00, 2),
('George', 'Miller', '2022-02-28', 'Sales Manager', 85000.00, 3),
('Hannah', 'Davis', '2023-01-05', 'Junior Developer', 60000.00, 2),
('Ian', 'Rodriguez', '2021-06-15', NULL, 55000.00, 3), -- NULL job title
('Julia', 'Martinez', '2019-12-01', 'HR Assistant', 48000.00, 1),
('Kevin', 'Hernandez', '2023-03-01', 'Intern', 30000.00, NULL); -- NULL department_id

Verify the data by running SELECT * FROM employees; and SELECT * FROM departments; in the Query Tool.

3. Core Aggregate Functions Explained

Let’s explore each function with examples using our employees table. Execute these queries in the pgAdmin 4 Query Tool.

3.1. `COUNT()`

The COUNT() function counts the number of rows or values.

COUNT(*): Counts the total number of rows in the result set (including rows with NULL values).
```
-- Count all employees
SELECT COUNT(*) AS total_employees
FROM employees;
```
Result: This will return the total number of rows inserted into the employees table (which is 11 in our sample data).
COUNT(column_name): Counts the number of rows where the specified column_name is NOT NULL.
```
-- Count employees with a known job title
SELECT COUNT(job_title) AS employees_with_job_title
FROM employees;
```
Result: This counts rows where job_title is not NULL (10 in our sample data, excluding Ian Rodriguez).
```
-- Count employees assigned to a department
SELECT COUNT(department_id) AS employees_in_departments
FROM employees;
```
Result: This counts rows where department_id is not NULL (10 in our sample data, excluding Kevin Hernandez).
COUNT(DISTINCT column_name): Counts the number of unique, non-NULL values in the specified column.
```
-- Count the number of distinct job titles
SELECT COUNT(DISTINCT job_title) AS distinct_job_titles
FROM employees;
```
Result: This returns the count of unique job titles present (e.g., ‘Software Engineer’, ‘Senior Software Engineer’, ‘HR Manager’, etc., excluding NULL). Check the data to verify the count (should be 8).

3.2. `SUM()`

The SUM() function calculates the sum of all non-NULL values in a numeric column. It returns NULL if there are no non-NULL values.

-- Calculate the total salary expenditure
SELECT SUM(salary) AS total_salary_expenditure
FROM employees;

Result: This adds up all the values in the salary column for all employees.

-- Calculate the total salary for the Engineering department (department_id = 2)
-- Note: This uses WHERE, covered in P5.
SELECT SUM(salary) AS engineering_salary_total
FROM employees
WHERE department_id = 2;

Result: This sums the salaries only for employees whose department_id is 2.

3.3. `AVG()`

The AVG() function calculates the average (mean) of all non-NULL values in a numeric column. It ignores NULL values in its calculation.

-- Calculate the average salary of all employees
SELECT AVG(salary) AS average_salary
FROM employees;

Result: This returns the average of all non-NULL salary values. PostgreSQL typically returns a numeric or double precision type for accuracy.

-- Calculate the average salary for 'Software Engineer' job title
SELECT AVG(salary) AS avg_salary_software_engineer
FROM employees
WHERE job_title = 'Software Engineer';

Result: This calculates the average salary specifically for employees with the job title ‘Software Engineer’.

3.4. `MIN()`

The MIN() function finds the minimum (lowest) value in a column among non-NULL values. It works with numeric, string, and date/time types.

-- Find the lowest salary among all employees
SELECT MIN(salary) AS lowest_salary
FROM employees;

Result: Returns the minimum value from the salary column.

-- Find the earliest hire date
SELECT MIN(hire_date) AS earliest_hire_date
FROM employees;

Result: Returns the earliest date from the hire_date column.

3.5. `MAX()`

The MAX() function finds the maximum (highest) value in a column among non-NULL values. It also works with numeric, string, and date/time types.

-- Find the highest salary among all employees
SELECT MAX(salary) AS highest_salary
FROM employees;

Result: Returns the maximum value from the salary column.

-- Find the latest hire date
SELECT MAX(hire_date) AS latest_hire_date
FROM employees;

Result: Returns the most recent date from the hire_date column.

4. Using Aggregate Functions with `GROUP BY`

The true power of aggregate functions is often realized when used in conjunction with the GROUP BY clause (covered in P6). GROUP BY divides the rows of a table into groups based on the values in one or more columns. Aggregate functions then operate independently on each of these groups.

Rule Reminder: When using GROUP BY, any column in the SELECT list that is not an aggregate function must be included in the GROUP BY clause. This is because the database needs to know how to present the non-aggregated values for each group (they must be the same within the group).

Examples:

-- Count the number of employees in each department
SELECT
    department_id,
    COUNT(*) AS num_employees -- Counts all employees within each group
FROM employees
WHERE department_id IS NOT NULL -- Optional: Exclude employees with no department
GROUP BY department_id
ORDER BY department_id; -- Optional: Sort the results

Result: Shows each department_id and the count of employees belonging to it.

-- Calculate the average salary for each job title
SELECT
    job_title,
    AVG(salary) AS average_salary_for_job,
    COUNT(*) AS number_of_employees_with_job
FROM employees
WHERE job_title IS NOT NULL -- Exclude employees with NULL job title
GROUP BY job_title
ORDER BY average_salary_for_job DESC; -- Optional: Sort by average salary

Result: Lists each non-NULL job_title along with the average salary and employee count for that title.

-- Find the minimum and maximum salary within each department
SELECT
    department_id,
    MIN(salary) AS min_salary_in_dept,
    MAX(salary) AS max_salary_in_dept
FROM employees
WHERE department_id IS NOT NULL
GROUP BY department_id
ORDER BY department_id;

Result: Shows each department_id with the lowest and highest salary found within that department.

5. Filtering Groups with `HAVING`

Recall from P6 that the HAVING clause is used to filter the results of a GROUP BY query based on the outcome of aggregate functions. While WHERE filters rows before they are grouped and aggregated, HAVING filters the groups themselves after aggregation.

Examples:

-- Find departments with more than 2 employees
SELECT
    department_id,
    COUNT(*) AS num_employees
FROM employees
WHERE department_id IS NOT NULL
GROUP BY department_id
HAVING COUNT(*) > 2 -- Filter groups based on the aggregate COUNT result
ORDER BY department_id;

Result: Only lists department_ids where the employee count for that group is greater than 2.

-- Find job titles where the average salary is greater than $70,000
SELECT
    job_title,
    AVG(salary) AS average_salary_for_job
FROM employees
WHERE job_title IS NOT NULL AND salary IS NOT NULL
GROUP BY job_title
HAVING AVG(salary) > 70000 -- Filter groups based on the average salary calculation
ORDER BY average_salary_for_job DESC;

Result: Only lists job titles whose average salary exceeds $70,000.

6. Important Considerations

NULL Handling:
- COUNT(*) counts all rows.
- COUNT(column_name) and COUNT(DISTINCT column_name) ignore NULLs in the specified column.
- SUM(), AVG(), MIN(), MAX() all ignore NULL values in their calculations. Be mindful of AVG() especially, as ignoring NULLs can affect the result compared to treating them as zero (if that were desired, you would need to handle NULLs explicitly, e.g., using COALESCE(salary, 0) - though COALESCE might be considered part of P9 standard functions).
Data Types:
- SUM() and AVG() operate on numeric types (INTEGER, NUMERIC, REAL, DOUBLE PRECISION, etc.).
- MIN() and MAX() work on any data type that can be ordered (numbers, strings, dates, etc.).
- COUNT() works with any data type as it primarily counts rows or non-NULL occurrences.
GROUP BY Clause Requirement: As reiterated earlier, non-aggregated columns in the SELECT list must appear in the GROUP BY clause.

Using Aggregate Functions

1. Introduction to Aggregate Functions

2. Setting Up the Sample Data

3. Core Aggregate Functions Explained

3.1. COUNT()

3.2. SUM()

3.3. AVG()

3.4. MIN()

3.5. MAX()

4. Using Aggregate Functions with GROUP BY

5. Filtering Groups with HAVING