Week 9 Lecture: Dataclasses
The Problem with Boilerplate Code
Over the past weeks, we have been writing classes like this:
class Student:
    def __init__(self, name, student_id, gpa):
        self.name = name
        self.student_id = student_id
        self.gpa = gpa
Notice how each attribute name is written three times: once in the parameter list, once as self.name, and once as the assigned value. For three attributes, that is nine repetitions just in __init__. A class with ten attributes would require thirty repetitions — and that does not even include __repr__ or __eq__.
A proper Student class with __init__, __repr__, and __eq__ looks like this:
class Student:
    def __init__(self, name, student_id, gpa):
        self.name = name
        self.student_id = student_id
        self.gpa = gpa

    def __repr__(self):
        return f"Student(name={self.name!r}, student_id={self.student_id!r}, gpa={self.gpa!r})"

    def __eq__(self, other):
        if not isinstance(other, Student):
            return NotImplemented
        return (self.name == other.name and
                self.student_id == other.student_id and
                self.gpa == other.gpa)
Nothing here is surprising — it is all predictable, mechanical code. What if the same class could be written in just five lines?
The dataclasses Module
The dataclasses module lets Python automatically generate boilerplate methods for classes that primarily hold data. To use it, import the dataclass decorator from the dataclasses module, apply it to your class, and define your fields using type annotations:
from dataclasses import dataclass
@dataclass
class Student:
    name: str
    student_id: str
    gpa: float
Although these look like class variables, the @dataclass decorator turns them into instance attributes: the generated __init__ method takes one parameter per field and stores it via self. Each field must include a type annotation; a variable without one is not treated as a field and stays an ordinary class attribute.
This single decorator automatically provides __init__, __repr__, and __eq__:
s1 = Student("Alisher", "2024001", 3.8)
s2 = Student("Alisher", "2024001", 3.8)
print(s1) # Student(name='Alisher', student_id='2024001', gpa=3.8)
print(s1 == s2) # True
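To check what the decorator actually generated, the standard inspect module and dataclasses.fields can be used. The following is a small sketch; the school attribute is a hypothetical addition, included only to show that a variable without an annotation is not picked up as a field.

```python
import inspect
from dataclasses import dataclass, fields


@dataclass
class Student:
    name: str
    student_id: str
    gpa: float
    school = "Example University"  # hypothetical: no annotation, so NOT a field


# The generated __init__ takes exactly the annotated fields, in declaration order.
signature = str(inspect.signature(Student))
print(signature)

# fields() lists the dataclass fields; `school` is absent.
field_names = [f.name for f in fields(Student)]
print(field_names)  # ['name', 'student_id', 'gpa']

# Keyword arguments work too, because the fields become named parameters.
s = Student(name="Alisher", student_id="2024001", gpa=3.8)
print(s.school)  # Example University (shared class attribute)
```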
How @dataclass Works Under the Hood
Recall that a decorator is a function that takes something and returns a modified version of it. The @dataclass decorator takes your class and returns a modified version with auto-generated methods. Since everything in Python is an object — including classes — they can be passed as parameters.
When Python encounters the @dataclass decorator above a class:
- It examines the class body for type-annotated variables.
- It generates an __init__ method using those variable names as parameters.
- It generates __repr__ and __eq__ methods.
- It attaches all generated methods to the class.
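The steps above can be observed by applying the decorator by hand instead of with the @ syntax. This is only an illustrative sketch: the Point class here is a throwaway example, and calling dataclass(Point) directly is equivalent to writing @dataclass above the class.

```python
from dataclasses import dataclass


class Point:
    x: int
    y: int


# Annotations are recorded on the class even before any decorator runs.
print(Point.__annotations__)

# Applying the decorator by hand: the class object is passed in as an argument,
# and the same class object comes back with the generated methods attached.
returned = dataclass(Point)
print(returned is Point)  # True
print(Point(1, 2))        # Point(x=1, y=2)
```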
Here is a simplified implementation that illustrates the process:
def my_dataclass(cls):
    """A simplified version of Python's @dataclass"""
    # STEP 1: Look at the class body for annotated variables
    # __annotations__ is a dictionary Python creates when you annotate variables with types.
    # For example: {'name': <class 'str'>, 'age': <class 'int'>}
    fields = cls.__annotations__

    # STEP 2: Generate an __init__ method
    def new_init(self, *args, **kwargs):
        # Match positional arguments to the fields in declaration order
        for field_name, value in zip(fields.keys(), args):
            setattr(self, field_name, value)
        # Store keyword arguments under their own names
        # (simplified: no default values, no check for missing or unknown names)
        for key, value in kwargs.items():
            setattr(self, key, value)

    # STEP 3: Generate a __repr__ method (for printing)
    def new_repr(self):
        # Creates a string like: Student(name='Alice', age=20)
        field_strings = []
        for field_name in fields.keys():
            value = getattr(self, field_name)
            field_strings.append(f"{field_name}={repr(value)}")
        joined_fields = ", ".join(field_strings)
        return f"{cls.__name__}({joined_fields})"

    # STEP 4: Generate an __eq__ method (for checking ==)
    def new_eq(self, other):
        if not isinstance(other, cls):
            # Let Python try the other operand, as in the hand-written __eq__ earlier
            return NotImplemented
        # Check if all fields are identical
        for field_name in fields.keys():
            if getattr(self, field_name) != getattr(other, field_name):
                return False
        return True

    # STEP 5: Add them to your class
    cls.__init__ = new_init
    cls.__repr__ = new_repr
    cls.__eq__ = new_eq
    # Return the newly modified class!
    return cls
Default Values
Fields can have default values, specified after the type annotation:
from dataclasses import dataclass
@dataclass
class Student:
    name: str
    student_id: str
    gpa: float = 0.0
    year: int = 1
s = Student("Sevara", "2024015")
print(s) # Student(name='Sevara', student_id='2024015', gpa=0.0, year=1)
Important rule: Fields with default values must come after fields without defaults. This is the same rule Python applies to function parameters. If a parameter has a default value, all parameters after it must also have defaults. Otherwise, Python cannot determine which argument maps to which parameter.
@dataclass
class Student:
    name: str = "Unknown" # has default
    student_id: str # no default: ERROR!
This raises a TypeError: non-default argument 'student_id' follows default argument.
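The error is raised when the class is defined, not when an object is created. The sketch below catches it to show this, then gives one possible fix; FixedStudent is a hypothetical name, and the exact wording of the message may differ slightly between Python versions.

```python
from dataclasses import dataclass

try:
    @dataclass
    class Student:
        name: str = "Unknown"   # has a default
        student_id: str         # no default after a field with a default
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)


# Fix: declare required fields first, then the ones with defaults.
@dataclass
class FixedStudent:
    student_id: str
    name: str = "Unknown"


print(FixedStudent("2024001"))  # FixedStudent(student_id='2024001', name='Unknown')
```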
Mutable Default Values and field()
Using a mutable object like a list as a default value is not allowed in dataclasses:
@dataclass
class Student:
    name: str
    grades: list = [] # ValueError!
Python raises ValueError: mutable default <class 'list'> for field grades is not allowed: use default_factory. The decorator actively prevents the shared-mutable-default bug.
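The shared-mutable-default bug is easiest to see in a regular class, where nothing stops you from writing it. PlainStudent is a hypothetical class used only for this demonstration.

```python
class PlainStudent:
    # Bug: the default list is created once, when the function is defined,
    # and then reused for every call that omits `grades`.
    def __init__(self, name, grades=[]):
        self.name = name
        self.grades = grades


a = PlainStudent("Jasur")
b = PlainStudent("Nodira")
a.grades.append(90)

print(b.grades)              # [90]: Nodira "received" Jasur's grade
print(a.grades is b.grades)  # True: both attributes refer to one list
```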
To safely provide a mutable default, import the field function and use default_factory:
from dataclasses import dataclass, field
@dataclass
class Student:
    name: str
    student_id: str
    grades: list = field(default_factory=list)
This tells Python: every time a new Student is created, call list() to produce a fresh empty list. Each object gets its own list — no sharing.
Note that list is passed without parentheses — it is the function itself being passed, not its result. The dataclass will call it later when needed. Writing default_factory=lambda: [] achieves the same result.
Why does list work both as a type and as a callable? In Python, list is both a type (used in isinstance checks) and a callable (used to create new lists). In fact, all types in Python are callable: when you write Student("Alisher", "2024001"), you are calling the Student class as a function.
s1 = Student("Jasur", "2024020")
s2 = Student("Nodira", "2024021")
s1.grades.append(90)
print(s1.grades) # [90]
print(s2.grades) # [] — safe! not shared
Use field(default_factory=...) any time the default value is a list, dict, set, or any other mutable object.
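The same pattern applies to other mutable types. The following sketch uses a hypothetical Course class, with a lambda for a default that is not empty; the field names and values are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class Course:  # hypothetical class for illustration
    title: str
    scores: dict = field(default_factory=dict)   # fresh {} per object
    tags: set = field(default_factory=set)       # fresh set() per object
    weights: list = field(default_factory=lambda: [0.4, 0.6])  # non-empty default


c1 = Course("OOP")
c2 = Course("Calculus")
c1.scores["Alisher"] = 95
c1.weights.append(0.0)

print(c2.scores)   # {}
print(c2.weights)  # [0.4, 0.6]
```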
Adding Methods to a Dataclass
A dataclass is still a regular class. The @dataclass decorator only auto-generates methods like __init__, __repr__, and __eq__ — everything else remains completely normal. You can add instance methods, static methods, or class methods. You can even make a dataclass abstract by inheriting from ABC.
from dataclasses import dataclass, field
@dataclass
class Student:
    name: str
    student_id: str
    gpa: float = 0.0
    grades: list = field(default_factory=list)

    def add_grade(self, grade):
        self.grades.append(grade)
        self.gpa = sum(self.grades) / len(self.grades)

    def is_passing(self):
        return self.gpa >= 60
s = Student("Alisher", "2024001")
s.add_grade(95)
s.add_grade(88)
s.add_grade(92)
print(s)
# Student(name='Alisher', student_id='2024001', gpa=91.666..., grades=[95, 88, 92])
print(s.is_passing()) # True
@dataclass handles the boring parts so you can focus on the interesting ones.
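Class methods and static methods work in a dataclass exactly as in any other class. A minimal sketch follows; the comma-separated row layout and the seven-digit ID rule are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Student:
    name: str
    student_id: str
    gpa: float = 0.0

    @classmethod
    def from_csv_row(cls, row: str) -> "Student":
        # Alternative constructor; assumes a "name,student_id,gpa" layout.
        name, student_id, gpa = row.split(",")
        return cls(name.strip(), student_id.strip(), float(gpa))

    @staticmethod
    def is_valid_id(student_id: str) -> bool:
        # Hypothetical rule: IDs are exactly seven digits.
        return len(student_id) == 7 and student_id.isdigit()


s = Student.from_csv_row("Alisher, 2024001, 3.8")
print(s)                                  # Student(name='Alisher', student_id='2024001', gpa=3.8)
print(Student.is_valid_id(s.student_id))  # True
```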
Frozen (Immutable) Dataclasses
Sometimes an object should not be changed after creation. To make a dataclass immutable, pass frozen=True:
from dataclasses import dataclass
@dataclass(frozen=True)
class Point:
    x: float
    y: float
p = Point(3.0, 4.0)
print(p) # Point(x=3.0, y=4.0)
p.x = 10.0 # FrozenInstanceError!
Setting frozen=True makes the object immutable — no attribute can be changed after creation. Any attempt raises a FrozenInstanceError.
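FrozenInstanceError can be imported from the dataclasses module and caught like any other exception. When a changed value is needed, one common approach is to build a new object with dataclasses.replace, as in this sketch.

```python
from dataclasses import dataclass, replace, FrozenInstanceError


@dataclass(frozen=True)
class Point:
    x: float
    y: float


p = Point(3.0, 4.0)

try:
    p.x = 10.0
except FrozenInstanceError as exc:
    print("Refused:", exc)  # message text may vary between versions

# To "change" a frozen object, build a new one; replace() copies the other fields.
moved = replace(p, x=10.0)
print(p)      # Point(x=3.0, y=4.0)
print(moved)  # Point(x=10.0, y=4.0)
```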
Why is this useful?
- Safety: Some objects should never change, such as geographic coordinates.
- Hashability: Frozen dataclasses can be used as dictionary keys or in sets.
p1 = Point(3.0, 4.0)
p2 = Point(1.0, 2.0)
locations = {p1: "Tashkent", p2: "Samarkand"}
print(locations[p1]) # "Tashkent"
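A regular (non-frozen) dataclass cannot be used as a dictionary key. Because @dataclass generates __eq__, it also sets __hash__ to None unless the class is frozen, so calling hash() on an instance raises a TypeError. (A plain class that defines neither __eq__ nor __hash__ is hashable, but only by identity: two objects with equal contents count as different keys.)
Hashability means Python can compute a fixed number (a hash) from the object. This hash is used for fast lookups in dict and set. If an object's fields could change while it is stored in a dict, its hash would change too, and the lookup would break. For this reason the decorator generates a field-based __hash__ only when you promise immutability: with frozen=True (and the default eq=True), the fields cannot be reassigned, so the hash stays stable and the object can safely serve as a key.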
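A quick way to see which objects Python will hash is to call hash() on each kind. The three class names in this sketch are hypothetical and exist only for the comparison.

```python
from dataclasses import dataclass


class PlainPoint:            # regular class, no __eq__ defined
    def __init__(self, x, y):
        self.x, self.y = x, y


@dataclass
class MutablePoint:          # regular dataclass: __eq__ generated, __hash__ set to None
    x: float
    y: float


@dataclass(frozen=True)
class FrozenPoint:           # frozen dataclass: __hash__ generated from the fields
    x: float
    y: float


print(isinstance(hash(PlainPoint(1, 2)), int))  # True: hash based on identity

try:
    hash(MutablePoint(1.0, 2.0))
except TypeError as exc:
    print("TypeError:", exc)

# Equal frozen objects hash equally, so either one finds the same dictionary entry.
locations = {FrozenPoint(3.0, 4.0): "Tashkent"}
print(locations[FrozenPoint(3.0, 4.0)])  # Tashkent
```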
__post_init__: Running Code After Initialization
Sometimes the auto-generated __init__ is not enough — you may need to compute a value from other fields. Consider a Rectangle with width and height provided by the user, and an area that should be calculated automatically.
The problem: every type-annotated variable becomes a parameter in __init__. Writing area: float would make Python expect area as an argument. To exclude it from the constructor, use field(init=False):
from dataclasses import dataclass, field
@dataclass
class Rectangle:
    width: float
    height: float
    area: float = field(init=False) # not a constructor parameter
Now area will not appear in __init__, so you create a rectangle with just Rectangle(5.0, 3.0). But area still needs a value — this is where __post_init__ comes in. It is a special method that runs immediately after the auto-generated __init__ finishes:
    def __post_init__(self):
        self.area = self.width * self.height
The sequence of events when you write Rectangle(5.0, 3.0):
- Python calls the auto-generated __init__, which sets width and height.
- At the end of __init__, Python automatically calls __post_init__.
- Inside __post_init__, area is calculated.
r = Rectangle(5.0, 3.0)
print(r) # Rectangle(width=5.0, height=3.0, area=15.0)
print(r.area) # 15.0
Note that area appears in the __repr__ output. Normally, __repr__ shows how to recreate the object, but trying Rectangle(5.0, 3.0, 15.0) would raise an error since area is not a constructor parameter. If you want to be strictly accurate, you can hide it with repr=False:
area: float = field(init=False, repr=False)
Why declare area as a field instead of just assigning it in __post_init__? You could simply write self.area = self.width * self.height in __post_init__ without declaring it as a field. However, if area is not declared as a field, the dataclass does not know about it — it will not appear in __repr__ and will not be used in __eq__ comparisons. Using field(init=False) makes area a proper dataclass field, included in both __repr__ and __eq__.
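The difference can be seen by putting the two approaches side by side. Rectangle is the complete class assembled from the pieces above; LooseRectangle is a hypothetical variant that assigns area without declaring it.

```python
from dataclasses import dataclass, field


@dataclass
class Rectangle:
    width: float
    height: float
    area: float = field(init=False)   # declared field, filled in after __init__

    def __post_init__(self):
        self.area = self.width * self.height


@dataclass
class LooseRectangle:                  # hypothetical variant: area is NOT declared
    width: float
    height: float

    def __post_init__(self):
        self.area = self.width * self.height


print(Rectangle(5.0, 3.0))       # Rectangle(width=5.0, height=3.0, area=15.0)
print(LooseRectangle(5.0, 3.0))  # LooseRectangle(width=5.0, height=3.0)

# The generated __eq__ ignores the undeclared attribute, so a change goes unnoticed.
a = LooseRectangle(5.0, 3.0)
b = LooseRectangle(5.0, 3.0)
b.area = 999.0
print(a == b)  # True
```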
Ordering with order=True
The @dataclass decorator auto-generates __eq__, but not comparison methods (__lt__, __gt__, __le__, __ge__). To get those as well, use order=True:
from dataclasses import dataclass
@dataclass(order=True)
class Student:
    gpa: float
    name: str
    student_id: str
Ordering is the ability to compare two objects. Think of five students standing in no particular arrangement — there is no order. Now ask them to line up from lowest GPA to highest, and you have an order that allows comparison.
How the comparison methods work: The generated methods put all attributes into a tuple in the order they are defined, and compare the tuples. Recall how tuple comparison works in Python:
- (3.8, "Alisher") > (3.5, "Sevara") is True because 3.8 > 3.5.
- (3.8, "Alisher") > (3.8, "Zafar") is False because the first elements are equal, so Python compares the second: "A" < "Z".
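These tuple comparisons can be run directly, and sorted() applies the same element-by-element rule:

```python
print((3.8, "Alisher") > (3.5, "Sevara"))  # True: decided by the first elements
print((3.8, "Alisher") > (3.8, "Zafar"))   # False: tie on 3.8, then "Alisher" < "Zafar"
print((3.8, "Alisher") < (3.8, "Zafar"))   # True

# sorted() on tuples compares element by element in the same way.
pairs = [(3.8, "Zafar"), (3.5, "Sevara"), (3.8, "Alisher")]
ordered = sorted(pairs)
print(ordered)  # [(3.5, 'Sevara'), (3.8, 'Alisher'), (3.8, 'Zafar')]
```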
The generated __lt__ method behaves roughly like this (the real one checks that both objects have exactly the same class):
def __lt__(self, other):
    if not isinstance(other, Student):
        return NotImplemented
    # It compares them as if they were tuples of their fields
    return (self.gpa, self.name, self.student_id) < (other.gpa, other.name, other.student_id)
Field order matters. The first field becomes the primary sort key. If the first fields are equal, comparison moves to the second field, and so on.
from dataclasses import dataclass
@dataclass(order=True)
class Student:
    gpa: float # Primary sort key
    name: str # Secondary (tie-breaker)
    student_id: str # Final tie-breaker
s1 = Student(3.8, "Alisher", "2024001")
s2 = Student(3.5, "Sevara", "2024015")
print(s1 > s2) # True (3.8 > 3.5)
students = [s1, s2, Student(3.8, "Zafar", "2024020")]
print(sorted(students))
# Sorted by gpa first, then name.
# "Alisher" (3.8) will come before "Zafar" (3.8).
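To make the resulting order concrete, the sketch below prints only the names after sorting, and contrasts this with a dataclass that lacks order=True. Unordered is a hypothetical class added for the contrast.

```python
from dataclasses import dataclass


@dataclass(order=True)
class Student:
    gpa: float
    name: str
    student_id: str


students = [
    Student(3.8, "Zafar", "2024020"),
    Student(3.5, "Sevara", "2024015"),
    Student(3.8, "Alisher", "2024001"),
]

ascending = [s.name for s in sorted(students)]
descending = [s.name for s in sorted(students, reverse=True)]
print(ascending)           # ['Sevara', 'Alisher', 'Zafar']
print(descending)          # ['Zafar', 'Alisher', 'Sevara']
print(max(students).name)  # Zafar


# Without order=True, "<" is not defined and sorting fails.
@dataclass
class Unordered:  # hypothetical class for contrast
    gpa: float


try:
    sorted([Unordered(3.8), Unordered(3.5)])
except TypeError as exc:
    print("TypeError:", exc)
```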
When to Use Dataclasses
- Use @dataclass when your class is primarily about storing data and you are comfortable with auto-generated magic methods.
- Use a regular class when your class is primarily about behavior or you need fine-grained control over its methods.
Type Hints for Collections
Beyond simple types, Python supports type hints for collections and nested structures:
names: list[str] = ["Alisher", "Sevara", "Jasur"]
scores: dict[str, int] = {"Alisher": 95, "Sevara": 88}
coordinates: tuple[float, float] = (41.2995, 69.2401)
unique_ids: set[int] = {101, 102, 103}
For more complex nested data:
all_grades: list[list[int]] = [[90, 85, 88], [76, 92], [100, 95, 89, 91]]
student_grades: dict[str, list[int]] = {
    "Alisher": [90, 85, 88],
    "Sevara": [76, 92, 100],
}
leaderboard: list[tuple[str, int]] = [("Alisher", 95), ("Sevara", 88), ("Jasur", 72)]
course_results: dict[str, dict[str, int]] = {
    "OOP": {"Alisher": 90, "Sevara": 85},
    "Calculus": {"Alisher": 78, "Jasur": 92},
}
Optional and Union Types
When a value might be None, use the union syntax:
def get_phone(student_id: str) -> str | None:
    ...
When a value can be one of several types:
def find_student(identifier: str | int):
    ...
Combining Dataclasses with Type Hints
Here is a complete example bringing everything together:
from dataclasses import dataclass, field
@dataclass
class Student:
    name: str
    student_id: str
    gpa: float = 0.0
    grades: list[int] = field(default_factory=list)
    email: str | None = None
    courses: dict[str, str] = field(default_factory=dict)
s = Student(
    name="Nodira",
    student_id="2024030",
    gpa=3.7,
    grades=[90, 85, 92],
    email="nodira@alkhu.uz",
    courses={"CS201": "OOP", "MATH101": "Calculus"}
)
Forward References
Consider a class that references itself:
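@dataclass
class Employee:
    name: str
    manager: Employee # ERROR!
When Python evaluates manager: Employee, the name Employee is not yet bound, because the class body is still being executed, so it raises a NameError. (This is the behavior in Python 3.13 and earlier, where annotations are evaluated immediately; Python 3.14 defers their evaluation.) The fix that works in every version is to use a string instead: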
@dataclass
class Employee:
    name: str
    manager: 'Employee' # OK: forward reference
By using a string, you tell Python: “do not resolve this name now; check it later.” This is called a forward reference.
Forward references are needed in two situations:
- Self-reference: A class refers to itself (e.g., an employee whose manager is also an employee).
- Cross-reference: A class refers to another class defined later in the file.
@dataclass
class Gradebook:
    students: list['Student'] # Student is defined below
@dataclass
class Student:
    name: str
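A self-referencing dataclass can be used like any other once it is defined. The sketch below builds a short reporting chain; the employee names are invented, and allowing None for manager is an assumption of this example so that the top of the chain can have no manager.

```python
from dataclasses import dataclass, fields


@dataclass
class Employee:
    name: str
    # Quoted because Employee does not exist yet while its own body runs.
    # Allowing None (an assumption of this sketch) covers the top of the chain.
    manager: "Employee | None" = None


director = Employee("Dilnoza")
lead = Employee("Rustam", manager=director)
engineer = Employee("Kamila", manager=lead)

# Walk up the reporting chain.
chain = []
current = engineer
while current is not None:
    chain.append(current.name)
    current = current.manager
print(chain)  # ['Kamila', 'Rustam', 'Dilnoza']

# The annotation is stored as plain text; Python did not evaluate it.
print(fields(Employee)[1].type)  # Employee | None
```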
Type-Hinting Callable Objects
Functions passed as arguments can also be type-hinted using Callable from the typing module:
from typing import Callable
def apply_operation(x: int, y: int, operation: Callable[[int, int], int]) -> int:
    return operation(x, y)
def add(a: int, b: int) -> int:
    return a + b
result = apply_operation(5, 3, add) # 8
Callable[[int, int], int] means a function that takes two int parameters and returns an int. The first part (inside the inner brackets) specifies the parameter types, and the second part specifies the return type.
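Any callable with a matching shape satisfies the hint, not only functions defined with def. The sketch below stores several in a dictionary; BinaryOp is a hypothetical alias name introduced here to keep the signatures short.

```python
from typing import Callable

# A type alias keeps long signatures readable (the alias name is hypothetical).
BinaryOp = Callable[[int, int], int]


def apply_operation(x: int, y: int, operation: BinaryOp) -> int:
    return operation(x, y)


def add(a: int, b: int) -> int:
    return a + b


# Named functions, lambdas, and built-ins can all be passed.
operations: dict[str, BinaryOp] = {
    "add": add,
    "multiply": lambda a, b: a * b,
    "power": pow,
}

results = {name: apply_operation(5, 3, op) for name, op in operations.items()}
print(results)  # {'add': 8, 'multiply': 15, 'power': 125}
```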