Python dataclasses: A revolution

Python data classes are a new feature in Python 3.7, which is currently in Beta 4 and scheduled for final release in June 2018.  However, a simple pip install (pip install dataclasses) brings a backport to Python 3.6.  The name dataclass suggests that this feature is specifically for classes with data and no methods, but in reality, just as with the Kotlin data class, the Python dataclass is useful for all classes.

I suggest that the introduction of the dataclass will transform the Python language, and in fact signal a more significant change than the move from Python 2 to Python 3.

The Problems Addressed By Dataclasses

There are two negatives with Python 3.6 classes.

‘hidden’ Class Instance Variables

The style rules for Python suggest that all instance variables should be initialised (assigned to some value) in the class __init__() method.  This at least allows scanning __init__ to reverse engineer a list of class instance variables. Surely an explicit declaration of the instance variables is preferable?

A tool such as PyCharm scans the class __init__() method to find all assignments, and then any reference to an object variable that was not found in the __init__() method is flagged as an error.

However, having to discover the instance variables for a class by scanning for assignments in the __init__() method is a poor substitute for scanning a more explicit declaration.  The body of the __init__() method essentially becomes part of the class statement.

The dataclass provides for much cleaner class declarations.

Class Overhead for Simple Classes

Creating basic classes in Python is too tedious. One solution is for programmers to use dictionaries as classes – but this is bad programming practice. Other solutions include namedtuples, the Struct class from the ObjDict package or the attrs package. With this number of different solutions, it is clear that people are looking for a solution.

The dataclass provides a cleaner, and arguably more powerful, syntax than any of those alternatives, and meets the stated Python goal of there being one obvious way to do it.

Dataclass Syntax Example

Below is an arbitrary class to implement an XY coordinate and provide addition and a __repr__ to allow simple printing.

class XYPoint:

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __add__(self, other):
        new = XYPoint(self.x, self.y)
        new.x += other.x
        new.y += other.y
        return new

    def __repr__(self):
        return f"XYPoint(x={self.x},y={self.y})"

Now the same functionality using a data class:

from dataclasses import dataclass

@dataclass
class XYPoint:
    x: float
    y: float

    def __add__(self, other):
        new = XYPoint(self.x, self.y)
        new.x += other.x
        new.y += other.y
        return new

The dataclass is automatically provided with an __init__() method and a __repr__() method.

The class declaration now has the instance variables declared at the top of the class more explicitly.

The types of ‘x’ and ‘y’ are declared as float above. To exactly match the previous example they should be of type Any, but float is more precise and more clearly illustrates that an annotation is usually a type.

The dataclass merely requires the class variables to be annotated using variable annotation.  ‘Any’ provides a type completely open to any type, as is traditional with Python.  In fact, the type is currently just for documentation and is not checked at run time, so you could state ‘str’ and supply an ‘int’, and no warning or error is raised.
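As a minimal sketch of that last point, the snippet below declares a field as str and then supplies an int. The class name Label and its field are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Label:
    text: str  # annotated as str, but nothing enforces this at run time


# Supplying an int where str is declared raises no warning or error.
label = Label(42)
print(label)             # Label(text=42)
print(type(label.text))  # <class 'int'>
```

An external type checker may well flag the call, but the dataclass itself stores whatever value it is given.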

As the example shows, dataclass does not mean that the class is a data-only class, but rather that the class contains some data, which almost all classes do. There are code savings for simple classes, mostly around __init__, __repr__ and other simple operations on the data, but the cleaner declaration syntax could be considered the main benefit, and it is useful for any class.

When to Use Dataclasses

The Candidates

The most significant candidates are any Python class and any dictionary used in place of a class.

Other examples are namedtuples and Struct classes from the ObjDict package.

Code using the attrs package can migrate to the more straightforward dataclass which has improved syntax at the expense of losing attrs compatibility with older Python versions.
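To make the dictionary case concrete, here is a small sketch of the same record held first as a dictionary and then as a dataclass. The names point_as_dict and Point are hypothetical.

```python
from dataclasses import dataclass

# A dictionary standing in for a class: the "fields" are only implied by usage.
point_as_dict = {"x": 1.0, "y": 2.0}


@dataclass
class Point:
    x: float
    y: float


# The same data as a dataclass: the fields are declared up front.
point = Point(**point_as_dict)
print(point)    # Point(x=1.0, y=2.0)
print(point.x)  # 1.0
```

Attribute access replaces string keys, so a misspelt field name fails immediately instead of silently creating a new key.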

Performance Considerations

There is no significant change to performance by using a dataclass in place of a regular Python class. A namedtuple could be slightly faster for some uses of immutable, ‘method free’ objects, but for all practical purposes, a dataclass introduces no performance overhead, and although there may be a reduction in code, this is insignificant. This video shows the results of actual performance tests.

Compatibility Limitations?

The primary compatibility constraint is that dataclasses require Python 3.6 or higher. With Python 3.6 having been released in 2016, most potential deployments are well supported, leaving the main restriction that dataclasses are not available in Python 2.

The only other compatibility limitation applies to classes with existing type annotated class variables.

Any class which can limit support to Python 3.6+, and does not have type annotated class variables, can add the dataclass decorator without compatibility problems.
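The annotated class variable limitation can be sketched as follows. The class Counter and its attributes are hypothetical; the point is that an annotated class variable is picked up as a field unless it is marked with typing.ClassVar.

```python
from dataclasses import dataclass, fields
from typing import ClassVar


@dataclass
class Counter:
    total: int = 0             # annotated: becomes a field and an __init__ parameter
    limit: ClassVar[int] = 10  # ClassVar: remains an ordinary class variable


print([f.name for f in fields(Counter)])  # ['total']
print(Counter(5))                         # Counter(total=5)
print(Counter.limit)                      # 10
```

So an existing class whose annotated class variables should stay class variables would need those annotations wrapped in ClassVar before the decorator is added.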

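Just adding the dataclass decorator rarely breaks anything (the one behaviour to check is that the generated __eq__ replaces the default identity comparison unless the class defines its own __eq__ or eq=False is passed), but without also adding data fields it does not bring significant new functionality either. This compatibility does mean that data fields can be added incrementally as desired, with no separate step needed to ensure compatibility.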

New capabilities deliver nothing until code makes use of them. Unless the class is part of a library that is simply being brought up to date, conversion to dataclasses makes most sense either when refactoring for readability or when code will use one or more of the functions made available by converting to a dataclass.

The ‘free’ Functionality

In addition to the clean syntax, the features provided automatically to dataclasses are:

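  • Methods generated automatically if not already defined
    • __init__ method code to save parameters
    • __repr__ to allow quick display of class data, e.g. for informative debugging
    • __eq__, plus the ordering methods (__lt__, __le__, __gt__, __ge__) when order=True is passed to the decorator
    • __hash__ allowing an instance to function as a dictionary key (generated when the class is declared with frozen=True, or forced with unsafe_hash=True)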
  • Helper functions (see PEP for details)
    • fields() returns a tuple of the fields of the dataclass
    • asdict() returns a dictionary of the class data fields
    • astuple() returns a tuple of the dataclass fields
    • make_dataclass() as a factory method
    • replace() to generate a modified clone of a dataclass
    • is_dataclass() to test whether a class or an instance is a dataclass
  • New Standardized Metadata
    • more information in a standard form for new methods and classes
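The helper functions listed above can be tried out with a short sketch like the one below. It reuses the simple XYPoint from earlier; Point3D is a hypothetical name used only to show make_dataclass().

```python
from dataclasses import (
    asdict, astuple, dataclass, fields, is_dataclass, make_dataclass, replace,
)


@dataclass
class XYPoint:
    x: float
    y: float


p = XYPoint(1.0, 2.0)

print(p == XYPoint(1.0, 2.0))       # True: the generated __eq__ compares field values
print([f.name for f in fields(p)])  # ['x', 'y']
print(asdict(p))                    # {'x': 1.0, 'y': 2.0}
print(astuple(p))                   # (1.0, 2.0)
print(replace(p, y=5.0))            # XYPoint(x=1.0, y=5.0)
print(is_dataclass(p))              # True

# make_dataclass() builds an equivalent class from a list of (name, type) pairs.
Point3D = make_dataclass("Point3D", [("x", float), ("y", float), ("z", float)])
print(Point3D(1.0, 2.0, 3.0))       # Point3D(x=1.0, y=2.0, z=3.0)
```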

Dataclass Full Syntax & Implementation

How Dataclasses Work

A dataclass is created by applying the dataclass decorator to a class. This decorator inspects the class, generates relevant metadata, then adds the required methods to the class.

The first step is to scan the class __annotations__ data. The __annotations__ data has an entry for each class level variable provided with an annotation.  Since variable annotations only appeared in Python 3.6, and class level variables are not common, there is no significant amount of legacy code with annotated class level variables.

The decorator then checks whether any of these annotated class level variables has been assigned a value created by field(). Such values can carry additional data for building the metadata, which is stored in the class attributes __dataclass_fields__ and __dataclass_params__. Once this metadata is built, the standard methods are added if they are not already present in the class. Note that while an existing __init__() method blocks the very desirable, boilerplate-removing automatic __init__ method, simply renaming __init__ to __post_init__ allows retaining any code desired in an __init__ while still removing the distracting boilerplate.

This process means that any class level variables that are not annotated are ignored by the dataclass decorator and not impacted by the move to a data class.
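A quick way to see this process at work is to inspect the class after decoration. In this sketch the un-annotated class variable units is a hypothetical addition to the earlier XYPoint.

```python
from dataclasses import dataclass


@dataclass
class XYPoint:
    units = "mm"  # no annotation: ignored by the decorator, stays a class variable
    x: float
    y: float = 0.0


# Only the annotated names appear in __annotations__ and in the field metadata.
print(list(XYPoint.__annotations__))       # ['x', 'y']
print(list(XYPoint.__dataclass_fields__))  # ['x', 'y']
print(XYPoint.units)                       # mm
print(XYPoint(1.0))                        # XYPoint(x=1.0, y=0.0)
```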

Converting a Class

Consider the previous example, which was very simple. Real classes have default __init__ parameters, instance variables that are not passed to __init__, and code that will not be replaced with the automatic __init__. Here is a slightly more complicated contrived example to cover those complications with a straightforward use case.

This example adds a class level variable, last_serial_no, just to have an example of a working, class level variable, which allows a counter of each instance of the class.

Also added is serial_no which holds a serial number for each instance of the class.  Although it makes more sense to always increment the serial number by 1, an optional __init__ parameter allows incrementing by another value, showing how to deal with __init__ parameters which cannot be processed by the default __init__ method.

class XYPoint:

    last_serial_no = 0

    def __init__(self, x, y=0, skip=1):
        self.x = x
        self.y = y
        # Advance the class level counter by skip and record it on the instance.
        self.serial_no = self.__class__.last_serial_no + skip
        self.__class__.last_serial_no = self.serial_no

    def __add__(self, other):
        new = XYPoint(self.x, self.y)
        new.x += other.x
        new.y += other.y
        return new

    def __repr__(self):
        return f"XYPoint(x={self.x},y={self.y})"

Now the same functionality using a dataclass.

from dataclasses import dataclass, field, InitVar

@dataclass
class XYPoint:
    last_serial_no = 0
    x: float
    y: float = 0
    skip: InitVar[int] = 1
    serial_no: int = field(init=False)

    def __post_init__(self, skip):
        # skip is an InitVar: it is passed in here rather than stored on the instance.
        self.serial_no = self.last_serial_no + skip
        self.__class__.last_serial_no = self.serial_no

    def __add__(self, other):
        new = XYPoint(self.x, self.y)
        new.x += other.x
        new.y += other.y
        return new

The class level variable without annotation needs no change. The __init__ parameter that is not also an instance variable has the InitVar type wrapper. This ensures it is passed through to __post_init__  which provides all __init__ logic that is not automatic.

The serial number is an instance variable, or field, that is not an __init__ parameter. To change the default settings for a field, assign the result of a field() call to it; the call can still supply a default value through its default parameter.
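The field() options mentioned here can be sketched as below. The class Track and its fields are hypothetical and only illustrate the default, default_factory, init and repr parameters of field().

```python
from dataclasses import dataclass, field


@dataclass
class Track:
    name: str
    points: list = field(default_factory=list)       # a fresh list for each instance
    visible: bool = field(default=True, repr=False)  # has a default, hidden from __repr__
    length: float = field(init=False, default=0.0)   # not an __init__ parameter


track = Track("morning run")
print(track)  # Track(name='morning run', points=[], length=0.0)
```

Using default_factory rather than a literal [] matters because a mutable default would otherwise be shared between instances; the dataclass decorator rejects a plain list default for that reason.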

I think this example covers every realistic use requirement to convert any existing class.

Types and Python

Dataclasses are based on usage of annotations. As noted in annotations, there is no requirement that annotations be types. The reason for providing annotations was primarily driven by the need to allow for third-party type hinting.

Dataclasses do give the first use of annotations (and by implication, potentially types) in the Python standard libraries.

Annotating with None or with plain strings is possible. There are many in the Python community adamant that types will never be required, nor become the convention. I do see optional types slowly creeping in though.

Issues and Considerations

It is possible there are some issues with existing classes which use both class level variables and instance variables, but none have been found so far, which leaves this section as mostly ‘to be added’ (check back).

Conclusion

There is a strong case that all classes, as well as namedtuples, and even other data not currently implemented as a class, together with some other constructs better implemented as classes, should move to dataclasses. For small classes, and all classes start as small classes, there is the advantage of saving some boilerplate code. Reducing boilerplate makes the code easier to maintain, and ultimately more readable.

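Ultimately, the main benefit is that any class written as a dataclass is more readable and maintainable than the same class written without dataclasses. Converting an existing class can be as simple as adding the decorator, declaring the fields and renaming __init__ to __post_init__.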
