Data Classes: The end of ‘Faux Objects’?

This page introduces the concept of ‘Faux Objects’, the use of a collection to avoid the need to declare a class to represent data, and data classes, the solution that removes the barrier to declaring a class for data that is really an object.

Using an already existing collection type to hold data, in place of creating a class, avoids the need to:

  • write a class declaration in code
  • define a ‘constructor’ (the __init__() method in Python)
  • create a conversion to string or a __repr__
  • create comparison operators

Lists, tuples, dicts etc have everything in place already, so why not simply use one of these as a ‘readymade’ solution?

The answer is that a collection is designed to hold like things. A collection is about a plurality of information that is effectively the same type of data. Objects are about different pieces of data which each have their own unique role: each piece has its own discrete meaning and special uses. Using a collection (a list, dict, tuple, etc.) to hold such data obscures the roles of its specific components.

The contention of this page is that, from the outset with Kotlin, and with Python from version 3.7 (the dataclasses module is also available as a backport package for 3.6), the availability of data classes automatically provides all those standard functions of lists or dictionaries/maps, and reduces the steps needed to create a class to just one. It is time to stop coding classes as collections and, where possible, to refactor old code guilty of this crime against readable code.

‘faux objects’: When is a collection a faux object?

All collections, dictionaries, lists, tuples, named tuples etc, have indexes. With dictionaries/maps the indexes are called ‘keys’, but appear in [] and behave as the equivalent of an index in that they are used to retrieve items from the collection.

A faux object is a collection where the index values determine the nature of the data. Program code is then designed around specific index values.

A genuine collection is a collection where the value of an index has no impact on the program code.

Consider some data read from a file to describe some students at an online university.  Each line of the file has ‘first name’, ‘last name’, age, and city.

So two lines of the file might be:

bill, smith, 23, new york
tom, jones, 21, san francisco

Consider this data read into a program as a list of lines, with each line in turn split using ‘,’ to form a list of values for that line.

This could be considered a collection of ‘students’, with each student element being a list of ‘attributes’: a collection (a list) in which each element is also a collection (again a list).

Now consider a function to print the full name of a student. This function would need to work specifically with field[0] and field[1] of a student collection, creating behaviour specific to those indexes, as no other index plays the same role. A function such as this, which assigns specific significance to fields 0 and 1, has made each student collection a ‘faux object’, because each item has its own significance. The data is no longer a collection of ‘like’ fields, but of data specific in nature.
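The list-of-lists arrangement just described can be sketched as follows. This is only an illustration: the sample lines stand in for the file contents, and the function name full_name is an assumption chosen for this example.

```python
# Hypothetical sample data, in the same shape as the file described above.
lines = [
    "bill, smith, 23, new york",
    "tom, jones, 21, san francisco",
]

# A genuine collection: a list of students. Each student, however, is a
# list of fields whose meaning depends entirely on position.
students = [[field.strip() for field in line.split(",")] for line in lines]


def full_name(student):
    # Indexes 0 and 1 are hard-wired here: the code depends on what those
    # positions mean, which is what makes the list a faux object.
    return f"{student[0]} {student[1]}"


# Looping over the true collection needs no literal index at all.
for student in students:
    print(full_name(student))
```

Nothing in the code states that position 0 is a first name; that knowledge lives only in the head of the author and in the literal indexes.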

The Dual Nature of Data: Object and Collection.

Consider the same example of ‘students’ as above, only this time, for variation, each line is processed with: dict(zip(("first", "last", "age", "city"), line.split(",")))

This would give a dictionary as the collection for each person. Now consider two functions to process a student: printout and is_adult.  Now we have:

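students = []
keys = ("first", "last", "age", "city")
with open("data") as lines:
    for line in lines:
        # split each line on commas and trim stray whitespace around values
        values = [value.strip() for value in line.split(",")]
        students.append(dict(zip(keys, values)))


def printout(student):
    # generic: every key/value pair is treated alike
    text = ""
    for k, v in student.items():
        text += f"{k}: {v}\n"
    print(text)


def is_adult(student) -> bool:
    # specific: depends on the meaning of the "age" entry
    return int(student["age"]) >= 18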

The printout() function treats the ‘student’ data purely as a collection: from the perspective of printout(), all the data is of the same nature. No item in the collection is treated as having any special significance, and it makes no difference to printout() which keys, or which values for those keys, are present. For the purposes of printout() the data is purely a collection of values with keys. A genuine collection.

Conversely, the is_adult() function requires a specific entry, ‘age’, and requires that ‘age’ has a numeric value. The function is predicated on the special meaning of a specific entry in the ‘collection’, which is not treating the data as simply a collection of values. Therefore, is_adult() is treating the data as an object. As soon as code requires specific keys to be present, the data is no longer purely a collection of values, and storing the data as a collection results in a ‘faux object’: data which represents an object, but is stored as a collection.

This is because is_adult() is really a method of the ‘student’ object, while printout() is a generic method that could apply to any object. Methods which operate independently of the object see it simply as a collection of fields, while methods of the object itself are aware of what each field represents.

However, traditionally, declaring a class for the object loses the ability to also process the data as a collection. The existence of functions which establish that the data is an object does not preclude there also being useful functions which can ignore the specific nature of the items of data. Ideally every function specific to that data will respect the special significance of the individual fields, but to generic functions the special significance of specific fields is of no consequence.

A printout that combines first and last name, and perhaps puts ‘from’ with city may be an improvement on printout that simply lists each field name and value, but if the printout is only needed for debugging, then why bother? A method common to all objects is fine.

So even data that is, for many purposes, a specific type of object can also have functionality that is generic in nature. Functions which can view the data as generic data are functions that are common to all objects, or at least to a wide range of objects, and are similar to the functions available for collections, as they treat the objects as collections of fields. So why not make the generic functions normally available for collections also available for objects? Enter data classes: all, or at least all the relevant, generic functions of a collection, but still with the specific data definitions of an object. For more complex objects these generic functions tend to be of little value, and therefore data classes allow each generic function to be overridden with a class-specific implementation. The generic functions provide a ‘starter kit’. In some cases the ‘starter’ versions of some or all of these functions are adequate, and in these cases boilerplate code is reduced.
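A minimal sketch of that ‘starter kit’ idea in Python follows. The four field names are taken from the student example above; the choice to override __str__ while keeping the generated __repr__ and __eq__ is an assumption made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Student:
    first: str
    last: str
    age: int
    city: str

    # A method of the object: it relies on the meaning of one specific field.
    def is_adult(self) -> bool:
        return self.age >= 18

    # Optional: a class-specific version of one generic function.
    def __str__(self) -> str:
        return f"{self.first} {self.last} from {self.city}"


bill = Student("bill", "smith", 23, "new york")

print(repr(bill))  # generated __repr__ lists every field name and value
print(bill)        # the class-specific __str__ defined above
print(bill == Student("bill", "smith", 23, "new york"))  # generated __eq__
print(bill.is_adult())
```

The generated functions treat the object as a collection of fields, while is_adult() and __str__ know what the individual fields mean.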

Spotting ‘Faux Objects’ vs True Collections

Naming: Singular vs Plural

The concept of a collection is a plurality of items. A collection of items is a plural of something. The file can be considered a collection of lines. Converting those lines to students gives a collection of students, where each entry is a student.

Each student can be considered a collection of fields in the case of the printout method, where all fields are treated in the same manner. But the is_adult() method is based on ‘age’ being not just another field, but one with specific significance. Considering the details as first name, last name, age and city does not suggest any ‘plurality’. The only plural name available would be something like ‘group_of_fields’, and only if the data were considered to be just fields. An entry in the collection ‘students’ would naturally be called a ‘student’, not a ‘group_of_fields’, indicating that the main idea is to treat the student data not as a collection (which would be plural) but as an object (which is singular).

Predetermined Keys vs Field Names.

If all the ‘keys’ in a dictionary/map, or all the entries in a list, can be identified and named in advance, then they are not keys or entries in a list. They are field names. Fields can be optional, and therefore not present in some cases, but simply being able to name all possible entries in a collection generally indicates that it is not really a collection. An example would be a dictionary that can be defined as having the elements “first name”, “last name”, “age” and “city”. This is clearly not a collection of like data: each element has a specific usage.

Faux Objects: Literals as Indexes

Another strong indicator as to a collection being in fact a ‘faux object’, is the use of literals as indexes or keys. In a true collection, the index will generally be from a variable, because the item to work with will be decided by input, or by iterating through a loop, or by some other data in the program as no entry in a collection is inherently different to any other.  Literals for indexes/keys will normally only be found in test routines: “does the second one have the value it should?”.

Note that the printout() function, which does treat the ‘student’ collection as a genuine collection, makes no use of literal keys.

Contrast this with the is_adult() function, which uses the key “age” as a literal value. Faux objects built using collections make use of literal values for indexes/keys, because the code for a ‘student’ makes special use of each attribute of a student collection. By contrast, with the ‘students’ collection, most code would either loop through all students or, where a specific student is to be displayed or edited, take the choice of student from login data or from a selection made by data entry. Code ‘hard wired’ to work specifically with students[2] would be suspicious, while code that works specifically with the age field, which happens to be field[2] within a student record, is to be expected.

Why Faux Objects Should Be Eliminated

The problems are:

Documentation: What the program is doing is simply less clear. Using a collection means there is no declaration of the class, and that declaration in code provides irreplaceable documentation. The code itself being the documentation is always the ideal, as the documentation then always remains in sync with changes to the code. Without a declaration, clues such as the class name and the itemised list of properties are missing, as is the simple message that there is a class at all. You can add all of this as comments, but will it be present, kept up to date, and in some standard form, as it should be?

Error Prone: Keys are not checked for validity in many circumstances, even at run time, which can lead to hidden errors.

Extendability: With a class, as the program grows, methods and other features can be added that simply are not possible with a collection. Functions that use the collection, particularly with no type specification possible, simply do not result in as cohesive a solution as methods. Refactoring support is poor, since literals are used in place of identifiers, and tooling can be just as misled as people reading the code.

All of these problems are insignificant when a program is a small development by a single person that will not be maintained. However, the longer the program will live, and the more different people who may revise the code, the more significant they become.
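The ‘Error Prone’ point above can be illustrated with a small sketch. The misspelt name agee and the variable names are made up for this example, and the three-field Student class is only an assumed shape for the data.

```python
from dataclasses import dataclass


@dataclass
class Student:
    name: str
    age: int
    city: str


# Faux object: a misspelt key on assignment is accepted silently.
student_dict = {"name": "bill smith", "age": 23, "city": "new york"}
student_dict["agee"] = 24      # typo: adds a new key, "age" is unchanged
print(student_dict["age"])     # the intended update never happened

# Data class: an unknown field name is rejected when the object is built.
rejected = False
try:
    Student(name="bill smith", agee=24, city="new york")
except TypeError as error:
    rejected = True
    print("rejected:", error)
```

This only covers construction. A plain dataclass does not, at run time, stop a later assignment to a misspelt attribute on an existing instance, although type checkers and IDEs can flag it because the fields are declared.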

Why Do Faux Objects Exist?

There are four reasons for Faux Objects:

  • to use a collection to avoid the overhead of creating a class
  • to enable an object to be used as a collection
  • in some languages, the distinction can appear blurred
  • to dynamically alter an object at run time

The Use of a Collection to Avoid the Overhead of Creating a Class

Prior to the introduction of dataclasses, Python classes required more boilerplate than simply using a collection. In addition to an __init__ method, a __repr__ is almost always needed if there is any debugging, and many times an __eq__ method is useful, if not other comparisons as well. All of these methods are now automatically available with dataclasses.

In fact, the problem was so common that, in addition to the practice of using tuples, lists or dicts, other solutions to avoid needing to create a class from scratch include: namedtuple, attrs, typing.NamedTuple, namedlist, attrdict, plumber, objdict, and fields. All of these exist because creating a class the standard way involved too much boilerplate. This creates the opposite of the Python ideal: “There should be one-- and preferably only one --obvious way to do it.” (try: “import this” to see this text inside Python in idle). I think that is 11 ways, and the list is not exhaustive.

So, prior to Data Classes in Python, this overhead was often a compelling reason.  Now it is not.

To Enable A Collection Perspective of Data.

Every object is also just a collection of data and methods, as long as you can ignore the significance of specific fields. There are functions where ignoring the significance of the data makes sense, although these can inherently use the same code for every object, so most of them are already available and could be inherited from the base class of all objects. Examples are to_string(), where it is the reader rather than the program that sees the significance, or eq() for simple objects, which just compares all the data. The ‘beginner trap’ is to be misled into using a collection in order to enable this use case, when there are better ways to get access to what are, in the end, generic functions. Also, the collection view of objects is available anyway. In Python, asdict(object) will convert the data of a dataclass to a dictionary for this use, and object.__dict__ will obtain a dictionary of instance data from a standard object. It is that simple. So again, this is not a valid reason for faux objects.
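As a sketch of that collection view: the generic printout below accepts any mapping, and is fed once from a dataclass via dataclasses.asdict() and once from an ordinary object via vars(), which returns the object's __dict__. The PlainStudent class is a hypothetical hand-written equivalent, included only for comparison.

```python
from dataclasses import asdict, dataclass


@dataclass
class Student:
    name: str
    age: int
    city: str


class PlainStudent:  # hypothetical non-dataclass equivalent
    def __init__(self, name, age, city):
        self.name = name
        self.age = age
        self.city = city


def printout(fields: dict) -> None:
    # Generic: works on any mapping of field names to values.
    for key, value in fields.items():
        print(f"{key}: {value}")


printout(asdict(Student("bill smith", 23, "new york")))
printout(vars(PlainStudent("tom jones", 21, "san francisco")))
```

One difference worth knowing: asdict() builds a new dictionary (copying nested dataclasses, lists and dicts recursively), whereas __dict__ is the live dictionary behind the instance, so changes made to it change the object.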

In Some Languages, the distinction between Objects and Collections is confused.

In dynamically typed languages such as Python, JavaScript and Groovy, the line between maps (called dictionaries in Python) and objects can be blurred.

In JavaScript, objects were originally also used to store maps, and the notation for accessing data from maps and from objects is basically the same. In Groovy, maps allow entries to be accessed using dot notation, giving the impression that a program is accessing a very dynamic object when the data is in fact a map.

And in Python, classes and objects are all built using dictionaries underneath. But in each language it is recognised that maps and objects have discrete purposes, and clean code will use them as intended. Even with JavaScript, where originally maps and objects were interchangeable, the language was extended (with the Map type, added in ES2015) to provide purpose-specific solutions.

Despite the similarities with dynamic types, there are distinct objects and maps/dictionaries in each language, allowing use of the appropriate map for collections, and classes for objects.  But because of those similarities, when porting code from one language to another, it can be difficult to determine what is an object or collection in the original language, resulting in the wrong selection between object/collection in the converted code.

Dynamically Altering Classes At Run Time.

Dynamically typed languages actually allow classes to be altered at run time! So this is not a valid reason to use a map in place of a class unless coding in a statically typed language, and even then, the reason why a class is to be changed at run time needs a very strong case. Unless it has been really thought out and the alternatives eliminated, do not do this anyway.

Data Classes

What is a Data Class?

It is worth reviewing the talk given by one of the Python team introducing dataclasses at the Python conference.

The main concept is that, despite the name, dataclasses are for any purpose, not just for holding data. Dataclasses come pre-configured with constructors, string conversion, comparison and, optionally, even more functionality. All the methods needed to start are provided, so just add the data definition and you have a class.

Data classes are full regular classes, just with some prefab methods already in place, and those methods can be replaced if desired, so generally, any class can be built using a data class.

Kotlin Data Classes

Here is a Kotlin data class for the ‘Student’. In fact the data class definition could finish at the ‘)’ on the first line, but the example adds an alternate constructor to allow instancing a student from a list of strings, as would be returned by split(",").

data class Student(val name: String, val age: Int, val city: String) {
    // secondary constructor, added only as an example, to allow instancing
    // from a list of strings; trim() removes spaces left over from split(",")
    constructor(list: List<String>) : this(list[0].trim(), list[1].trim().toInt(), list[2].trim())
}

The only required overhead, compared with using a list of strings to store the data for each Student, is the data class declaration itself. This is not really overhead, as the program would otherwise need a comment explaining how things are stored, and a comment would have all the problems of comments and be no shorter. It would still have to explain that the data is to consist of these three fields!

The standard constructor is automatic: simply provide three parameters and they will be applied to the declared data. The optional extra constructor provided here does make it easier to work with data extracted from an input line and then split(), but it is not required for the class. More details on the Kotlin data class are here; this section is just to get two points across:

  • data classes remove any reason for faux objects
  • data classes can be used for any class, but the larger the class, the smaller the payback over just a standard class

Python Dataclasses

Here is the Python syntax for dataclasses. The ‘from … import’ is only required once per file, so it is at least ‘shared boilerplate’. The code declares each ‘variable’ within the class on a separate line, generating more lines, but with simple and concise code.

No __post_init__ method is required for data classes, but it does allow code to be added to the automatically generated __init__ method, while still making use of the automatic __init__. This __post_init__ ensures age actually is an int, making code using the class simpler.

from dataclasses import dataclass

@dataclass
class Student:
    name: str
    age: int
    city: str

    def __post_init__(self):  # not needed...example only
        self.age = int(self.age)

Almost all of the code is informative, declaring what the data is to language and reader alike. Yes, the type information is not enforced by Python itself at run time, and can therefore be misleading if the code does not follow what is declared, but it does allow a declaration, and there is tooling available to check type consistency. A more complete guide to dataclasses is available here. This page is focused on why to use dataclasses, and the official documentation already covers ‘how’.
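To show the class in use, here is a small sketch that builds Student objects directly from split input lines. The sample lines are made up in a “name, age, city” form to match the three declared fields, and the extra strip() calls in __post_init__ are an addition for this example, not part of the class above.

```python
from dataclasses import dataclass


@dataclass
class Student:
    name: str
    age: int
    city: str

    def __post_init__(self):
        # Values arriving from split() are strings, possibly with stray
        # spaces, so normalise them once here.
        self.name = self.name.strip()
        self.age = int(self.age)
        self.city = self.city.strip()


# Hypothetical input lines in "name, age, city" form.
lines = ["bill smith, 23, new york", "tom jones, 21, san francisco"]

students = [Student(*line.split(",")) for line in lines]

for student in students:
    print(student)  # uses the generated __repr__

adults = [s.name for s in students if s.age >= 18]
print(adults)
```

Code using the class can now write s.age >= 18 with no int() conversion and no literal key. A static type checker may complain that strings are being passed where an int is declared, which is a fair warning about this shortcut.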

Conclusion: When to move from collections & faux objects to classes?

In coding solutions to problems, the choice of how to store the data can be a choice between a custom class for that data, or simply storing the data with a ready made collection creating a ‘faux object’.

The problem with the ‘faux object’ is that the data declaration basically states that all fields are the same, and then comments are required to declare why that is not the case, as in the example with “name” and “age”, which are clearly fields that are not the same. The result is that real classes mean less code overall, and a clearer indication of intent.

All classes stored as collections should be converted as soon as possible to data classes.
