Data Classes: The end of ‘Faux Objects’?

This page introduces the concept of ‘Faux Objects’: the use of a collection to avoid the need to declare a class to represent data. It then introduces data classes, the solution that removes the barrier to declaring a class and storing the data as an object.

Using an already existing collection type to hold data, in place of creating a class, avoids the need to:

  • declare a class
  • define a ‘constructor’ (the __init__() method in Python)
  • create a conversion to string or a __repr__()
  • create comparison operators

Lists, tuples, dicts etc have everything in place already, so why not simply use one of these as a ‘readymade’ solution?

The answer is that a collection is designed to hold a plurality of the same kind of data, while an object holds different pieces of data which each have their own unique role.  Using a collection (such as a list, dict or tuple) in place of an object obscures the roles of the specific components of the data.

The contention of this page is that with first Kotlin, and now Python, featuring data classes that reduce all of these steps to just one, it is time to stop coding classes as collections and, where possible, to refactor old code guilty of this crime against readable code.

  • When is a collection actually a ‘Faux Object?’
  • The Dual Nature of Data: Object and Collection.
  • Spotting Genuine Collections vs ‘Faux Objects’
  • Why do Faux Objects exist?
  • Kotlin Data Classes
  • Python Data Classes
  • Conclusion: When to move from faux collections to classes?

‘faux objects’: when is a collection a faux object?

All collections (dictionaries, lists, tuples, named tuples etc) have indexes. In the case of dictionaries these are called ‘keys’, but they appear in [] and behave as the equivalent of an index, in that they are used to retrieve items from the collection.

A faux object is a collection where the index value determines the significance of an item in the collection to the program code.

A genuine collection is a collection where the value of the index has no impact on the program code.

Consider some data read from a file to describe some students at an online university.  Each line of the file has ‘first name’, ‘last name’, age, and city.

So two lines of the file might be:

bill, smith, 23, new york
tom, jones, 21, san Francisco

Consider this data read into a program as a list of lines, with each line in turn split using ‘,’ to form a list of values for that line.

This could be considered a collection of ‘students’, with each student element being a list of ‘attributes’: a collection (a list) in which each element is also a collection (again a list).

Now consider a function to print the full name of a student.  This function would need to work specifically with field[0] and field[1] of a student collection, creating behaviour specific to those indexes, as no other index plays the same role.  A function such as this, which assigns specific significance to fields 0 and 1, has made each student collection a ‘faux object’, because each item has its own significance.  The data is no longer a collection of ‘like’ fields, but of fields that are each specific in nature.
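A minimal sketch of such a function, assuming each student is the list of strings produced by splitting a line on ','. The name full_name and the sample data are illustrative only.

```python
# Each student is a plain list of strings: [first, last, age, city].
students = [
    ["bill", "smith", "23", "new york"],
    ["tom", "jones", "21", "san Francisco"],
]

def full_name(student):
    # Indexes 0 and 1 are hard-wired: only they mean "first" and "last".
    return f"{student[0]} {student[1]}"

for student in students:       # 'students' is used as a genuine collection
    print(full_name(student))  # each 'student' is used as a faux object
```

Nothing in the data itself says that index 0 is a first name; that knowledge lives only inside full_name().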

The Dual Nature of Data: Object and Collection.

Consider the same example of ‘students’ as above, only this time, for variation, each line is processed with: dict(zip(("first", "last", "age", "city"), line.split(",")))

This would give a dictionary as the collection for each person. Now consider two functions to process a student: printout and is_adult.  Now we have:


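students = []
keys = "first", "last", "age", "city"
with open("data") as lines:
    for line in lines:
        # strip each value, since the file has a space after each comma
        values = [value.strip() for value in line.split(",")]
        students.append(dict(zip(keys, values)))

def printout(student):
    text = ""  # named 'text' to avoid shadowing the built-in str
    for k, v in student.items():
        text += f"{k}: {v}\n"
    print(text)

def is_adult(student) -> bool:
    return int(student["age"]) >= 18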

The printout() function treats the ‘student’ data as a genuine collection: from the perspective of printout(), all the data is of the same nature.  No item in the collection is treated any differently from any other item, which makes the data a genuine collection.  Conversely, the is_adult() function is based on the special meaning of a specific entry in the ‘collection’: it requires that there is an ‘age’ item and that age has a numeric value.  Therefore is_adult() is treating the collection as an object, so the data in the collection is a ‘faux object’:  data which represents an object, but is stored as a collection.

So even data that is, for many purposes, by nature a specific type of object can also have functionality that is generic in nature.  But functions which view the data as generic data are functions that are common to all objects.  So why not have a type of object that already comes with a set of generic functions?  Enter data classes: all the generic functions of a collection, but with the specific data definition of an object.

Spotting ‘Faux Objects’ vs Genuine Collections

The Declarable Fields.

If you can give an individual, logical and non-generic name to each element of data, then that data is logically an object, not a collection. An example would be a dictionary that can be defined as having the elements “first name”, “last name”, “age” and “city”. This is clearly not a collection of like data: each element has a specific usage.

Naming: Singular vs Plural

The concept of a collection is a plurality of items.  A collection of items is a plural of something. The file can be considered a collection of lines. Converting those lines to persons gives a collection of persons where each entry is a person.

Each student can be considered a collection of fields in the case of the printout() function, where all fields are treated in the same manner.  But the is_adult() function is based on ‘age’ being not just another field, but having specific significance.  Considering the details as first name, last name, age and city does not give any ‘plurality’.  The only plural name available would be ‘group_of_fields’, which fits only if the data is considered to be nothing more than fields.  But an entry in the collection ‘students’ would be named ‘student’, not ‘group_of_fields’, indicating that the main idea of the data is to consider a student not as a collection (which would be plural) but as an object (which is singular).

Faux Objects: Literals as Indexes

Another strong indicator that a collection is in fact a ‘faux object’ is the use of literals as indexes or keys. In a true collection, the index will generally come from a variable, because the item to work with is decided by input, by iterating through a loop, or by some other data in the program, as no entry in a collection is inherently different to any other.  Literals for indexes/keys will normally only be found in test routines: “does the second one have the value it should?”.

Note that the printout() function, which does treat the ‘student’ collection as a genuine collection, makes no use of literal keys.

Contrast this with the is_adult() function, which uses the key “age” as a literal value. Faux objects built using collections make use of literal values for indexes/keys, as the code for ‘student’ will make special use of each attribute of a student collection. By contrast, with the ‘students’ collection, most code would either loop through all students or, where a specific student is to be displayed or edited, determine which student from login data or from a selection made by data entry.  Code ‘hard wired’ to work specifically with students[2] would be suspicious, while code that works specifically with the age field of a student, which happens to be field[2] within a student record, is to be expected.
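The difference between variable keys and literal keys can be seen side by side in a short sketch. The names labels and age_next_year are illustrative only.

```python
student = {"first": "bill", "last": "smith", "age": "23", "city": "new york"}

# Genuine-collection style: the key comes from a loop variable, and every
# entry is handled in exactly the same way.
labels = [f"{key}: {value}" for key, value in student.items()]

# Faux-object style: the key is a literal, because "age" has a role
# that no other entry shares.
age_next_year = int(student["age"]) + 1

print(labels)
print(age_next_year)
```

The first use would work unchanged on any dictionary; the second only works on something that is really a student.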

The Reasons To Avoid Faux Objects

The problems are:

Documentation: What the program is doing is simply less clear. Using a collection means there is no declaration of the class, and that declaration in code provides irreplaceable documentation. Without a declaration, clues such as the class name and the itemised list of properties are missing, as is the simple message that there is a class.

Error Prone: Keys are not checked for validity in many circumstances, even at run time, which can lead to hidden errors.

Extendability: With a class, as the program grows, methods and other features can be added that simply are not possible with a collection. Functions that use the collection, particularly when no type can be specified, simply do not result in as cohesive a solution as methods.  Refactoring support is poor, since literals are used in place of identifiers, and tooling can be just as misled as people reading the code.

All of these problems are insignificant when a program is a small development by a single person and will not be maintained.  However, the longer the program will live, and the more different people who may revise the code, the more significant they become.
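The ‘Error Prone’ point can be illustrated with a deliberately misspelled key. This is only a sketch: the Student class and the typo “agee” are made up for the example, and it covers construction only, since a plain dataclass will still accept assignment to a misspelled attribute after the object exists.

```python
from dataclasses import dataclass

@dataclass
class Student:
    name: str
    age: int
    city: str

# Dictionary: the misspelled key "agee" is accepted without complaint.
as_dict = {"name": "bill smith", "agee": 23, "city": "new york"}
print("age" in as_dict)  # the mistake only surfaces when "age" is read

# Data class: the same typo is rejected as soon as the object is built.
try:
    Student(name="bill smith", agee=23, city="new york")
    rejected = False
except TypeError as err:
    rejected = True
    print(f"rejected: {err}")
```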

Why do Faux Objects exist?

There are four reasons for Faux Objects:

  • to use a collection to avoid the overhead of creating a class
  • to enable an object to be used as a collection
  • in some languages, the distinction can appear blurred
  • to dynamically alter an object at run time

The Use of a Collection to Avoid the Overhead of Creating a Class

Prior to the introduction of dataclasses, Python classes required more boilerplate than simply using a collection.  In addition to an __init__() method, a __repr__() is almost always needed if there is any debugging, as is an __eq__() method, if not other comparisons.  All of this is now automatically available with dataclasses.

In fact, the problem was so common that, in addition to the practice of using tuples, lists or dicts, other solutions to avoid needing to create a class from scratch include:  namedtuple, attrs, typing.NamedTuple, namedlist, attrdict, plumber, objdict, and fields.  All of these exist because creating a class the standard way involved too much boilerplate.  This creates the opposite of the Python ideal: “There should be one-- and preferably only one --obvious way to do it.”  (try: “import this” to see this text inside Python in IDLE). I think that is 11 ways, and the list is not exhaustive.

So, prior to Data Classes in Python, this overhead was a valid reason.  Now it is not.
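To make the size of that overhead concrete, here is a sketch of the same three-field class written both ways. StudentByHand is a hypothetical name for the pre-dataclass version, and the hand-written methods are one reasonable way to write them, not the only one.

```python
from dataclasses import dataclass

class StudentByHand:
    """The boilerplate that was needed before dataclasses."""

    def __init__(self, name, age, city):
        self.name = name
        self.age = age
        self.city = city

    def __repr__(self):
        return (f"StudentByHand(name={self.name!r}, "
                f"age={self.age!r}, city={self.city!r})")

    def __eq__(self, other):
        if not isinstance(other, StudentByHand):
            return NotImplemented
        return (self.name, self.age, self.city) == (other.name, other.age, other.city)

@dataclass
class Student:
    """The same class: __init__, __repr__ and __eq__ are generated."""
    name: str
    age: int
    city: str

print(StudentByHand("bill smith", 23, "new york"))
print(Student("bill smith", 23, "new york"))
```

In the dataclass version every line that remains is declaring data, which is the only part a list or dict could not have declared anyway.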

To Enable A Collection Perspective of Data.

Every object is also just a collection of data and methods, as long as you can ignore the significance of specific fields.  There are functions where ignoring the significance of the data makes sense, although these can inherently use the same code for every object, so most of them are already available and can be inherited from the base class of all objects.  Examples are to_string(), where the reader sees the significance rather than the program, or eq() for simple objects, which just compares all the data.  The ‘beginner trap’ is to be misled into using a collection in order to enable this use case, when there are better ways to get access to what in the end are generic functions.  Also, the collection view of objects is available anyway. In Python, dataclasses.asdict(obj) will convert the data of a dataclass instance to a dictionary for this use, and obj.__dict__ will obtain a dictionary of instance data from a standard object.  It is that simple. So again, this is not a valid reason for faux objects.
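As a sketch of that collection view, the generic printout() from earlier can be written once for any dataclass instance. The Student class here is assumed for the example; asdict() and vars() are standard Python.

```python
from dataclasses import dataclass, asdict

@dataclass
class Student:
    name: str
    age: int
    city: str

def printout(obj):
    # Generic: works for any dataclass instance, and no field is special.
    for key, value in asdict(obj).items():
        print(f"{key}: {value}")

student = Student("bill smith", 23, "new york")
printout(student)

# For an ordinary instance, vars(obj) returns obj.__dict__.
print(vars(student) == asdict(student))
```

One difference worth knowing: asdict() builds a new dictionary, recursing into nested dataclasses, while vars() returns the instance's own attribute dictionary.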

In Some Languages, the distinction between Objects and Collections is confused.

In dynamically typed languages such as Python, JavaScript and Groovy, the line between maps (called dictionaries in Python) and objects can be blurred.

In JavaScript, objects were originally also used to store maps, and the notation for accessing data from maps and objects is basically the same.  In Groovy, maps allow entries to be accessed using dot notation, giving the impression that a program is accessing a very dynamic object, when the data is a map.

And in Python, classes and objects are all built using dictionaries underneath. But in each language it is recognised that maps and objects have distinct purposes, and clean code will use them as intended. Even in JavaScript, where originally maps and objects were interchangeable, the language was extended to provide purpose-specific solutions.

Despite the similarities with dynamic types, there are distinct objects and maps/dictionaries in each language, allowing use of the appropriate map for collections, and classes for objects.  But because of those similarities, when porting code from one language to another, it can be difficult to determine what is an object or collection in the original language, resulting in the wrong selection between object/collection in the converted code.

Dynamically Altering Classes At Run Time.

Dynamically typed languages actually allow classes to be altered at run time! So wanting to alter an object at run time is not a valid reason to use a map in place of a class unless coding in a statically typed language, and even then, the reason why a class is to be changed at run time needs a very strong case. Unless it has really been thought out and the alternatives eliminated, do not do this anyway.

Data Classes

What is a Data Class?

It is worth reviewing the talk given by one of the Python team introducing dataclasses to the Python Conference.

The main concept is that, despite the name, dataclasses are for any purpose, not just for holding data.  Dataclasses come pre-configured with constructors, string conversion, comparison and optionally even more functionality. All the methods needed to start are provided, so just add the data definition, and you have a class.

Data classes are full regular classes, just with some prefab methods already in place, and those methods can be replaced if desired, so generally, any class can be built using a data class.

Kotlin Data Classes

Here is a Kotlin data class for the ‘Student’. In fact the data class definition could finish at the ‘)’ on the first line, but the example adds an alternate constructor to allow instancing a student from a list of strings, as would be returned by split(",").


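data class Student(val name: String, val age: Int, val city: String) {
  // Alternate constructor: build a Student from the strings returned by split(",").
  // trim() removes the space after each comma, which toInt() would reject.
  constructor(list: List<String>) : this(list[0].trim(), list[1].trim().toInt(), list[2].trim())
}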

The only required overhead, compared with using a list of strings to store the data for each Student, is the data class declaration itself.  This is not really overhead, as the program would otherwise need a comment explaining how things are stored, and a comment would have the usual problems of comments and be no shorter. It would still have to explain that the data consists of these three fields!

The standard constructor is automatic: simply provide three parameters and they will be applied to the declared data.   The optional extra constructor provided here does make data extracted from an input line and then split() easier to work with, but is not required for the class. More details on the Kotlin data class are here; this section is just to get two points across:

  • data classes remove any reason for faux objects
  • data classes can be used for any class, but the larger the class, the smaller the payback over just a standard class

Python Dataclasses

Here is the Python syntax for dataclasses. The ‘from … import’ is only required once per file, so it is at least ‘shared boilerplate’.  The code declares each ‘variable’ within the class on a separate line, generating more lines, but with simple and concise code.

No ‘__post_init__’ method is required for data classes, but it does allow additional code to be added to the automatically generated __init__() method, while still making use of the automatic __init__().   This ‘__post_init__’ ensures age actually is an int, making code using the class simpler.

from dataclasses import dataclass

@dataclass
class Student:
    name: str
    age: int
    city: str

    def __post_init__(self):
        # runs after the generated __init__; accepts age as "23" or 23
        self.age = int(self.age)

Almost all of the code is informative, declaring what the data is to both language and reader alike. Yes, the type information is not enforced by the Python language itself at run time, and can therefore be misleading if the code does not follow what is declared, but it does allow a declaration, and there is tooling available to check consistency.  A more complete guide to dataclasses is available here. This page is focused on why to use dataclasses, and the official documentation already covers ‘how’.
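As a brief sketch of what that declaration buys, here is the class in use with one line of the student data. The sample line, with first and last name already combined into a single name field, is an assumption made for the example.

```python
from dataclasses import dataclass

@dataclass
class Student:
    name: str
    age: int
    city: str

    def __post_init__(self):
        self.age = int(self.age)

# Split one input line and hand the pieces straight to the generated __init__.
fields = [part.strip() for part in "bill smith, 23, new york".split(",")]
student = Student(*fields)

print(student)                                           # generated __repr__
print(student == Student("bill smith", 23, "new york"))  # generated __eq__
print(student.age >= 18)                                 # age is already an int
```

Compared with the dictionary version, is_adult() no longer needs its own int() conversion, and the attribute name age is an identifier rather than a string literal.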

Conclusion: When to move from faux collections to classes?

In coding solutions to problems, the choice of how to store the data can be a choice between a custom class for that data, or simply storing the data with a ready made collection.

The problem with the ‘faux collection’ is that the data declaration basically states that all fields are the same, and then comments are required to explain why that is not the case, as in the example with “name” and “age”, which are clearly fields that are not the same.  The result is that real classes mean less code overall, and a clearer indication of intent.

All classes stored as collections should be converted as soon as possible to data classes.

 
