Functional Data Science for Titanic Dataset #1: Scala, monads, and functors in functional programming for data science

Angelica Tiara
9 min read · Jun 29, 2023


This piece is part 1 of a 2-part series of articles.

Looking at the title, you might be thinking: ‘there are many easier ways to do ETL in data science.’ Frankly speaking, I understand that completely as someone who is in the habit of using the object-oriented paradigm (a.k.a. the bread and butter of every aspiring data scientist: Python). However, I also believe that the key to overcoming any difficulty lies in a word from the previous sentence: habit. Once we get into the habit of using something new, it should become easier with time.

Hence, if you are looking to grasp the concepts of functional programming in data science, along with a no-nonsense, very practical application of it, you have come to the right place, because I have the same mindset as well.

Now why do we apply functional programming for a data science job in the first place, especially when ‘easier’ (relatively speaking) and more popular options are available?

Functional programming vs Object-oriented

It’s me, crying while poring over the Scala and Haskell textbooks.

Object-oriented programming (OOP) is currently the most widely used paradigm, and arguably the most popular as well. In 1981, when OOP was in its initial stages, David Robson wrote, “Many people who have no idea how a computer works find the idea of object-oriented programming quite natural. In contrast, many people who have experience with computers initially think there is something strange about object-oriented systems.” It was published in Byte magazine that year, and for many, it became the unofficial introduction of object-oriented software systems to the tech world.

And his words read as quite prophetic today. The idea of organizing code into objects (as instances of classes) with their own attributes (data) and behaviors (methods/functions) is very popular for its faster development cycles and more maintainable codebases. Hence, many developers around the world are introduced to OOP first in their education (me included). It doesn’t help that OOP is also promoted by Java, C++, and Python, all of which are very popular programming languages used in many systems.

In the context of data science, Python is usually the first programming language used to train early data scientists. It’s simple to read and has a relatively low learning curve (at the early stage of learning data science, it’s better to grasp the mathematical logic behind each algorithm rather than worry about syntax errors), and of course, Python offers a huge number of libraries for big data. You only need to call a function, or even a whole algorithm, with a simple line, and voila! You have applied linear regression to your dataset; see below ^.^

# scikit-learn's LinearRegression fits a line to the training data
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4], [5]]
y = [2, 4, 6, 8, 10]

model = LinearRegression()
model.fit(X, y)

# predict on an unseen input
new_data = [[6]]
predicted_output = model.predict(new_data)

print("Predicted output:", predicted_output)

I started off with an engineering education and background before specializing in data science, so I completely relate to this point of view and could probably write a thesis defense on the benefits of OOP. But after about a year of working professionally in the data science sphere, I’m gaining more and more appreciation for the other side of the coin: functional programming.

It started when I began researching Scala and tried to put what I learned to use; after all, what’s the use of knowledge without application and action?

Functional programming (FP), also known as functional data science when applied to DS methodologies, treats computation as the evaluation of mathematical functions and avoids changing state and mutable data. As opposed to OOP, FP treats mathematical and algorithmic functions as the main actors: they can be assigned to variables, passed as arguments, and returned as results. Immutability is the main principle of FP, which means that data doesn’t change after creation, and pure functions produce the same output for the same input without any side effects.
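To make these two principles concrete, here is a minimal Scala sketch (the function and value names are purely illustrative):

// A pure function: the output depends only on the input,
// and nothing outside the function is modified.
def fahrenheitToCelsius(f: Double): Double = (f - 32.0) * 5.0 / 9.0

// Immutability: a val cannot be reassigned, and List is an immutable collection.
val temps = List(32.0, 68.0, 104.0)

// map returns a NEW list; the original `temps` is untouched.
val celsius = temps.map(fahrenheitToCelsius) // List(0.0, 20.0, 40.0)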

The choice between FP and OOP has been a well-known point of contention between theoretical mathematicians and practical engineers. There is no right or wrong in this debate, only a matter of perspective and objective. What’s the goal? If it’s to build a codebase that is easier to organize, understand, maintain, and reuse, and to find support within a rich ecosystem of libraries and frameworks, then OOP is arguably more practical to use. On the other hand, if the goal is more control over data integrity through immutability (data is not modified after creation, so the risk of unintended side effects or data corruption is greatly reduced) and powerful data transformations through function chaining and higher-order functions, then FP, with its rigorously mathematical logic, is a relatively better choice.

To illustrate the application of FP in data science methodologies, let’s do a classical ETL (read: extract, transform, and load) data processing job using Scala on a very popular dataset among data scientists: the Titanic survival data. (Note: at the time of writing, the news is buzzing with the implosion of the Titan submersible, which tragically took the lives of five people. I think it’s quite poetic to do yet another Titanic dataset analysis.)

Scala, monads, and functors

For this analysis, we’re going to use Apache Spark, an open-source engine for big data processing, with Scala as our programming language and Jupyter Notebook as our medium. As for the algorithm, we’ll try a Random Forest classifier for simplicity’s sake.

So best be prepared to install Apache Spark on your PC/laptop/device before starting this quest.
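Once Spark is installed, a session can be spun up in Scala roughly like this (a minimal sketch for a local setup; the app name is a placeholder):

import org.apache.spark.sql.SparkSession

// Build a local SparkSession; "local[*]" uses all available cores.
val spark = SparkSession.builder()
  .appName("titanic-fds")
  .master("local[*]")
  .getOrCreate()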

Meanwhile, as we’re delving into functional data science (let’s call it FDS), a concept related to FP that essentially applies FP’s principles of immutability, purity, and composability to data science jobs, it’s worth noting some of the more general concepts in this area.

Monads and functors

A monad is essentially a design pattern: like a magical box that lets you perform operations on a value without directly interacting with it.

Monads come from a theoretical mathematical concept, category theory, in the realms of abstraction. Think of categories as big worlds where objects live, and these objects can be anything from numbers and shapes to more complex things like groups or functions. These categories have special arrows (called morphisms) that define how these objects are related, connected, or transformed. Two arrows can be joined to create a new arrow, which is called composition: a way to chain and combine different relationships or transformations.

If you come from OOP, hearing ‘objects’ might give you déjà vu, but there are significant differences. Objects in category theory are far more abstract; they represent many mathematical and conceptual elements and serve as the building blocks of categories. Objects in OOP, on the other hand, are concrete instances of classes representing real-world or software entities, and they encapsulate both data (state) and behavior (methods).

Now, the mapping between categories, which preserves 1) the structure of the objects and 2) the relationships between objects and arrows, is called a functor. Monads are a special type of functor: a monad is a functor equipped with additional structure, namely two natural transformations called unit and join (in programming terms, return and bind, also known as “>>=” or “flatMap”). These transformations define the behavior of the monad and allow for sequencing and composition of computations within the category.

Essentially, monads parallel objects in OOP only in terms of encapsulation. While objects in OOP encapsulate state and behavior as a structured way to manipulate and interact with data, monads encapsulate computational or transformational logic within the category.

Note that within the category is the key phrase here.

Since monads are essentially a special kind of functor that operates within the category, we can call monads a kind of endofunctor. There is a famous saying, “a monad is just a monoid in the category of endofunctors” (after Saunders Mac Lane in his book “Categories for the Working Mathematician”). Many complex explanations of it can be made, but essentially, in the context of monads as a kind of endofunctor, they operate much like a monoid, which is a general algebraic concept of combining and identity.

Monoids, in practical use:

(set, binary-associative-operation, zero)

(Int, +, 0)
(Int, *, 1)
(String, +, "")
(List, ++, Nil)
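As a sketch in Scala (illustrative; the trait mirrors the one found in libraries like Cats, but this is a standalone definition covering the instances listed above):

// A monoid: a set with an associative binary operation and an identity element.
trait Monoid[A] {
  def combine(x: A, y: A): A // the binary associative operation
  def zero: A                // the identity element
}

val intAddition = new Monoid[Int] {
  def combine(x: Int, y: Int) = x + y
  def zero = 0
}

val stringConcat = new Monoid[String] {
  def combine(x: String, y: String) = x + y
  def zero = ""
}

// Folding a list with a monoid: combine(combine(combine(0, 1), 2), 3) == 6
val sum = List(1, 2, 3).foldLeft(intAddition.zero)(intAddition.combine)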

Functors: (this concept is also present as the map() function in many OOP languages)

Let "fa" be a Functor of type F[A] and "f" and "g" be functions of type "f: A => B" and "g: B => C"
1. Identity: fa.map(a => a) == fa
2. Composition: fa.map(f).map(g) == fa.map(g compose f)
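We can check both laws concretely with Scala’s List playing the role of the functor F (a quick sketch):

val fa: List[Int] = List(1, 2, 3)
val f: Int => Int = _ + 1
val g: Int => Int = _ * 2

// 1. Identity: mapping the identity function changes nothing
assert(fa.map(a => a) == fa)

// 2. Composition: mapping f then g equals mapping their composition
assert(fa.map(f).map(g) == fa.map(g compose f))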

So, a category of endofunctors is a category where the objects are endofunctors and the arrows are natural transformations. When the objects are types, arrows are functions. When the objects are functors, arrows are natural transformations.

Let’s see this in a practical sense:

For types A and B, the arrow is a function:
f: A => B

For functors A[_] and B[_], the arrow is a natural transformation:
nat: A[T] ~> B[T] (one mapping that works uniformly for every type T)
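A concrete natural transformation in Scala is the conversion from Option to List, which works uniformly for every element type (a small sketch):

// A natural transformation Option ~> List: one function that
// works for every type T without inspecting the elements.
def optionToList[T](opt: Option[T]): List[T] = opt match {
  case Some(t) => List(t)
  case None    => Nil
}

optionToList(Some(42))             // List(42)
optionToList(None: Option[String]) // List()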

So ‘a monad is a monoid in the category of endofunctors’ can be illustrated in the most practical way as:

Monoids == (set, binary-associative-operation, zero)

Thus, (endofunctors, F[F[A]] ~> F[A], Id[A] ~> F[A])
Elements: endofunctors
Binary-associative-operation: F[F[A]] ~> F[A] (i.e., join/flatten)
Zero (element): Id[A] ~> F[A] (from the identity functor - it does nothing :D)
--- for example: given ida: Id[A] (which simply wraps a value a) and a function "f: A => B",
--- ida.map(f) == f(a), which is the same as applying the function directly.

Hence,
>>> Id[F[A]] == F[A]; F[Id[A]] == F[A]

so "Monoid in the category of endofunctors" == "Monads"

These are generally the most important concepts before we jump into functional programming and data science.

You might be asking now, why do we need to understand these abstract concepts?

One simple reason: at its core, functional programming is programming by combining functions.

We bundle a bunch of simple mathematical functions, then combine them into more and more complex functions until we have a full program.

Let’s do a simple practice of combining the simplest of functions:

Given the functions:
f: A => B
g: B => C
h: C => D

One's output is the other's input, because the types align with each other.
Combination:
h(g(f(a))) //this produces a value of type D, as a |> f |> g |> h

As long as the types align and fit with each other in a sequential way, then it’s easy to directly combine the functions.
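In Scala, this direct composition looks like the following (an illustrative sketch with concrete types standing in for A, B, C, D):

val f: Int => String     = _.toString        // A => B
val g: String => Double  = _.length.toDouble // B => C
val h: Double => Boolean = _ > 1.0           // C => D

// Nested application...
val r1: Boolean = h(g(f(42)))

// ...or the same pipeline built with andThen
val pipeline: Int => Boolean = f andThen g andThen h
val r2: Boolean = pipeline(42) // true: "42".length == 2, and 2.0 > 1.0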

But what if they don’t align or don’t have the same types / contexts / effects?

Practice example:

Given the functions:
fl: A => List[B] //List[B]: the context where we can have 0, 1 or many Bs
fo: A => Option[B] //Option[B]: the context where we can have 1 B or none
ff: A => Future[B] //Future[B]: the context where we will possibly have a B in the future (asynchronous operation)
fe: A => Either[X,B] //Either[X,B]: the context where we can have a B or an X, but never both at the same time
ft: A => Try[B] //Try[B]: the context where we can have a B or an Exception

Let's choose the 'List[_]' context and re-arrange the functions accordingly:
f1: A => List[B]
g1: B => List[C]
h1: C => List[D]

We cannot do
h1(g1(f1(a)))
a |> f1 |> g1 |> h1
on these functions. The types and contexts don't fit.
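Concretely, the composition breaks because g1 expects a plain B while f1 hands it a List[B]. In Scala (an illustrative sketch with concrete stand-in types):

val f1: Int => List[String]    = n => List(n.toString, (-n).toString)
val g1: String => List[Double] = s => List(s.length.toDouble)

// g1(f1(1)) // does NOT compile: g1 expects a String, but f1 returns List[String]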

This is where Monads come in.

For example, a monad - represented by M -
can be defined as a type constructor with a type parameter:

M[_]

Then we lift a normal value of type A into the monad via unit: A => M[A]
M(a) or M.unit(a)

Now we want to do a combination of function types: "M[A] => (A => M[B]) => M[B]"
This might be commonly called bind, flatMap, >>=, etc.
def flatMap[B](f: A => M[B]): M[B]

Let "ma" be a Monad of type M[A] and "f" and "g" be functions of type "f: A => M[B]" and "g: B => M[C]"
//1. Left Identity: M.unit(a).flatMap(f) === f(a)
//2. Right Identity: ma.flatMap(M.unit) === ma
//3. Associativity: ma.flatMap(f).flatMap(g) === ma.flatMap(a => f(a).flatMap(g))

Given the functions:
f2: A => M[B]
g2: B => M[C]
h2: C => M[D]

Thus:
M(a).flatMap(f2).flatMap(g2).flatMap(h2) //this will produce a value of type M[D]
satisfying:
M(a) >>= f2 >>= g2 >>= h2
which mirrors basic function composition:
a |> f |> g |> h
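With List as the monad M, the whole chain runs in plain Scala like this (a sketch; f2, g2, and h2 are arbitrary illustrative functions):

val f2: Int => List[String]  = n => List(n.toString, n.toString * 2)
val g2: String => List[Int]  = s => List(s.length)
val h2: Int => List[Boolean] = n => List(n % 2 == 0)

// unit lifts the starting value, then flatMap sequences each step
val result: List[Boolean] = List(1).flatMap(f2).flatMap(g2).flatMap(h2)
// f2(1) = List("1", "11"); lengths = List(1, 2); parities = List(false, true)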

This is why monads are the essential building blocks of FP and FDS. In a programming methodology where we combine mathematical functions under a set of rules given by category theory, we need monads to preserve the structure, the context/effect, and the relationships/transformations between functors.

We now have the basic — and most important — idea about the concepts behind functional methodology.

Now it’s time for the more practical application in data science.

Next:

Functional Data Science for Titanic Dataset #2: using Spark with Scala for ETL, data imputation, and feature engineering


Angelica Tiara

I have two personas: the scientist and the ballerina. Coding data algorithms by day, spinning pirouettes by night.