AS/A Computer Science: Big data


What is big data?

'Big Data' is a catch-all term for data that won't fit the usual containers. Big Data can be described in terms of:

volume - too big to fit into a single server
velocity - data is streamed continuously and must be responded to within milliseconds to seconds
variety - data in many forms such as structured, unstructured, text, multimedia

The most difficult aspect of Big Data is its lack of structure. This poses two challenges. First, unstructured data is much harder to analyse, so machine learning techniques are needed to discern patterns and extract useful information. Second, relational databases are not appropriate: they require the data to fit a row-and-column format, and they don't scale well when the data won't fit on one server.

Velocity

Data from networked sensors, smartphones, video surveillance, mouse clicks and more is continuously streamed. This generates a staggering amount of data. Below are some selected statistics from the 2022 internet minute survey, which estimates the rate at which data is being generated.

Distributed systems

When a data set is too big to fit on a single server, the processing must be distributed across more than one machine. Functional programming is a solution because it makes it easier to write correct and efficient code that runs across more than one server.

Functional languages support this better for the following reasons:

They use immutable data structures: an object's state can't be changed once it is created, so a function given the same data always returns the same result
They are stateless: functions have no side effects, so it does not matter how often or in what order they are called
They support higher-order functions: a higher-order function takes one or more functions as a parameter and/or returns a function as a result
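
The three properties above can be illustrated with a small sketch. This is an illustrative example only, written in Python rather than a functional language; the names count_words, merge and chunks are made up for the example, and the two tuples simply stand in for data held on two different servers.

```python
from collections import Counter
from functools import reduce

def count_words(chunk):
    """Pure function: the result depends only on the chunk passed in."""
    return Counter(word for line in chunk for word in line.split())

def merge(a, b):
    """Combine two partial counts into a new Counter; neither input is changed."""
    return a + b

# Tuples are immutable. Each inner tuple stands in for the data on one server.
chunks = (
    ("big data big", "data"),
    ("data velocity",),
)

# map and reduce are higher-order functions: each takes a function as a parameter.
partials = map(count_words, chunks)
total = reduce(merge, partials, Counter())

print(total["data"])  # 3
print(total["big"])   # 2
```

Because count_words has no side effects and the chunks cannot be modified, each chunk could be counted on a different machine, in any order, and the merged result would be the same.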

Fact-based models

Each fact within a fact-based model captures a single piece of information. Facts are immutable and are recorded with timestamps. Data is never deleted, so the data set continues to grow, but the timestamps allow us to discern what is current. Older data and the timestamps themselves are often not represented in a fact-based model.
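
As a rough sketch of the idea, the example below stores immutable, timestamped facts and finds the current value by picking the latest timestamp. The Fact class, its field names and the sample data are all assumptions made for illustration; they are not part of any particular system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen=True makes each fact immutable
class Fact:
    entity: str
    attribute: str
    value: str
    timestamp: int

def add_fact(store, fact):
    """Return a new store with the fact appended; nothing is overwritten or deleted."""
    return store + (fact,)

def current_value(store, entity, attribute):
    """Return the value of the most recent matching fact, or None if there is none."""
    matching = [f for f in store if f.entity == entity and f.attribute == attribute]
    if not matching:
        return None
    return max(matching, key=lambda f: f.timestamp).value

store = ()
store = add_fact(store, Fact("user1", "town", "Leeds", timestamp=1))
store = add_fact(store, Fact("user1", "town", "York", timestamp=2))

print(current_value(store, "user1", "town"))  # York
print(len(store))                             # 2, the older fact is still kept
```

When the user moves town, a new fact is added rather than the old one being updated, which is why the data set only ever grows.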

Graph schemas are used to capture the structure of a dataset. A graph schema is made up of nodes and edges, each of which may have its own properties. A node's properties may be shown inside the node, as on the left below, or attached by dotted lines, as on the right below. Relationships between nodes are shown as edges that are either directed (arrows) or undirected (plain lines).

Fact-based models have two main benefits. New nodes and relationships can be added without affecting the existing data. Also, whereas searching a huge relational database takes a very long time, with a fact-based model we can search by finding our starting node and following the links.
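
Both benefits can be sketched with a tiny graph held in two dictionaries. This is only an illustration: the node names, the properties, the edge labels and the helper functions reachable and neighbours are all invented for the example.

```python
from collections import deque

# Each node has its own properties.
nodes = {
    "alice": {"type": "Person", "age": 17},
    "bob": {"type": "Person", "age": 18},
    "cs": {"type": "Course", "level": "A-level"},
}

# Directed edges: node -> list of (relationship, target node).
edges = {
    "alice": [("friend_of", "bob"), ("studies", "cs")],
    "bob": [("studies", "cs")],
    "cs": [],
}

def neighbours(node, label):
    """Follow only the edges with the given relationship label."""
    return [target for rel, target in edges.get(node, []) if rel == label]

def reachable(start):
    """Breadth-first search: start at one node and follow the links outwards."""
    order = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for _rel, target in edges.get(node, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order

print(neighbours("alice", "studies"))  # ['cs']
print(reachable("alice"))              # ['alice', 'bob', 'cs']

# Adding a new node and relationship leaves the existing entries untouched.
nodes["dave"] = {"type": "Person", "age": 17}
edges["dave"] = [("friend_of", "alice")]
print(reachable("dave"))               # ['dave', 'alice', 'bob', 'cs']
```

The search only visits nodes that are linked to the starting node, so its cost depends on how much of the graph is connected to that node rather than on the size of the whole data set.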
