by Lili Simon
Oxford Dictionaries defines big data as “extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions”. The term is frequently thrown around in the context of evolving data management systems and of how we think about and use the information available about the world. By its nature, big data is an ambiguous concept that can take on a variety of meanings depending on the context. In particular, businesses seek to collect information about people’s habits in order to make better-informed decisions and optimize performance. Transactions, search histories, social media activity, videos, images, and countless other sources of information are recorded, tracked, and aggregated in huge databases that grow and evolve far too rapidly to be analyzed using conventional statistical methods. New analysis methods have been developed specifically to meet this challenge.
To businesses, access to all of the information they could ever dream of (and more) about their customers’ or clients’ needs and wants is extremely valuable. They have the potential to understand exactly what, when, and where customers are consuming, and to use that knowledge to target promotional material to its intended audience, optimize supply chains, predict market trends, and remain competitive by keeping close tabs on the operations of similar companies. Of course, this is just a small sampling of what data could do for businesses, and it does not account for the difficulty of actually obtaining data and extracting valuable information from it, which in nearly all cases poses a much larger challenge than anticipated.
Perhaps unsurprisingly, the volume and velocity with which data accumulates have increased by many orders of magnitude in a short timespan. This is why big data has only recently become a topic of such focus in the private sector, government, and academia: it simply has not been around for very long. To understand how the world arrived at a stage where data is quite literally central to every aspect of our lives, and where data has become the currency driving artificial intelligence, blockchain, and internet technologies, it is worth taking a retrospective look at the evolution of data throughout history. It is also through this lens that we can take a closer look at the opportunities currently hidden in big data and where it might be heading in the future.
The beginnings of data collection
In prehistoric times, humans had limited tools at their disposal to record aspects of their daily lives. The Ishango Bone, discovered in what is today the Democratic Republic of the Congo, is one of the oldest known mathematical artefacts. It is approximately 20,000 years old and features a series of notches that archeologists believe were used to track trading activity or an inventory of supplies. Such instruments would also have enabled people at the time to perform elementary calculations, for example to predict how long those supplies would last. The ability to record data removed a significant mental burden from people who would otherwise have had to store that information in memory, and as a result it allowed more information to be retained more accurately over time. For this reason, the simple innovation of keeping tally marks remained in use for counting and recording data for many years. Eventually, it evolved into more sophisticated methods of recordkeeping, evidenced by clay tokens found in Sumer (dated to about 5,200 years ago) which were sealed in clay balls with corresponding impressions on the surface. Later, clay tablets featured cuneiform signs for numbers and commodities to track stocks of goods. It was also around this time that the first libraries appeared, signifying humanity’s initial attempts at organizing and storing data en masse. The Library of Alexandria was the most ambitious of these endeavours: the largest collection of data in the ancient world, it is estimated to have contained up to half a million scrolls spanning much of the knowledge accumulated in the region.
As mathematical theory grew more advanced, so did the methods of recordkeeping it supported. Double-entry bookkeeping was first described in print by Luca Pacioli in 1494, in a section of his Summa de arithmetica titled Details of Calculation and Recording. For the next few hundred years, this text served as the primary reference and learning material in bookkeeping and accounting. Having constant access to information about debits and credits helped merchants make decisions that kept their businesses growing and profitable. Soon enough, mathematics and data collection had developed to the point where statistics could begin to emerge. In 1662, John Graunt published his book Natural and Political Observations Made upon the Bills of Mortality, which was an instant bestseller, and for good reason. Graunt used the annual number of christenings and burials recorded in London to estimate the population of the city. He also theorized that by recording information about mortality, he would be able to predict the spread of the bubonic plague through Europe, something that would have been well beyond the scope of data analysis previously. By 1700, many principles of probability were well understood, but the question of inference had not yet been addressed. Jacob Bernoulli’s weak law of large numbers began progress on this path and was later refined by other mathematicians. The method of least squares was developed by Gauss through his work on the normal distribution in 1809, and the Central Limit Theorem was published by Laplace in 1810. Laplace combined these two ideas in the Gauss-Laplace synthesis, which paired effective ways of combining data with the ability to quantify error accurately.
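In modern terms, the method of least squares fits a line y = a + bx to observed points by minimizing the sum of squared residuals. A minimal sketch in pure Python (the function name and data points are invented purely for illustration):

```python
# Minimal illustration of ordinary least squares for a line y = a + b*x,
# using the closed-form formulas that follow from Gauss's method.
# The function name and sample data are made up for this example.

def least_squares_line(xs, ys):
    """Return (intercept a, slope b) minimizing sum((y - (a + b*x))**2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b

# Points lying exactly on y = 1 + 2x recover that line exactly.
a, b = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
```

With noisy measurements the same formulas give the line that best balances the errors, which is precisely the combination of data aggregation and error quantification that the Gauss-Laplace synthesis formalized.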
These and other theoretical developments in probability and statistics laid the groundwork for more efficient real-life statistical projects. The biggest and most ambitious of these were government censuses, which became more standardized and commonplace around this time. The US Census Bureau ran into a problem following their 1880 census: they estimated that it would take 8 years to process all of the data collected and, worse, that the 1890 census data would take over 10 years to process, meaning it would already be outdated by the time the 1900 census began. This issue was eventually rectified by Herman Hollerith, an engineer who designed a tabulator and sorter using punched cards, which reduced the time needed to process the census results to three months. These machines were quickly adopted for censuses across the Americas and Europe.
Computers: processing and storing data
Early calculators and data processing machines like the one introduced by Hollerith were harbingers of what would soon become a new era characterized by computers. In fact, the first known mechanical computer dates back to circa 100 BCE: the Antikythera Mechanism, produced by Greek scientists, used around 30 gear wheels to compute results related to astronomical phenomena. Other significant early analog computers included Babbage’s Difference Engine (used to automatically produce mathematical tables), Thomson’s wheel-and-disc integrator (which could compute the integral of the product of two functions), and Bush’s differential analyzer (a general-purpose mechanical computer). These machines were extremely useful in automating calculations that would have been tedious and time-consuming to perform by hand, multiplying their users’ efficiency. They also held the promise of being developed into multi-purpose machines that could perform any operation given appropriate instructions.
The principle of the modern digital computer was described by Alan Turing in 1936. This (theoretical) computing machine worked with a program of instructions stored as symbols in its memory; the program dictated the actions of a scanner that could move through the memory and read and write symbols as necessary. The first electronic digital computer to be built was named Colossus and was used by British cryptanalysts in 1944 to decipher encoded German radio communications during World War II. After the war, those successes in breaking encryption prompted research into improving cipher security, and advances in cryptography made breaking encryption vastly more difficult, so that data could be stored and transmitted without the fear of it being stolen and deciphered. In particular, public-key cryptography relies on a publicly shared key that is used to encrypt messages, and on individual private keys, kept secret, which are used to decrypt messages encrypted with the public key. It is difficult to convey the sheer significance of these breakthroughs, but to demonstrate with a few examples, technologies such as digital document signing, secure email, and blockchain all build on them.
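The public/private key idea can be illustrated with a toy RSA example. This is only a sketch: the primes below are deliberately tiny and chosen for the example, whereas real systems use keys hundreds of digits long together with padding schemes.

```python
# Toy RSA, purely to illustrate the public-key / private-key relationship.
# The tiny primes here are for demonstration only; real keys are vastly larger.

p, q = 61, 53            # two secret primes
n = p * q                # modulus, shared as part of the public key
phi = (p - 1) * (q - 1)  # Euler's totient of n, kept secret
e = 17                   # public exponent (must be coprime with phi)
d = pow(e, -1, phi)      # private exponent: modular inverse of e mod phi (Python 3.8+)

def encrypt(message, public_key=(e, n)):
    """Anyone holding the public key can encrypt."""
    exp, mod = public_key
    return pow(message, exp, mod)

def decrypt(ciphertext, private_key=(d, n)):
    """Only the holder of the private key can decrypt."""
    exp, mod = private_key
    return pow(ciphertext, exp, mod)

ciphertext = encrypt(42)
assert decrypt(ciphertext) == 42
```

The security rests on the fact that recovering d from the public pair (e, n) requires factoring n into p and q, which is computationally infeasible for sufficiently large primes.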
Earlier in the 20th century, Austrian engineer Fritz Pfleumer had invented a method for storing sound magnetically on tape, which revolutionized the possibilities for storing data. In fact, the principle of magnetic storage is used to this day: a large share of the world’s digital data is still stored magnetically on hard disks. The development of computing technology and digital storage capacity, together with cryptography as a way to ensure data security, was a recipe for a boom in the amount of data collected and stored worldwide. By 1965, the US Government had made plans for the world’s first data centre, designed to store 742 million tax returns and 175 million sets of fingerprints on magnetic tape. Since then, the development of optical discs, flash drives, solid state drives, and other storage technologies has further reduced the cost of storing data, making it worthwhile to collect and retain the massive amounts of data that we are accumulating.
The internet age
The internet began in the 1960s as a way for US government researchers to transmit information between computers (which were large and difficult to transport) in a non-centralized network. This project, called the Advanced Research Projects Agency Network (ARPANET), was successful but limited in scope, because only certain organizations in direct contact with the Defense Department could participate. To standardize communication between computer networks, a new communications protocol, the Transmission Control Protocol/Internet Protocol (TCP/IP), was created. In essence, this new system allowed all computers connected to the ‘internet’ to communicate with one another. The TCP/IP technology developed on ARPANET was eventually applied to a distributed network that could handle greater traffic, allowing a greater number of organizations to connect their computers to the internet. By the 1990s, estimates suggest that the number of computers connected to the internet was doubling each year. It was also at this time that Tim Berners-Lee invented the World Wide Web, an information system in which resources are identified by uniform resource locators (URLs) and can be accessed over the internet.
Having computers connected to one another via the internet opened up a whole horizon of new possibilities for what computers could be used for, and it also increased the number of computers in household and business use. Google Search was launched in 1997, making it easier than ever for users to browse the internet for information. The impact of the internet and the World Wide Web on the world is probably impossible to overstate. Since its introduction, a vast share of the information that exists in the world has been uploaded to the internet. Almost any information that anyone could reasonably want to know is available there; official communications, financial transactions, and social interactions take place through the internet; photos and videos get synchronized to it; and many people spend the overwhelming majority of their studying or working time online. The amount of data generated from this activity alone is difficult to comprehend in any meaningful way, and data is also being generated at increasing rates from other sources: CCTV cameras record everyone who passes through certain public spaces, fingerprints and other biometric information are collected and stored by governments, and GPS devices can be used to track one’s location at all times.
Even thinking about all of the data that exists is overwhelming, let alone trying to extract valuable information from it. Big data is the culmination of this evolution of data collection and storage, and as many times before, we are faced with the challenge that our tried-and-true data processing methods simply cannot handle data at such a scale. That being said, there have been remarkable improvements in our ability to make sense of big data. Firstly, simple statistical descriptions such as averages, variances, and graphics can be computed for big data and used to interpret it. Beyond that, one commonly used approach is classification or clustering: dividing the data into groups based on its inherent properties, so that elements in the same group share as many characteristics as possible and elements in different groups share as few as possible. Another approach is to find frequent item sets, where the objective is to identify elements that appear disproportionately more often than others in the data. Machine learning and artificial intelligence techniques are also popular approaches to handling big data. Finally, data compression is frequently pursued, because if the amount of data can be reduced without losing much information, the smaller storage footprint and increased processing efficiency make it much easier to analyze in greater detail.
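The clustering idea described above can be sketched with a tiny one-dimensional k-means: repeatedly assign each value to its nearest centroid, then move each centroid to the mean of its group. The data, starting centroids, and function name below are invented for illustration; real big-data clustering runs distributed versions of this same loop over millions of multi-dimensional records.

```python
# A minimal 1-D k-means sketch: group values so that members of a cluster
# are as close to their cluster's centroid as possible.
# All names and numbers here are illustrative, not from the original text.

def kmeans_1d(values, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two obvious groups: one around 1-3, one around 10-12.
centroids, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], centroids=[0.0, 5.0])
```

After a couple of iterations the centroids settle at the group means (2.0 and 11.0), realizing the stated goal: maximal similarity within a group, minimal similarity between groups.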
The future of Big Data
Given all of the potential big data has to unlock coveted information for businesses and governments, enabling them to make better-informed decisions and operate much more efficiently, it seems almost inevitable that big data will remain the centre of attention. Following the recent trajectory of societal trends, the amount of data generated daily will continue to increase, and there will be continued demand for techniques to process and understand that data in order to extract information from it. It also seems likely that research into big data analysis will make it steadily more effective and efficient, meaning that the potential currently hidden in big data will gradually be unlocked. This will benefit the governments and large corporations which have access to that data, and will likely also benefit the general public to some extent. Much like navigation systems help us avoid traffic by using data from other people’s GPS locations, there are myriad ways in which having massive amounts of aggregated data can make individuals’ lives more convenient.
It is tempting to fantasize about the potential benefits big data can offer us without taking a critical look at its costs. There are many concerns about privacy and the ethics surrounding data collection. It is enough to consider the fact that most people walk around with smartphones equipped with GPS receivers, microphones, and cameras to begin to grasp where privacy issues stem from. The capacity we have developed to collect data at a large scale comes at the expense of privacy for each individual person whose location is being tracked, whose search history is being recorded, and whose camera and microphone are being tapped into. Even if those who collect this data do so with good intentions, which is a strong assumption to make, there is still potential for that data to be stolen and exploited by parties that are self-interested and not accountable to governments or the general public. Additionally, data is not always as objective an imprint of the world as it may seem. The way in which data is collected, or even the way in which society operates, can encode systemic biases into data. If data is not evaluated critically in the appropriate context, the biases present in it can be reinforced by decisions that are made based on it, causing real harm to people. For example, there have been cases where data about crime rates in given neighbourhoods of a city was used to decide where to send more and where to send fewer police officers. Sending additional police officers to patrol high-crime areas increased the recorded crime rates there in comparison to areas with fewer police officers, even though the difference in the actual number of crimes may not have been as large as the data suggested. High-crime neighbourhoods are typically neighbourhoods with high poverty rates and high proportions of minorities, so taking the data at face value without looking into its context is likely to harm some of the most vulnerable members of society.
We cannot and should not continue using big data without understanding where it comes from, and we have to address the privacy and ethical concerns surrounding big data collection to avoid unsavoury outcomes.
Those looking to take advantage of the opportunities that big data has to offer should take a few considerations into account. The 5 Vs of data provide a good overview of the challenges of working with big data: volume, velocity, variety, veracity, and value. There is a lot of data, it is accumulating rapidly, it is frequently unstructured, it can be inconsistent, and in and of itself it is not useful. In order to achieve success when working with big data, these challenges need to be overcome. Firstly, and perhaps most importantly, you cannot let the volume of data and the endless arsenal of data analytics tools paralyze you. You have to choose the best tools available to you and make the most of the data rather than perpetually stress over other avenues you could have taken. Secondly, the new data infrastructures being developed for big data need to be clean and simple. The value in big data is that it has the potential to provide general insights about large systems, so achieving that should be at the forefront of any new infrastructure. Although it is tempting to develop overly complicated methods which yield complex and difficult-to-understand insights, this is ultimately not useful. If there is a lesson to be taken away from the history of data and how it has developed, it is that the simplest and cleanest solutions to problems tend to be the most valuable.