Big Data and Privacy

By: Nathan Daniels. Reading time: 14 minutes. Updated: 04-28-2022

Over the last few decades, the world has changed tremendously in many regards, especially when it comes to IT. The number of people we are able to communicate with on a daily basis has grown enormously, just like the amount of information we have access to. However, the same is true for the amount of information big companies collect about us. Terms such as “big data” are used more and more frequently. But what does this mean, exactly? What is big data? Is it dangerous? How does it affect our privacy, if at all? Those are some of the questions we’ll cover in this article.

What is Big Data?

The term “big data” describes the enormous quantities of (personal) data which are continuously being gathered by different actors. An example would be all of the information Google gathers about its users’ search queries. The phenomenon of big data is a relatively recent development that started because (large) companies and organisations, such as Facebook, Google and most governments, began to gather far more data about their users, customers and citizens than before. New technologies, a digitized world and the internet have aided this development immensely.

Collections of big data are often so vast that it’s impossible to analyse them using traditional data analysis. However, if big data is analysed the right way, interesting patterns can be found and conclusions drawn. For instance, big data is often used for large-scale market research: which products are most likely to be purchased? What kind of advertising is most effective when you want to reach and persuade customers?

In order for a dataset to be considered big data, it should usually meet the following three criteria, also known as the 3 v’s:

Volume: Big data is anything but a small sample. It involves vast collections of data, resulting from long, continuous observation.
Velocity: This has to do with the impressive speeds at which big data is collected. Moreover, big data is often accessible in real time (as it is being gathered).
Variety: Big datasets often contain many different types of information. Data within big datasets can even be combined to fill in any gaps and make the dataset more complete.

Aside from these 3 v’s, big data has some other characteristics. For example, big data is great for machine learning: it can be used effectively to teach computers and machines certain tasks. Moreover, as we’ve already briefly touched upon, big data can be used to detect patterns. This is mostly done by computers, which can work through such volumes of data far more efficiently than people can. Finally, big data is a reflection of users’ digital footprints. This means it’s a by-product of people’s digital and online activities and can be used to build individual personal profiles.

Different Kinds of Big Data

There are different ways to classify big data. The first way, which is used most frequently, differentiates big data based on the kind of data that is being collected. The three categories used for this type of classification are structured, unstructured and semi-structured big data.

Structured: When big data is structured, it can be saved and presented in an organised and logical way, making the data more accessible and easier to comprehend. An example would be a list of customer addresses created by a company. In this list, one would likely find customers’ names, addresses and maybe other details such as phone numbers, all structured clearly in, for example, a chart or table.
Unstructured: Unstructured big data is not organised at all. It lacks a logical presentation which would make sense to the average human being. Unstructured big data doesn’t have the structure of, for instance, a table that denotes a certain coherence between the different elements of the data set. Hence this type of data is quite difficult to navigate and comprehend. Many datasets initially start out as unstructured big data.
Semi-structured: Semi-structured big data, as you might have guessed, has characteristics of both structured and unstructured big data. The nature and representation of this type of data aren’t completely arbitrary. Yet it isn’t structured and organised enough to be used for a meaningful analysis, either. An example would be a web page that contains metadata tags (extra information which isn’t directly visible in the text), such as keywords. These tags expose specific bits of information, such as the author of a page or the moment it was placed online. The text itself is essentially unstructured, yet the keywords and other metadata help to make it a somewhat suitable basis for analysis.
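
To make the web page example a bit more tangible, the sketch below pulls the metadata tags out of a small HTML page using Python’s standard html.parser module. The page content, the tag names (author, date, keywords) and the helper names MetaTagExtractor and extract_meta are made up for illustration; real pages can use very different metadata.

```python
from html.parser import HTMLParser


class MetaTagExtractor(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Only <meta> tags carry the structured part we are interested in.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content:
            self.meta[name] = content


def extract_meta(html):
    """Return the page's metadata tags as a dictionary."""
    parser = MetaTagExtractor()
    parser.feed(html)
    return parser.meta


# Hypothetical page: the body text is unstructured, the meta tags are not.
SAMPLE_PAGE = """
<html>
  <head>
    <meta name="author" content="Jane Doe">
    <meta name="date" content="2022-04-28">
    <meta name="keywords" content="big data, privacy">
  </head>
  <body><p>Free-form article text that has no fixed structure.</p></body>
</html>
"""

if __name__ == "__main__":
    print(extract_meta(SAMPLE_PAGE))
```

The body text stays a blob of free-form text, while the metadata ends up in a dictionary that could be stored in a table. That split is what makes the page semi-structured.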

Classification based on the source of big data

Another common way to distinguish between different kinds of big data is by looking at the source of the data. Who or what has generated the information? Like the previous classification, this one consists of three categories.

People: This category concerns big data generated by people. Examples would be books, pictures, videos as well as information and (personal) data on websites and social media, such as Facebook, Twitter, Instagram, and so on.
Process registration: This category includes the more traditional kind of big data, which is gathered and analysed by (big) companies to improve certain processes in a business.
Machines: This type of big data results from the ever-growing number of sensors that are placed in machines. An example would be the heat sensor that is often built into computer processors. The data generated by machines can often be very complex, but at least this type of big data is generally well-structured and complete.

What Can Big Data Be Used For?

Everything discussed so far might still sound somewhat abstract. Let’s make things a little bit more concrete and discuss some real-life applications of big data. After all, there are many ways in which companies and organisations use big data. One of the first things that comes to mind is the enormous amount of data companies gather about us. Facebook collects data on all of its users and analyses it to decide what to show you on your timeline. Of course, this is done to cater to your personal wishes and interests. Facebook hopes this will get you to stay on its website for longer periods of time. Similarly, Amazon gathers information about its customers and the products they buy. That way, Amazon can recommend products it thinks you’ll be interested in and so increase its earnings.

However, big data is also used in ways completely different from the commercial strategies described above. For instance, public transport companies can gather data about how busy certain routes are. Afterwards, they can analyse this data to decide, for example, which routes require additional buses or trains. Another well-known case of effective use of big data concerns international delivery giant UPS. UPS uses special software which was developed after big data analysis. This software helps UPS drivers avoid left-hand turns, which in right-hand traffic are costlier, more wasteful and more dangerous than right turns. Supposedly, this system has already saved UPS millions of gallons of fuel, all thanks to big data.
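
As a rough illustration of the public transport example, the sketch below averages how full each route is and lists the routes that might need extra capacity. The trip records, the 90% threshold and the function names are all hypothetical; a real operator would work with far more data and more refined criteria.

```python
from collections import defaultdict


def average_occupancy(trips):
    """Return the average occupancy (passengers / seats) per route."""
    passengers = defaultdict(int)
    seats = defaultdict(int)
    for route, boarded, capacity in trips:
        passengers[route] += boarded
        seats[route] += capacity
    # Assumes every route has at least one seat in total.
    return {route: passengers[route] / seats[route] for route in passengers}


def routes_needing_capacity(trips, threshold=0.9):
    """List routes whose average occupancy meets or exceeds the threshold."""
    occupancy = average_occupancy(trips)
    return sorted(route for route, share in occupancy.items() if share >= threshold)


# Hypothetical trip records: (route, passengers boarded, seats available).
TRIPS = [
    ("Line 1", 95, 100),
    ("Line 1", 100, 100),
    ("Line 2", 40, 100),
    ("Line 2", 55, 100),
    ("Line 3", 88, 100),
    ("Line 3", 96, 100),
]

if __name__ == "__main__":
    print(routes_needing_capacity(TRIPS))
```

With this sample data, Line 1 and Line 3 are on average more than 90% full, so they are the ones flagged for additional buses or trains.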

Another interesting example of big data gathering can be found in DNA tests and websites such as MyHeritage DNA. This website claims it can help you “uncover your ethnic origins and find new relatives” with a simple DNA test. Needless to say, this process involves a lot of data gathering and cross-referencing, which makes the company another major player in big data gathering and usage. “Traditional”, physical DNA tests also involve a huge amount of data, since the companies that conduct these tests obtain extremely large datasets about many people. Of course, it’s important to be aware of the possible risks that come with these data gathering processes. These risks will be highlighted in the next part of this article.

Is Big Data Dangerous?

As shown above, big data can be incredibly useful in many cases. It provides us with tons of information we can use to streamline processes and make companies more efficient and profitable. However, this doesn’t mean gathering and using big data is completely risk-free. There are five important risks that come with big data. We’ll be discussing all five here.

Hackers and thieves

With everything we do online, there’s an inherent risk that our personal data and information on our internet activities could be stolen. Every internet user has to be aware of this. The number of data leaks and thefts has increased drastically over the past few years. There are often stories in the news about criminals selling datasets containing passwords and other information in places such as the dark web. Often, these datasets are stolen from official websites, companies and organisations. The bigger these datasets are, the more interesting it becomes for thieves to try to obtain them. If they get their hands on these datasets, they could cause a lot of problems. Needless to say, this could also greatly compromise your privacy.

Privacy

The practice of gathering personal data is becoming more and more widespread. However, the current privacy regulations can’t keep up with the rapid developments in technology that make this practice possible. This leaves space for grey areas and uncertainties that can’t be solved by looking at the law. Important privacy concerns that arise include: What kind of data is allowed to be collected? About whom? Who should have access to this data?

When large amounts of data are collected, the chances are high that sensitive personal information is included in those datasets. This is problematic, even when hackers and thieves aren’t at play. After all, privacy-sensitive data could be abused by anyone with ill intentions. This includes (malicious) companies and organisations.

You should know that even internet service providers (ISPs) collect a lot of information about their users, which they sometimes sell or pass on.

Poor data analysis

Many companies and organisations collect big data, because they can use it for interesting analyses. This might give them important new insights into whatever they’re researching (like, for example, consumer habits). In turn, these insights and conclusions could translate to changes within the company that result in higher margins and more profit. However, just like with any other normal dataset, an incorrect analysis of big data can have serious consequences. After all, an improper analysis can easily lead to wrong conclusions. These can in turn translate to ineffective or even counterproductive measures being taken.

Gathering the “wrong” data

Big data is becoming increasingly popular and organisations are more and more willing to collect all sorts of data. This means gigantic amounts of data are being collected without there being a clear reason for analysing them. In other words, it creates a huge database of raw information that has been gathered just in case. Companies are likely thinking it’s easy enough to gather all that data, so they might as well do it. Needless to say, this isn’t good for anyone’s privacy. It could even lead to irrelevant or “wrong” data being gathered and analysed. If the conclusions drawn from this analysis are used in management, it could lead to the same ineffective measures mentioned in the previous paragraph.

Collecting and saving big data with ill intentions

Companies, organisations and governments collect big data more and more often so they can build accurate individual profiles of people. Users or citizens are hardly ever notified about which of their personal data is being registered, let alone why and how. Needless to say, this has serious implications for their online privacy. Everything they do online can be saved and viewed later. Moreover, big data collectors could easily influence and manipulate people’s decision making by analysing and using the collected data.

Big Data and Privacy

As you’ll probably understand by now, big data comes with a lot of disadvantages and risks. Nevertheless, many companies and organisations still collect data on a huge scale, mostly because of how it can help them grow and advance. Collecting big data is easier than ever before. This has huge consequences for our privacy. We’ve already briefly discussed the privacy dangers of parties collecting data with ill intentions. Since our privacy is so closely tied to the mass collection of personal data, we want to use this section to discuss the different privacy concerns that come with big data.

Large-scale data collection

Lots of companies, including Google, Facebook and Twitter, are heavily dependent on advertisements to sustain themselves and make a profit. To make these ads as effective as possible, these companies build detailed profiles of their users, especially taking their likes and interests into account. This is a form of big data. Likewise, governments and secret services depend on big data. They use this vast amount of information to track and investigate people they deem suspicious. Of course, this also means there’s a lot of big data for cyber criminals to get their hands on and maybe even manipulate and abuse. This can create all sorts of privacy and identity-related problems. One that comes to mind is identity theft.

Still, the possibilities that come with collecting data in databases are much broader than this. These days, technology has become so advanced and “smart” that it can combine datasets. This can be done in such a clever way that large corporations and organisations likely know more about you than you do! Who you are, where you live, what your hobbies are, who your friends are: none of this information will be private any longer. Not a very comforting thought, you might think. Fortunately, there are some ways to protect yourself from the large-scale privacy infringement big data can cause.
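
How little it takes to combine datasets can be shown with a short sketch. It assumes three unrelated services that each hold one harmless-looking fact about the same person, and that an e-mail address serves as the shared key. The services, records and the combine_profiles helper are invented for this example; real data brokers use far more sophisticated matching.

```python
def combine_profiles(*datasets):
    """Merge several datasets into one profile per person, keyed by e-mail."""
    profiles = {}
    for dataset in datasets:
        for record in dataset:
            # Normalise the key so small spelling differences still match.
            key = record["email"].strip().lower()
            profile = profiles.setdefault(key, {})
            for field, value in record.items():
                if field != "email":
                    profile[field] = value
    return profiles


# Hypothetical records held by three unrelated services.
WEBSHOP = [{"email": "alex@example.com", "address": "1 Main Street"}]
SOCIAL = [{"email": "Alex@Example.com", "friends": ["Sam", "Kim"]}]
FITNESS_APP = [{"email": "alex@example.com ", "hobby": "running"}]

if __name__ == "__main__":
    print(combine_profiles(WEBSHOP, SOCIAL, FITNESS_APP))
```

Each dataset on its own reveals little, but the merged profile contains an address, a circle of friends and a hobby for a single person.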

Laws on privacy

Privacy laws and regulations can protect us against privacy infringement, but only up to a certain extent. To make matters more complicated, privacy laws often differ greatly between different countries and regions. For instance, the European Union has a relatively strict consumer privacy law, the General Data Protection Regulation (GDPR). This law applies in all EU member states, although the details might differ per country. Many international companies have decided to bring all of their business in line with the GDPR. This is why Google, for example, now allows users to request a deletion of personal information. However, privacy laws in the United States differ from state to state and don’t protect consumers as well as the GDPR does. Unfortunately, this is even true for the toughest privacy law in the US, the California Consumer Privacy Act (CCPA).

In short, there’s no such thing as a strong “global” privacy law that applies to all big data collectors and protects all users. This means big data collectors can harm our privacy not just in illegal ways, but even in perfectly legal ones, as paradoxical as this may sound. Fortunately, large-scale privacy infringements exposed by whistleblowers like Edward Snowden and Chelsea Manning have greatly increased awareness of the risks of big data. Of course, this is only a first step in improving current privacy laws.

Many internet users aren’t willing to wait for privacy laws to improve, and rightfully so. Rather, they want to take action themselves by doing whatever they can to protect their privacy. Do you want to avoid becoming part of countless big datasets as well? There are several tips and tricks to help you on your way.

How to Keep Your Data From Being Saved in Big Datasets

Big datasets seriously affect your privacy and security. These datasets might contain all sorts of (personal) information, which could be abused by big companies or even cyber criminals. That’s why you should always make sure to leave as little of an online trace as possible. The following tips can help you accomplish this:

Try to minimize the use of your personal information when creating passwords or in general on the web. For instance: avoid using your name, address, phone number, date of birth, and so on.
Always remember the following: everything you publish on the internet will be there forever. This might not always be completely true, but this level of caution does help safeguard your privacy. You’ll automatically handle your private data with more care once you’re aware of this fact.
Make sure your internet connection is secure and anonymized, for example by using the Tor Browser or a VPN.
Use one or several ad-blockers in your browser.
Use one or more browser plug-ins which block trackers and cookies.
Regularly clear your cache and delete your browsing history and cookies.
Log out of websites when you’re not actively using them.
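
The first tip, keeping personal details out of your passwords, is easy to check for yourself. The sketch below is one simple way to do that, assuming you supply your own list of details; the function name and the sample values are hypothetical, and the check only catches literal matches, not clever variations.

```python
def personal_details_in_password(password, personal_details):
    """Return the personal details that appear inside the password."""
    normalized = password.lower()
    found = []
    for detail in personal_details:
        # Compare case-insensitively and without spaces ("Main Street" -> "mainstreet").
        token = detail.lower().replace(" ", "")
        # Ignore very short tokens, which would match almost any password.
        if len(token) >= 3 and token in normalized:
            found.append(detail)
    return found


if __name__ == "__main__":
    details = ["Nathan", "1990", "Main Street"]
    print(personal_details_in_password("Nathan1990!", details))
```

If the returned list is not empty, the password contains information that others could find out about you, so it’s worth picking a different one.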

Taking these steps is a good start when it comes to safeguarding your online privacy and security. Keep in mind, however, that big data is collected in many different ways, not just online. In short, wherever you are and whatever you’re doing, you should always be vigilant and try to protect your (personal) data from big data collectors.

Nathan Daniels Author

Tech journalist

Nathan is an internationally trained journalist and has a special interest in the prevention of cybercrime, especially where vulnerable groups are concerned. For VPNoverview.com he conducts research in the field of cybersecurity, internet censorship, and online privacy. He also contributed to developing our rigorous VPN testing and reviewing procedures using evidence-based best practices.