Data Broker Practices and Privacy Ethics: How to Take Back Control of Personally-Identifiable Information

Abstract

Data mining can be dangerous for users who trust organizations to maintain records or sensitive information. There is no federal law within the United States (U.S.) that prohibits data mining operations that target personally-identifiable information (PII) and place it for sale. Data mining is increasing, and more U.S. companies are engaging in the practice of using PII to target advertisements to possible users of interest. The purpose of this research was to discuss what data mining is, define what types of organizations use PII and data mining for positive or negative purposes, suggest recommended publications for organizations to follow to enhance security and privacy practices, and provide methods to remove PII from search engines and public directory platforms. This research examined key elements of data mining and how organizations can profit by selling mined data. This research found that there is no U.S. federal law that prohibits a company from using PII for gain, although specific industries, such as healthcare and finance, must abide by regulations such as the Health Insurance Portability and Accountability Act (HIPAA) or the Federal Financial Institutions Examination Council (FFIEC) guidelines. This research recommended that organizations that handle sensitive PII on a large scale adopt recommended publications, such as the National Institute of Standards and Technology (NIST) SP 800-122 and/or the ISO/IEC 27001 control series. This research also recommended numerous ways for a user to remove PII from public directories and search engines.

Statement of the Problem

Mining, purchasing, and selling personal or personally-identifiable information (PII) for profit is a complex task in which many companies in the United States (U.S.) engage (Kumar & Reinartz, 2018). U.S. companies mine or buy data that can include PII to advertise and market products the company believes a consumer is likely to purchase based on his or her behavior on the internet (Zhang, 2018). Data mining of PII has become increasingly targeted, extracting as much information about a user as possible, and it carries increasing, if not alarming, security and privacy implications of which every user should be aware. To understand data mining, users must first understand how PII is targeted, collected, and sold.

The Internet

In an always-on, machine-powered world, the internet represents one of the most complex creations mankind has produced to date, and it lends itself to many privacy risks for the individuals who use it (Chen, Beaudoin & Hong, 2016). Avoiding the internet, however, is not always an option, as it has become the way most users sustain a comfortable way of life. On August 6, 1991, Tim Berners-Lee published the first-ever website, marking the beginning of the internet's public availability (Nix, 2018). Public availability of the internet changed the world, permitting those who could afford it to message people across the U.S., purchase items, and enjoy the convenience of online bill pay just a few months later. As more users slowly adopted the internet into everyday life, many realized that privacy was never a focus of the internet's design, and the internet became a breeding ground for criminal activity known as cybercrime (Chen, Beaudoin & Hong, 2016). Cybercriminals look for easy and exploitable information and often begin with PII (Chen, Beaudoin & Hong, 2016).

In 2019, 4iQ, an identity intelligence and attribution firm, surveyed 2,800 people living in the U.S. and concluded that 44 percent of Americans believed their PII had been stolen as a direct result of a data breach (4iQ, 2019). A similar finding indicates that one in three U.S. adults frequently worries about identity theft and believes their data has been stolen (NJ Business, 2020). A strong majority, 63 percent, of the survey participants believed that prior data breaches could lead to future identity theft and fraud, and 73 percent of the participants believed they had already fallen victim to fraud or identity theft (4iQ, 2019). Breaches happen because the specific data being targeted is often PII, and PII has value (Bush, 2016).

What is PII?

Personally-identifiable information is defined by the U.S. General Services Administration as any representation of information that can be used to potentially trace or identify a person, either alone or when combined with other information (General Services Administration, 2020). Personally-identifiable information can be organized and classified into an array of data points (General Services Administration, 2020). The types of data points that PII can include are a person’s name, social security number, driver’s license, passport number, any other government identification numbers, financial information, medical information, disability information, biometric identifiers, citizenship status, gender orientation, ethnicity, sexual orientation, account passwords, birthday, place of birth, telephone numbers, email, mailing address, home address, religious preference, criminal history, mother’s maiden name, marital information, child information, emergency contact information, employment information, educational information, and military records (General Services Administration, 2020). Some PII records are stored in databases that are connected to web platforms that are accessible by the public. Many companies market mined data that often includes PII to vendors who wish to target their consumers with advertisements (Crain, 2016).

Web Platforms                                   

Many companies within the U.S. create websites for users to register on and further utilize platform features, such as purchasing items, chatting with other users, and playing games. When a user registers on a web platform, the user creates a username and password that belong to them. Often, platforms ask for more information, such as first name, last name, address, and, in many cases, a mobile number. Other data fields can be used for services, such as newsletter opt-in, coupon distribution, or a nickname field. If a user decides to purchase items online, the user must sign up on the website and provide PII that is highly sensitive (Crain, 2016).

There is confusion about responsibility for PII protection between the private and public sectors. From a legal perspective, protecting PII and privacy is either a shared responsibility between the organization and the individual, or the organization may have no responsibility at all (“Customer Information and Privacy,” 2020). For example, Amazon Web Services (AWS) operates under a model in which responsibility is shared between Amazon and the organization or individual using AWS (Bennet & Robertson, 2019). Many federal and private companies share the responsibility of protecting PII with employees. As an example, there may be a warning banner stating that the system an employee is about to access is sensitive or classified. A warning banner, or system disclosure, is a message the user must accept before being prompted to log in to the system, which forces the end-user to accept responsibility (Amran et al., 2018). A warning banner means that both the platform and the PII owner must be diligent in ensuring that the information is protected (Amran et al., 2018).

Private sector companies within the U.S., including web platforms, have the option to implement controls and guidelines according to company or individual needs, and those needs largely depend on the industry in question (Federal Trade Commission, 2020). For example, the Health Insurance Portability and Accountability Act (HIPAA) directs that any medical PII in transmission or at rest be encrypted. If an organization is caught storing or transmitting such PII in an unencrypted format, the organization could be subject to hefty fines (Federal Trade Commission, 2020). Nearly every website a user can access offers an option to sign up for extra access, and those sign-ups almost always ask for some form of PII during registration. At the low end, a first name and email address are required, while at the far end, data such as a home address and phone number are required. As per the Federal Trade Commission (2020), non-government companies and web platforms outside such regulated industries are not held to any standards when it comes to the protection of PII.

Data Broker Practices

As consumers browse the internet for goods and services, they entrust a given company with their PII, with the hope that the company will protect their information to the best of its ability, whether or not the user entered the information directly on the web platform (Chalk, 2020). Consumers fail to realize that a data breach does not have to occur for their PII to exit a company (Turow, Hennessy & Draper, 2018). Many companies within the U.S. willingly participate in purchasing and selling consumer data while consumers are unaware of what is taking place (Turow, Hennessy & Draper, 2018). The practice of collecting information about consumers from a variety of public and non-public sources, such as cookies and loyalty card programs, with the intent of selling the information for profit is known as data brokering (Turow, Hennessy & Draper, 2018). Data brokers create a profile for each consumer, much like a Facebook or Twitter profile, and store it in a private database (Crain, 2016). Data brokers use third-party sourced information to fill in data points on each profile (Crain, 2016). Oracle’s data broker entity, Data Cloud, releases a full buyer guide each year to update what type of data is for sale. Data types in the data broker industry are classified into six categories: behavioral, interest, intent, in-market, social, and mobile (Crain, 2016).

How Data Brokers Obtain Their Data

Data brokers can obtain part of their dataset from a user’s own supplied information, known as user input. Direct user input data on a web platform can consist of the first name, last name, email address, and residential information required to complete the registration process on a given website (Chatzimpyrros et al., 2020). Many companies require such accurate information to ship items directly to the consumer (Chatzimpyrros et al., 2020). The other type of information a data broker agency will use is data that did not come from the user’s direct input.

Indirect methods, such as tracking cookies and scripts, are embedded into the platform and are designed to track user activity and behavioral interaction within the website (Funkhouser et al., 2018). Some examples of indirect data points are product interests, how a user interacted with certain advertisements, and what time of day the user is most active while logged in (Funkhouser et al., 2018). When a consumer browses a product online but does not purchase it, that activity is tracked, stored, and projected elsewhere because the collected data point indicated an interest in that product (Chatzimpyrros et al., 2020). If the consumer later visits a completely different website, they may see an advertisement for the same product they were viewing earlier (Funkhouser et al., 2018).
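To make the mechanism concrete, the following is a minimal sketch, not any specific vendor's tracker, of how a server-side script might assign a persistent tracking cookie and log behavioral data points. It uses the Python Flask framework; the route and event log are hypothetical.

```python
import uuid
from flask import Flask, request, make_response

app = Flask(__name__)
EVENT_LOG = []  # in a real tracker, a database or analytics pipeline

@app.route("/product/<product_id>")
def product_page(product_id):
    # Reuse an existing tracking ID, or mint a new one on first visit.
    visitor_id = request.cookies.get("visitor_id") or str(uuid.uuid4())
    # Record the indirect data point: which product this visitor viewed.
    EVENT_LOG.append({"visitor": visitor_id, "viewed": product_id})
    resp = make_response(f"Product {product_id}")
    # A persistent cookie links future visits back to the same profile.
    resp.set_cookie("visitor_id", visitor_id, max_age=60 * 60 * 24 * 365)
    return resp
```

Because the cookie persists across visits, every later page view can be appended to the same behavioral profile, which is exactly the kind of indirect data point described above.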

Legislation Gaps

There is no federal legislation limiting the type of PII that non-government companies can sell, but there are regulations that direct how data should be handled by companies within certain industries. For example, the HIPAA Privacy Rule is legislation currently in place that directs medical organizations to handle PII with extreme care and sets forth national standards for the protection of certain healthcare information (HIPAA Privacy Rule, 2002). However, HIPAA does not prevent all disclosure of information to third parties and, under certain circumstances, could allow accidental data disclosure in the form of data loss (Iguchi et al., 2018). Under HIPAA, the U.S. government can use and disclose protected medical information without an individual’s consent (Iguchi et al., 2018). When required by law, covered entities such as research organizations, health oversight organizations, or any essential government department can use PII without consent and without the individual’s knowledge (Iguchi et al., 2018). More broadly, U.S. law does not prevent the sale of social security numbers to third parties (McCallister, Grance & Scarfone, 2020).

One example of user data being displayed on the internet and inviting the threat of cybercrime is Whitepages and similar platforms. Whitepages is one of the largest online public directories and, according to its website, is utilized by 35 million people each month. As a result of such high traffic and usage, PII for millions of individuals is available for anyone to view (Waddell, 2017). Whitepages often includes information such as name, address, telephone number, and connected associates, all of which are classified as PII. There is no legislation preventing easy access to this sensitive information broadcast for public view on platforms such as Whitepages. Figure 1 represents a Whitepages profile with accurate data for the search “Jordan Vashey.”

REDACTED FIGURE 1 IMAGE

Figure 1. Jordan Vashey’s Whitepages.com profile. The profile displays accurate PII. Adapted from “Whitepages.com”, Copyright 2020 by Whitepages, Inc.

The address in Figure 1 is an accurate portrayal of what a cybercriminal could view if the term “Jordan Vashey” is entered into a search engine. Many platforms like Whitepages purchase PII from data brokers with the intent of making it public to generate profit (Boerman et al., 2017). Cybercriminals look for sources of information on the internet that can be exploited (Ducasse, 2017). Personal data is a high-value commodity for criminals seeking to obtain social identities on the dark web (Boerman et al., 2017).

Kaveh Waddell (2017), a technology reporter for The Atlantic, reported that “websites like FamilyTreeNow.com and Whitepages.com make it easier for criminals to stalk someone online.” Platforms such as FamilyTreeNow and Whitepages display any information possible to form a complete profile (Waddell, 2017). Legal limits exist regarding how data can be used, but not how the data falls into the possession of an organization like FamilyTreeNow. In the U.S., the Equal Employment Opportunity Commission (EEOC) states that there are legal limitations on using certain PII to evaluate someone for a job opening (Fair Credit Reporting Act, 1970). For instance, it would be illegal for an employer to use PII such as marital status, which is often displayed publicly on social media or public directory platforms, to determine whether a candidate is eligible for a job (Fair Credit Reporting Act, 1970). According to FamilyTreeNow’s Privacy Policy, much of the information displayed on the platform comes from purchased data that is then stored on the company infrastructure. According to reporter Sergiu Gatlan (2019), a web platform administrator who implements a privacy policy is not required to do so by law. Aaron Smith of Pew Research reported that 50 percent of survey participants did not know what a privacy policy is (Smith, 2014). Data can leave an organization either intentionally or unintentionally. An unintentional method could be a company that suffers a breach and has PII stolen, which could later be put on the dark web for sale (AlKhatib & Basheer, 2019). It is difficult for U.S. companies to secure PII because datasets can be so large and complex, which leads to both intentional and unintentional PII leaks (Vimala, Roselin & Nasira, 2020).

PII Policy and Technology Gaps

Many policies exist as frameworks to assist organizations in protecting PII, but these policies are not mandatory unless the organization is a federal agency within a specific industry. The National Institute of Standards and Technology (NIST) has many policies and frameworks dedicated to cybersecurity that companies may use freely and openly. The NIST Cybersecurity Framework v1.1 is one of the leading cybersecurity guidelines in the U.S., yet companies still fall victim to cyber-crimes by the hundreds daily. Seventy-five percent of organizations in the U.S. suffered a cyber-attack from an outside actor in 2018, while 25 percent of U.S. companies experienced an attack from an internal actor (The Council of Economic Advisers, 2018). Most U.S. companies are failing to implement baseline controls to prevent basic cyber-attacks, resulting in unprotected PII (Goodwin & Smith, 2020).

As a result of the lack of enforced, mandatory security and privacy policies in the U.S., phishing and social engineering, the most attempted methods of cyber-attack, struck 63 percent of U.S. businesses in 2018 (Sobers, 2020). Only five percent of company share folders were properly protected in 2019 (Sobers, 2020). Data breaches exposed 4.1 billion records in the first half of 2019, and many contained PII (Sobers, 2020).

As another example, 500 million Marriott-Starwood consumers experienced a data breach in 2018 that included PII, such as names, addresses, and financial information (Sobers, 2020). In 2019, Verizon reported that 15 percent of data breaches involving stolen PII belonged to the healthcare industry (Verizon, 2019). In 2019, the U.S. also witnessed 15 percent of data breaches caused by authorized users, or users who had ample access to a system or network and caused a breach unintentionally (Verizon, 2019).

Purpose of the Study

The purpose of this research is to describe the process of data mining, provide examples of U.S. organizations that engage in data mining practices, provide methods to remove PII from the internet, and propose privacy preservation techniques to enhance security when handling PII. Currently, there are controversial gaps in the way current U.S. laws loosely regulate the handling and transfer of PII (McCallister, Grance & Scarfone, 2020). As a result, there are minimal regulations to prohibit private sector organizations from profiting from the sale of PII. Consumers are unaware personal data is easily accessible online by anyone willing to search or pay for access to that data (Turow, Hennessy & Draper, 2018). While consumers are generally aware that data breaches are an imminent threat when using the internet, many are blissfully unaware of data mining and related practices that could put their PII and privacy at risk (Turow, Hennessy & Draper, 2018).

Research Questions

The research questions that are explored in this paper are the following:

Q1. What actions should the U.S. government take to address profiting from the sale of PII?
Q2. What types of organizations engage in data mining practices?
Q3. What technological methods can organizations implement to further protect PII?
Q4. What techniques could be used to remove PII from public directory platforms and search engines?

Literature Review

An organization’s growth is measured by the yearly increase or decrease in earnings, and company profit relies heavily on competitive advantage, which often results from big data adoption. Big data adoption is the process by which a company seeks innovative ways to enhance productivity and predict consumer interest (Baig et al., 2019). From 2010 to 2020, internet access increased from 28.8 percent of the world population to 58.7 percent (“World Internet Statistics”, 2020). The evolution of the data mining algorithms placed in many web platforms means the amount of data that can be collected on one individual is astronomical. Data mining has evolved over the last decade, and managing the resulting data is becoming extremely difficult. The more difficult data is to manage, the more prone it is to being leaked or unintentionally disclosed; data that is difficult to manage poses a great security risk (Bertino & Ferrari, 2017). Data collection and loose government regulation allow companies to invade the privacy of every user who is uneducated on the privacy topic (Anderson et al., 2017). This study includes research from several sources, such as research findings, scholarly journals, news articles, surveys, and industry reports, that provide support to demonstrate whether data mining, and profiting from selling mined data, requires further regulation beyond what already exists within the U.S.

Research within this paper will begin with the establishment of what data mining is, the components of data mining, and what types of companies use data mining techniques. After the explanation of data mining techniques, this paper will outline the few privacy laws and regulations, including how they affect the data mining process and whether such laws are currently working to protect user PII. This research will highlight types of companies and include some well-known organizations to provide examples of how data mining is used within these industries. Lastly, this research will provide methods to assist in removing PII from a popular web platform, search engines, and public directories to promote privacy on the internet and provide methods to deter data mining operations from gathering accurate PII data.

Data Mining

User activity on the internet increases each year, which heightens the demand for products and services built on mined data (Kosinski, Wang, Lakkaraju & Leskovec, 2016). Data mining operations extract interesting and useful patterns using statistics and artificial intelligence algorithms (Zhang, 2018). Mined data in finished form has a high value, and because of that value, PII, such as behavioral usage, location data, and other high-value data points tied to an individual identity, is often targeted during data mining operations (Boerman et al., 2017). Data mining algorithms are implemented in many web-based platforms utilized by hundreds of millions of users daily (Han, Kamber & Pei, 2012). Data mining as a knowledge discovery process has five to seven crucial steps for the algorithm to complete its data collection successfully (Han, Kamber & Pei, 2012). The number of steps is based on the algorithm being used (Kumar & Reinartz, 2018). Steps may be combined, as not every data mining organization, or company that mines its own data, functions the same way or utilizes the same data mining process.

Raw data harvest. The algorithm will harvest raw data about a user that must be transformed into a usable product to sell, but processing noise out is mandatory before the sale (Kumar & Reinartz, 2018). Many data collection algorithms must have a noise reduction capability (Han, Kamber & Pei, 2012). Noise in the data collection industry is any unwanted data that is deemed unnecessary to collect about the user (Han, Kamber & Pei, 2012). In some data collections, PII can be considered noise, but many organizations choose to mine PII rather than discard it as noise. For example, certain algorithms may collect only approximate age groups of users and disregard pet data; the noise reduction capability in the algorithm will disregard or discard any data about pets for that user. Within a data collection algorithm’s raw data harvest function, quality can be checked to ensure the algorithm is tracking the correct target to mine (Kumar & Reinartz, 2018).
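The idea can be illustrated in a few lines of code. The following is a minimal sketch, with hypothetical field names, of the noise reduction described above: only the attributes the mining operation targets (here, a user identifier and approximate age group) are kept, and everything else, such as pet data, is discarded.

```python
# Fields the mining operation targets; everything else is treated as noise.
TARGET_FIELDS = {"user_id", "age_group"}

def reduce_noise(raw_records):
    """Discard any field that the collection target does not require."""
    return [
        {field: value for field, value in record.items() if field in TARGET_FIELDS}
        for record in raw_records
    ]

raw = [{"user_id": 17, "age_group": "60+", "pet": "dog"}]
print(reduce_noise(raw))  # [{'user_id': 17, 'age_group': '60+'}]
```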

Data integration. The value of data increases when combined with other data (Miller, 2017). Data integration capabilities must be utilized when multiple sources of data need to be combined (Han, Kamber & Pei, 2012). Combining data offers a reduction in storage as well as a more comprehensive profile of the user whose data is being collected, and many data points can be connected into a combined data mining package (Miller, 2017). For instance, a popular combined data point is user age paired with a Facebook profile. Facebook can determine an approximate percentage of users per age group by combining these data points, and it could sell the result to an organization that may then use the information to advertise to certain age groups.
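As a sketch of this step, the snippet below, with hypothetical sources and field names, merges two datasets keyed on a shared user identifier so that each output record forms a fuller combined profile.

```python
def integrate(profiles, purchases):
    """Merge two data sources into one combined record per user."""
    combined = {p["user_id"]: dict(p) for p in profiles}
    for purchase in purchases:
        record = combined.setdefault(purchase["user_id"],
                                     {"user_id": purchase["user_id"]})
        record.setdefault("purchases", []).append(purchase["item"])
    return list(combined.values())

profiles = [{"user_id": 1, "age_group": "60+"}]
purchases = [{"user_id": 1, "item": "smart tv"}]
print(integrate(profiles, purchases))
# [{'user_id': 1, 'age_group': '60+', 'purchases': ['smart tv']}]
```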

Data collection. Data collection is responsible for sifting through raw data and determining what data is useful and what is not (Han, Kamber & Pei, 2012). Again, the example of pet data and age groups applies: if the target is an age group, the data selection function in the collection algorithm will not utilize pet data. Pet data may or may not be disregarded, depending on the mining algorithm and its developers’ intentions.

Data transformation. Data transformation is where the raw collected data is transformed and consolidated into forms appropriate for further processing (Stagner et al., 2020). In this step, data can be exported to many usable forms for later use (Stagner et al., 2020). An example of transformed data could be targeting the age group of people who are over 60 years of age. The function in the data transformation step will take in the entirety of the bulk data and sift for anyone who meets the specific requirements of the algorithm; in this example, the requirement is anyone who is age 60 and above. Data that the algorithm deems useless is either discarded or ignored via the noise reduction capability most algorithms deploy in present-day mining operations (Han, Kamber & Pei, 2012). Appropriate user data is then transformed into a format that is usable by other mining software.
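A minimal sketch of this step, using hypothetical field names and the age-60-and-above example from the paragraph, might look like the following: the bulk records are filtered to the target cohort and reshaped into a format downstream mining software could consume.

```python
def transform(records, min_age=60):
    """Filter bulk records to the target cohort and normalize their shape."""
    transformed = []
    for record in records:
        if record.get("age", 0) >= min_age:
            transformed.append({
                "id": record["user_id"],
                "segment": "60_plus",
                "interests": sorted(record.get("interests", [])),
            })
    return transformed

bulk = [{"user_id": 1, "age": 64, "interests": ["streaming"]},
        {"user_id": 2, "age": 31, "interests": ["gaming"]}]
print(transform(bulk))  # only the 64-year-old survives the filter
```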

Pattern analysis. Pattern analysis is where the core of the data mining work happens. It is an essential piece of the algorithm that applies intelligent methods to extract data and evaluate patterns (Kosinski, Wang, Lakkaraju & Leskovec, 2016; Stagner et al., 2020). Intelligent methods can include combining specific groups of data to determine whether someone would be interested in a specific product (Kosinski, Wang, Lakkaraju & Leskovec, 2016). One of the main purposes of a data mining algorithm is to determine patterns in the behavior of a user on the internet for the purpose of marketing.
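One common and simple form of pattern extraction is co-occurrence counting, shown in the hypothetical sketch below: interests that frequently appear together across user records surface as candidate marketing patterns.

```python
from collections import Counter
from itertools import combinations

def cooccurring_interests(records):
    """Count how often each pair of interests appears together across users."""
    counts = Counter()
    for record in records:
        for pair in combinations(sorted(set(record["interests"])), 2):
            counts[pair] += 1
    return counts

users = [{"interests": ["streaming", "tv"]},
         {"interests": ["sports", "streaming", "tv"]}]
print(cooccurring_interests(users).most_common(1))  # [(('streaming', 'tv'), 2)]
```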

Pattern organization. Pattern organization in data mining refers to the practice of identifying patterns that represent knowledge based on an interestingness measurement (Kumar & Reinartz, 2018). An interestingness scale rates how useful a formed pattern is (Kosinski, Wang, Lakkaraju & Leskovec, 2016). An organization can effectively use mined data to target advertisements to the user because a data mining algorithm determined that a group of consumers within a certain age group may be interested in a specific product (Kumar & Reinartz, 2018).
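Interestingness measurements vary by vendor and algorithm; one textbook-style example is "lift," which rates how much more often two items co-occur than chance alone would predict. The sketch below is illustrative only, and the counts are hypothetical.

```python
def lift(pair_count, count_a, count_b, n_users):
    """Lift > 1.0 means the pair co-occurs more often than chance predicts."""
    support_pair = pair_count / n_users
    support_a = count_a / n_users
    support_b = count_b / n_users
    return support_pair / (support_a * support_b)

# 40 of 100 users show both interests; 50 show the first, 60 the second.
score = lift(pair_count=40, count_a=50, count_b=60, n_users=100)
print(round(score, 2))  # 1.33, i.e., the pair is moderately "interesting"
```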

Knowledge presentation. One of the final steps in a data mining procedure is knowledge presentation. In this step, data is taken out of raw form and put into a format that is easily understood by a platform and can be used by an organization to make an informed decision (Stagner et al., 2020). Just as many people cannot interpret raw code until it has a graphical interface, mined data cannot be readily understood when it is not in proper presentation form (Kosinski, Wang, Lakkaraju & Leskovec, 2016). When data is cleaned, integrated, organized into patterns, and formed for presentation, it can then be rendered as a visual dataset. In this step, the client organization purchasing the dataset from the mining company may inspect the product before purchase (Kosinski, Wang, Lakkaraju & Leskovec, 2016).
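As a final illustrative sketch, with hypothetical pattern scores, the snippet below folds scored patterns into a plain-language report of the kind a purchasing organization might inspect before buying a dataset.

```python
def present(scored_patterns):
    """Render scored patterns as a human-readable report."""
    lines = ["Pattern report", "--------------"]
    ranked = sorted(scored_patterns.items(), key=lambda item: -item[1])
    for (first, second), score in ranked:
        lines.append(
            f"Users interested in {first} also show interest in {second} "
            f"(lift {score:.2f})"
        )
    return "\n".join(lines)

print(present({("streaming", "tv"): 1.33}))
```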

Organizations that Engage in Data Mining Practices

For a company to market new products and services, the company must have data that is specific to the group of users to whom the company would like to market (Boerman et al., 2017). For example, if Walmart wished to market a new Samsung television to users aged 60 and above with an interest in streaming, it would require a dataset that includes that demographic. A data mining firm may have a dataset of users who fit this demographic available for sale. Walmart would purchase the dataset and organize it into its advertisement system to target that specific group of users (Sutar, 2017).

However, not every organization or platform utilizes mined data to push new products and services. Many platforms do not purchase mined data and instead rely on earlier methods of advertisement, such as search engine advertisements. Organizations and platforms must generate traffic, whether in-store or online, to make a profit (Jackson, 2017). Advertising via a search engine without mined data is a good example of how companies do not have to use invasive data mining practices and how some companies respect user privacy.

Social media. Social media platforms are major vendors of mined data, as many of these platforms deploy intelligent mining operations on their users, and the results can be shared with other organizations without direct user consent or knowledge (Sutar, 2017). As of January 31, 2020, Facebook’s user population was roughly 2.2 billion (About Facebook, 2020). Social media is one of the biggest culprits in privacy disruption, as platforms like Twitter, Facebook, Instagram, and Vine have access to PII whenever the user inputs accurate information rather than a fake name. Fake profile information is explored later in this research as a method to deter data mining operations from collecting PII. Facebook recently admitted to data mining operations that are questionable in terms of ethics (Shahani, 2018). A representative from Facebook sat down with Aarti Shahani, a reporter for National Public Radio (NPR), in 2018 to discuss how the organization allowed a gaming app to retrieve contact profile information without consent (Shahani, 2018). Facebook directly allowed an application to collect the contact profile information of roughly 270,000 users (Shahani, 2018).

Michelle Willson and Tama Leaver (2015) found that a U.S. company, Zynga, produced a game titled FarmVille that massively mined data about its users for profit. Zynga was one of the largest gaming platforms on Facebook in 2015, and the company intentionally hoarded data on users and sold it to companies wishing to market products with advertisements (Willson & Leaver, 2015). In contrast with the detrimental aspects of social media, the U.S. Census Bureau uses social media to reach audiences and share critical information (U.S. Census Bureau, 2020). Facebook and Twitter were widely responsible for sharing the 2016 U.S. Presidential election results with hundreds of millions of Americans (Bossetta, 2018). Social media allows users to connect with other users globally, share photos, and play games, all of which enables communication across both state and country borders.

Law enforcement and large data usage. Large completed datasets offer increasing benefits to law enforcement as social media and genealogy websites have grown in popularity. There are now websites where users may create a profile, have a DNA test conducted, and upload the resulting DNA profile to the platform with the hope of finding unknown ancestors. Law enforcement officials have deployed tactics known as open-source intelligence (OSINT) techniques to discover information about suspects (Hassan, 2019). For instance, if someone is suspected of drug trafficking, U.S. law allows law enforcement officials to browse public social media profiles, even when PII is publicly displayed on the open internet. Often, users fail to realize their profile may be public, which allows law enforcement to openly view the user’s profile (Hassan, 2019).

Some law enforcement experts disagree on using such information to gather intelligence on a suspect and believe gathering intelligence via social media and genealogy websites is an invasion of privacy (Guerrini et al., 2018). In 2018, Joseph DeAngelo, now known as The Golden State Killer, was apprehended and accused of being a serial killer based on evidence gathered from the genealogy platform GEDmatch (Guerrini et al., 2018). GEDmatch allows a user to create a profile on the platform and upload a DNA profile (Phillips, 2018).

DeAngelo was captured through the procedure known as genealogy mapping, the identification of an individual through blood relatives who can be linked to a DNA profile (Guerrini et al., 2018). A DNA profile is classified as PII (U.S. Department of Energy, 2020). However, no law disallows usage of DNA data provided on websites such as GEDmatch. As details of DeAngelo’s arrest emerged, privacy concerns arose regarding whether the U.S. government should have access to ancestral data for the purpose of capturing criminals. One question that emerged is whether U.S. law enforcement should have access to genealogy platforms such as GEDmatch. The matter has not been settled, but research shows that of 1,587 participants asked whether the U.S. Government should have access to GEDmatch, 79 percent supported the U.S. government having such access (Guerrini et al., 2018).

Public directory services. A good example of profit-based platforms that mine and purchase data is public directories, and these platforms are a tremendous risk concerning user privacy (Cappel & Shah, 2017). Companies such as Whitepages and FamilyTreeNow allow users to obtain PII directly from the website. For example, a user who wanted to find a person’s phone number could visit FamilyTreeNow.com (Cappel & Shah, 2017). The user is then prompted to enter a first and/or last name and city (Cappel & Shah, 2017). Once the user queries the results and finds the intended person, the user has access to what can be an abundance of PII. On many occasions, this information includes a primary residential address, telephone number, and associates. The returned result could resemble Figure 1, a public directory platform displaying PII for a returned search query.

Marissa Lang (2017), a San Francisco Chronicle journalist, covered how FamilyTreeNow and similar platforms shred privacy but noted that there is no legislation preventing the misuse of openly displayed PII. In one case from 1991, a stalker shot and killed a 20-year-old woman after purchasing her information from a directory service that sold data (Waddell, 2017). FamilyTreeNow and similar platforms offer services that allow a user to discover a family tree without an individual’s consent (Waddell, 2017). Public directories can offer background profiles on family members without consent (Waddell, 2017). Platforms like FamilyTreeNow open the door to the possibility of stalking (Waddell, 2017).

Search engines. Saima Salim of Digital Information World reported that search engines are some of the most widely visited websites in the world (Salim, 2020). Because they are so widely visited, search engines like Google and Bing can deploy data mining techniques at scale. In 2020, Statista reported that Google continues to rank as the number one search engine in the world and is projected to stay that way until the year 2025 (Statista, 2020). Platforms such as Google and Bing can deploy measures to capture location history that can be used to identify an individual’s approximate location, which is deemed PII (General Services Administration, 2020). In 2016, the U.S. Court of Appeals for the First Circuit held that Global Positioning System (GPS) data falls within the scope of PII and, furthermore, that this holds even for a user of a mobile application or service who is not logged in (Yershov v. Gannett Satellite Information Network, Inc., d/b/a USA Today, 2016).

Data Mining Security Risks

Every industry has its own information systems, and information security is critical whether or not data mining is involved. Data mining can be a dangerous task because mined data often contains highly sensitive information, such as PII (Kosinski et al., 2016). There are many ways PII can intentionally or unintentionally exit an organization, and whether the exit is considered a breach depends on how the mined data left the organization (Crain, 2016). The first way mined data can exit an organization is intentionally, for profit (Han et al., 2012). Organizations that are equipped to mine data will often include PII in the mining and usually intend to sell the collected datasets (Han et al., 2012). One example would be a data broker company, such as Acxiom, that needs a buyer for a dataset: the broker finds a company to purchase the dataset and then arranges a deal for the data to be sold and transmitted to the buyer.

One of the largest risks in data mining is how the anonymization of data is performed (Torra & Navarro-Arribas, 2016). Data anonymization is the process of keeping the identity of a user hidden (Torra & Navarro-Arribas, 2016). An illustration of data anonymization could be a school wishing to publish grades for a certain demographic of students without revealing the students’ identities. Data anonymization is an extremely difficult task to perform because the algorithm must be capable of anonymizing massive amounts of data (Torra & Navarro-Arribas, 2016). Another risk within data mining is the use, or lack, of encryption. Many U.S.-based companies are not encrypting basic internet services, such as web or email servers. Connor Reynolds (2020), a reporter for the Computer Business Review, noted that fewer than 30 percent of U.S.-based companies deploy encryption. The low encryption rates are often blamed on systems and networks being too complex (Reynolds, 2020).
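The school-grades illustration above can be sketched in code. The following is a minimal, hypothetical example of one well-known anonymization approach, k-anonymity-style generalization: quasi-identifiers are coarsened, and only groups containing at least k students are published, so no released figure maps back to a single identity.

```python
from collections import defaultdict

def k_anonymous_averages(records, k=3):
    """Publish average grades only for generalized groups of size >= k."""
    groups = defaultdict(list)
    for r in records:
        # Generalize quasi-identifiers: bucket age by decade, truncate ZIP code.
        bucket = (r["age"] // 10 * 10, r["zip"][:3])
        groups[bucket].append(r["grade"])
    # Suppress groups smaller than k; release only aggregates, never names.
    return {bucket: sum(grades) / len(grades)
            for bucket, grades in groups.items() if len(grades) >= k}

students = [{"age": 21, "zip": "98101", "grade": 88},
            {"age": 23, "zip": "98103", "grade": 92},
            {"age": 24, "zip": "98109", "grade": 79}]
print(k_anonymous_averages(students))  # {(20, '981'): 86.33...}
```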

Publications and Guidelines for Company Data Security

The National Institute of Standards and Technology’s (NIST) Special Publication (SP) 800-122, “Guide to Protecting the Confidentiality of Personally Identifiable Information”, provides recommendations for federal organizations to enhance the security of any network or system that transmits or stores PII (McCallister et al., 2020). NIST SP 800-122 offers the same guidelines as recommendations for private companies to follow as a baseline of security (McCallister et al., 2020). Privately-owned companies may use NIST SP 800-122 to enhance security controls within the enterprise but are not required to do so (McCallister et al., 2020). NIST SP 800-122 provides security control recommendations on access control, audit and accountability, identification and authentication, media protection, planning, risk assessment, and system communications protection (McCallister et al., 2020).

Another information security guideline from which a company may adopt its security controls is the International Organization for Standardization and the International Electrotechnical Commission (ISO/IEC) 27001 Information Security Management Systems (ISMS) standard (International Organization for Standardization, 2020). ISO 27001 is a well-known set of standards and tools to assist an organization in securing itself from data leaks, data theft, and more. Using the ISO 27001 controls can enable a company to manage financial information, intellectual property (IP), and PII. The ISO 27001 series contains more than 12 control families that cover encryption controls, human resources security, and how data should be classified according to sensitivity (ISO/IEC 27001, 2020). In the ISO 27001 control families, PII is classified as the most sensitive material in the organization and should be carefully safeguarded to prevent loss. When PII is stored or in transmission, both NIST SP 800-122 and ISO/IEC 27001 recommend that PII be encrypted with a strong algorithm, though implementation is at the organization’s discretion and is not mandatory for private companies (NIST SP 800-122; ISO/IEC 27001, 2020).
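As a concrete illustration of that recommendation, the sketch below encrypts a PII field at rest using Fernet (AES-based authenticated encryption) from the widely used Python cryptography library; the field value is hypothetical, and real deployments would keep the key in a dedicated key management system.

```python
from cryptography.fernet import Fernet

# In practice the key lives in a key management system, never in source code.
key = Fernet.generate_key()
cipher = Fernet(key)

ciphertext = cipher.encrypt(b"ssn=123-45-6789")  # safe to store at rest
assert cipher.decrypt(ciphertext) == b"ssn=123-45-6789"
print(ciphertext[:16], "...")  # opaque token; useless without the key
```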

Like many recommended guidelines and publications put forth by NIST, security vendors, and similar organizations, SP 800-122 and ISO 27001 may not work for every organization as intended because private sector companies may not be required to implement recommended publications or guidelines in a company security plan. For example, both SP 800-122 and ISO 27001 provide mandatory guidelines for certain federal entities to securely protect PII but fail to extend to private sector companies, allowing private organizations to self-regulate (NIST SP 800-122; ISO/IEC 27001, 2020). Any publication made as a set of recommendations may not always work for every organization. Today, many organizations are building and running custom software that may not have the best security practices in mind when handling mined data that often includes PII. Regulations and guidelines put forth by NIST and ISO/IEC are not mandatory for private companies, nor are the guidelines a form of legislation to be enforced by law enforcement officials (NIST SP 800-122; ISO/IEC 27001, 2020).

Existing U.S. Legislation

Financial sector companies are directed to maintain compliance with the Federal Financial Institutions Examination Council’s (FFIEC) standards and regulations (FFIEC, 2020). Organizations that deal with health-related PII, such as a doctor’s office, must abide by guidelines and laws such as the HIPAA Security and Privacy Rules and would not be able to sell mined data (“HIPAA Privacy Rule”, 2020). However, the U.S. Government could deploy mining operations targeting health-related PII for research under the routine use clause of the Privacy Act of 1974 (“Privacy Act, Office of Privacy and Open Government, U.S. Department of Commerce”, 2020).

The Privacy Act of 1974. The Privacy Act of 1974 was a step forward for consumer data privacy (Solove, 2016). The act corrected many privacy concerns, such as limiting the collection and use of PII and records by federal agencies (Solove, 2016). One gap in the Privacy Act of 1974 is that it does not cover the private sector, only certain federal agencies (Privacy Act of 1974, 1974). The Privacy Act of 1974 allows individuals to correct personal information in any federal database, but not in a private company database (Privacy Act of 1974, 1974). The act made important strides in privacy but does not cover most companies in the U.S., as most belong to the private sector. Another shortcoming of the Privacy Act of 1974 is the routine use exception, under which information may be disclosed for “routine use” if the disclosure is compatible with the purpose for which the agency collected the information (Solove, 2016). Through the routine use disclosure, the Federal government has the power to override the Privacy Act of 1974 when a use is deemed routine (Solove, 2016).

California Consumer Privacy Act. An example of positive action for consumer privacy rights is the California Consumer Privacy Act (CCPA) (CCPA, 2018). Under the CCPA, any resident of the State of California can remove their PII from any covered database, not just a federal database. The law also allows any user within the U.S. to remove their data if the entity holding it resides in California. California has taken the first step toward better privacy than what the U.S. federal government currently offers. Another positive that the CCPA offers is the right to know what personal information is collected, used, shared, or sold (CCPA, 2018). The right to delete personal information held by businesses offers consumers the peace of mind of knowing their PII is no longer stored in a database for that specific organization in California. The right to opt out provides consumers the right to stop an organization from selling their PII (CCPA, 2018).

Techniques for Removing PII Exist

There are methods to remove PII from many data mining firms, public directories, and search engines, but these methods can prove difficult to complete. Removing data from a search engine such as Google or Bing requires the data to be removed from the source. Search engines only index material already present on the internet (Google, 2020). For a website to be indexed by Google or Bing, the web page must be live or publicly visible without any form of protection, such as a password or restrictive directory permissions that block the general public (“Remove information from Google; Removal from Bing”, n.d.).

Removing data from a search engine is generally the same process across search engines like Google and Bing (“Remove information from Google; Removal from Bing”, n.d.). For example, if a user wishes to remove their information from Google, the user must contact the search engine help desk with a link to the profile and an explanation of why the user would like the link taken down from Google Search (“Remove information from Google”, 2020). Google has complete discretion and may not comply with the user’s request for data removal. Google states that removing information from the platform does not remove the data from the internet itself, only from being searchable via the Google Search engine (“Remove information from Google”, 2020).

Removing a search engine indexed public directory profile at the source could be more difficult than removing the link at the search engine (Vaughn & Chen, 2015). A search engine indexes a web page and, when certain keywords or phrases are inputted, content is then displayed in numerical order via page rank (Vaughn & Chen, 2015). If a user were to search for their own first and last name in quotations, the user may notice that historical content pertaining to that search appears in the results, such as old social media profiles, public comments on social media, and photos that may have once been public. To remove data at the source, a user must first identify his or her profile. For instance, if a user finds their profile on FamilyTreeNow and notices that personal data is present, he or she may choose to opt out of having that data displayed (FamilyTreeNow, 2020). Many public directories like FamilyTreeNow and Whitepages (Whitepages, 2020) provide various opt-out methods so a user can remove their profile from the platform’s database.
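For users who script their own privacy checks, the sketch below, using a hypothetical profile URL, shows one way to verify that an opt-out request took effect: periodically request the old profile page and treat a 404 response as evidence the listing was removed.

```python
import requests

def profile_removed(profile_url: str) -> bool:
    """Return True if the old profile page no longer resolves."""
    response = requests.get(profile_url, allow_redirects=False, timeout=10)
    return response.status_code == 404

# Hypothetical URL; substitute the profile link found via a search engine.
print(profile_removed("https://directory.example.com/profile/jane-doe"))
```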

Summary

The purpose of this research is to describe the process of data mining, determine whether there are any U.S. data privacy laws, identify whether data mining requires further legislation, provide examples of U.S. organizations that engage in data mining practices, provide methods to remove PII from the internet, and propose privacy preservation techniques to enhance security when handling PII. Datasets are generally derived from data mining operations that identify PII, collect it, and package the data for sale. Data mining is the exploration and analysis of large data to discover meaningful patterns that can be used to form datasets and process them for sale (Kosinski, Wang, Lakkaraju & Leskovec, 2016). Currently, there are controversial gaps in the way current U.S. laws loosely regulate the handling and transfer of PII (McCallister, Grance & Scarfone, 2020). For companies to protect the PII of all users, companies must have access to standards and regulations put forth by industry experts. NIST SP 800-122 and the ISO/IEC 27001 security control series are a great start for any organization seeking to secure itself from an array of cyber-attacks (NIST SP 800-122; ISO/IEC 27001, 2020).

PII protection on the internet is currently loosely regulated, which allows organizations to mine and sell data that often includes PII (Turow, Hennessy & Draper, 2018). Legislation that would further prevent the mining and selling of PII proves scarce. Throughout this research, it was discovered that, while there are laws governing companies and groups at the federal level, the private sector greatly lacks hard privacy legislation (McCallister, Grance & Scarfone, 2020). While this lack of legislation is not ideal, it does allow for some benefits (Hassan, 2019). For instance, the lack of private-sector legislation allows law enforcement to access the public profiles and ancestral DNA of potential criminal suspects. Lastly, this research investigated current practices of data mining and selling, outlining these practices in detail. With so much PII on the internet available for public view, securing data becomes an almost impossible task to do properly (AlKhatib & Basheer, 2019). However, there are methods available should a user choose to remove a record from Google (Remove Information from Google, n.d.).

Discussion of the Findings

The purpose of this research was to describe the process of data mining, provide examples of U.S. organizations that engage in data mining practices, provide methods to remove PII from the internet, and propose privacy preservation techniques to enhance security when handling PII. Currently, there are controversial gaps in the way U.S. laws loosely regulate the handling and transfer of PII (McCallister, Grance & Scarfone, 2020). As a result, there are minimal regulations to prohibit private sector organizations from selling PII. Consumers are unaware personal data is easily accessible online by anyone willing to search or pay for access to that data (Turow, Hennessy & Draper, 2018). While consumers are generally aware that data breaches are an imminent threat when using the internet, many are blissfully unaware of data mining and related practices that could put their PII and privacy at risk (Turow, Hennessy & Draper, 2018).

Consumer Unawareness

The research concluded that consumers who purchase items online within the U.S. are, by a majority, uneducated in internet privacy and do not know the full implications of having their PII collected and shared, which can result in an increased risk of PII being stolen and often leads to identity theft. Aaron Smith (2014), who assisted Pew Research in measuring public knowledge of the web within the U.S., concluded that 50 percent of the participants did not know what a privacy policy is. Reporter Sergiu Gatlan (2019) reported that a web platform that implements a privacy policy does not have to do so under U.S. law, which allows web platform owners to forgo a privacy policy altogether. The absence of a privacy policy on websites hosted on U.S. servers that serve U.S. consumers is a compliance issue and should be a legal one (Gatlan, 2019). Consumer awareness is a major piece of protecting PII: the consumer must be aware their data is being mined and sold for profit.

Existing Legislation

Current research continues to show that the lack of a blanket U.S. privacy law, one that would remedy organizations profiteering from collecting and selling PII or platforms that allow the public to view PII, such as a telephone number or personal address, grows more dangerous each year. The internet continues to grow daily, and as it grows, the challenge of balancing optimal security and privacy practices expands. Managing big data is a difficult task, but it is not impossible. Research highlights a dire need for a blanket U.S. privacy law that prevents PII profiteering by organizations that choose to participate in data mining operations. The privacy system at the executive level of the U.S. Government is broken and allows millions of records to be exposed yearly through accidental and intentional loss.

If data is lost intentionally, for example, if a disgruntled employee steals the company customer list and uses it at their next job, the employee will face punishment under the Computer Fraud and Abuse Act for breaking that law. Unintentional data loss does happen but is not punishable unless specific federal compliance regulations, such as HIPAA’s or the FFIEC’s published guidelines, were violated, for example through a resulting security breach. However, an organization can elect to share responsibility for PII protection with the user. For instance, if a customer signs up on a web platform, a privacy policy may state that consumer data is a shared responsibility between the platform and the user. Research highlights problems with this model because, in the event of a data breach, attribution, or laying blame, proves difficult. There needs to be a U.S. federal law that mandates PII not be sold for profit to public directories and other similar platforms.

Privacy Act of 1974

The research concluded that the Privacy Act of 1974 was a step forward for consumer privacy; however, the act fails to cover many companies within the U.S., as most companies belong to the private sector and are not operated by a government entity (Solove, 2016). Within the Privacy Act of 1974, the routine use rule allows the government to override the act’s minimal PII protection and use U.S. consumer PII for any reason deemed necessary, such as research or making decisions for public benefit. Industry-specific regulation is geared toward federal entities only. Research showed that the Privacy Act of 1974 is outdated and lacks coverage of modern technology in 2020, such as autonomous and machine learning-based algorithms (Solove, 2016).

The Privacy Act of 1974 does have benefits, such as allowing any individual to correct information in a federal database, much as the CCPA allows correction of data in any database housed within the State of California (CCPA, 2018). The main difference between the CCPA and the Privacy Act of 1974 is that the CCPA allows a user to correct PII in any covered database, regardless of whether the database belongs to a public or private sector organization, while the Privacy Act only allows correction in federal databases. One result of the lack of a blanket U.S. privacy law is that many organizations choose to participate in data mining practices for a multitude of reasons, which can include advertising specific products based upon a user’s interestingness metric result or simply selling a completed data package about a specific group of users.

Organizations that Engage in Data Mining Practices & Compliance

The research concluded that many organizations participate in some form of data collection, whether it is data mining, scraping, or diagnostic data used for solving software issues. Research showed that, depending on the industry in which the organization operates, the company could be subject to specific U.S. regulations regarding the protection of PII. For example, a privately-owned organization such as Whitepages, Inc. would not have to report yearly security assessments to government oversight because Whitepages does not need to comply with federal regulations such as HIPAA or the FFIEC’s. Whitepages, Inc. and similar companies do not have to abide by HIPAA because Whitepages does not display any health data about an individual (HIPAA, 2019).

FamilyTreeNow is an example within this research that proved web platforms can be dangerous. In one case from 1991, a stalker shot and killed a 20-year-old woman after purchasing her information from a directory service that sold data (Waddell, 2017). Anyone with internet access has access to PII on such platforms, not the U.S. Government alone. Crimes have been committed with PII obtained from public directories, such as Whitepages and FamilyTreeNow, yet present regulation remains loose and does not prevent PII from being displayed openly.

Some entities deal with PII for positive reasons, such as managing health information or providing easy access to banking records. The research highlighted that entities that deal with health-related or financial PII must abide by industry-specific guidelines or government regulations. For example, HIPAA covers any healthcare organization that handles sensitive patient health PII. The Health Insurance Portability and Accountability Act is the result of years of joint U.S. Government and private sector research and is accepted as an industry standard in computer security. Prior to HIPAA, there was no generally accepted set of security and privacy standards for health care entities in the U.S. (HIPAA, 2002). Often, health-related PII includes diagnostic information, family history of illness, and current medications the patient may be taking, all of which are classified as PII by the GSA (General Services Administration, 2020).

Laws like HIPAA do not prevent all disclosure of protected information to third parties and, under certain circumstances, could allow accidental data disclosure in the form of data loss (Iguchi et al., 2018). Under HIPAA, the U.S. government can use and disclose protected medical information without an individual’s consent (Iguchi et al., 2018). For example, the U.S. Government could use PII for research purposes; however, even data controlled by the U.S. Government could be prone to data loss. Healthcare companies that implement all HIPAA safeguards correctly have a good security posture in place. HIPAA and FFIEC regulations are well enforced, meaning that if a company suffers a security breach and is found to be negligent, for example, by using weak encryption algorithms, the company could be fined under the HIPAA Security and Privacy Rules for failing to maintain compliance regarding encryption use (HIPAA Security Rule; HIPAA Privacy Rule, 2002).

There are many practical applications of PII use for public benefit. For example, research showed that U.S. law enforcement uses techniques involving PII to catch criminals. In 2018, Joseph DeAngelo, now known as the Golden State Killer, was apprehended and accused of being a serial killer based on evidence gathered from the genealogy platform GEDmatch (Guerrini et al., 2018). GEDmatch gave U.S. law enforcement access to its database of DNA and genealogy records, which assisted police in identifying DeAngelo. DeAngelo had evaded police for over 30 years because investigators could not link the DNA profile on record to a suspect. GEDmatch enabled the match by providing PII access to law enforcement, who used the data to make a criminal accusation that later turned into a conviction in U.S. court (Guerrini et al., 2018). Another positive application highlighted within this research is the use of PII for research. Under the Privacy Act of 1974, the routine use clause enables the U.S. government to use PII when conducting research, so long as the research is done in a secure way (Privacy Act of 1974). For example, the U.S. must count its population every 10 years in a census. The U.S. Census Bureau is responsible for accounting for everyone within the United States, which supports budgeting, healthcare, and many other benefits that stem from an accurate population count.

Financial Sector & FFIEC

The research revealed that the FFIEC guidelines are a set of standards for online banking issued by the FFIEC in October 2005. The goal of the recommended standards and guidelines is to increase security within financial sector organizations. The research showed that the FFIEC standards require multifactor authentication and encryption to maintain compliance. Financial sector companies within the U.S. must abide by the security and privacy guidelines published under FFIEC. Sensitive PII should be encrypted in transit and at rest; however, many companies still fail to implement encryption correctly or at all. Sensitive information within the financial sector is still being targeted and attacked successfully, resulting in PII being leaked onto the internet for anyone to view.
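To make the encryption-at-rest expectation concrete, the following is a minimal sketch of encrypting a single PII field before storage, assuming Python and the third-party cryptography package; FFIEC guidance does not prescribe a particular library or algorithm, and the key handling here is deliberately simplified for illustration.

```python
from cryptography.fernet import Fernet

# In practice the key would live in a hardware security module or key
# management service, never alongside the data it protects.
key = Fernet.generate_key()
cipher = Fernet(key)

# Encrypt a sensitive PII field before it is written to disk or a database.
account_number = "1234567890"
token = cipher.encrypt(account_number.encode("utf-8"))

# Only holders of the key can recover the plaintext.
assert cipher.decrypt(token).decode("utf-8") == account_number
```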

Industry Standard Security Controls for PII

The ISO 27001 ISMS and NIST SP 800-122 guidelines both offer many recommended methods, techniques, and controls that an organization may freely use to strengthen its cybersecurity. The main difference is emphasis: ISO 27001 focuses more on technological security controls, whereas NIST SP 800-122 focuses more on risk. Both publications suit any organization that wishes to strengthen cybersecurity and help prevent PII loss through unintentional or intentional means, and both recommend strong encryption controls when handling PII.
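Among the safeguards NIST SP 800-122 discusses is de-identifying PII when full values are not needed. The sketch below illustrates one such control, field-level pseudonymization with a keyed hash, using only Python's standard library; the key, record layout, and choice of fields are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in production it would come from a key management service.
PSEUDONYM_KEY = b"replace-with-a-securely-stored-key"

def pseudonymize(value: str) -> str:
    """Replace a PII value with a keyed hash so records stay linkable
    across datasets without exposing the underlying identifier."""
    return hmac.new(PSEUDONYM_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"name": "Jane Doe", "ssn": "123-45-6789", "state": "CA"}
DIRECT_IDENTIFIERS = {"name", "ssn"}

# Only direct identifiers are replaced; low-risk fields pass through unchanged.
safe_record = {
    field: pseudonymize(value) if field in DIRECT_IDENTIFIERS else value
    for field, value in record.items()
}
print(safe_record)
```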

Techniques for Data Removal

The research highlighted that there are massive amounts of PII on the internet. Whether a company has been breached or engages in data mining practices, a consumer should consider taking the time to remove their PII from common points of access (Zhang, 2018). Many platforms provide opt-out measures but are not required to do so by law unless the company does business within the State of California and must therefore maintain compliance with the CCPA. Public directory platforms offer different opt-out or data removal tools for users to remove their PII. In many situations, though not all, PII can be removed with a request to the search engine or to the platform itself, such as a public directory.

Search Engine PII Removal

There are two methods to remove data from Google and similar search engines. Research showed that removing data at the source, such as deleting a public directory profile that contains PII directly from the directory, is better than removing only the link from Google. When a user removes the link from Google alone, the data source remains. Search engines such as Google, Bing, Baidu, and DuckDuckGo index the internet: when pages are published, for example on a blog, the search engine crawls the website looking for new pages, then adds each new page to its database, making the page searchable.
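The crawl-and-index cycle described above can be illustrated with a toy example. The sketch below, in Python using only the standard library, fetches pages, records which words appear on which URLs, and follows links; real search engines add robots.txt handling, ranking, deduplication, and enormous scale, and the starting URL here is purely illustrative.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

# Toy inverted index: maps each word to the set of pages containing it.
index: dict[str, set[str]] = {}

class PageParser(HTMLParser):
    """Collects the outgoing links and visible text of one HTML page."""
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []
        self.words: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

    def handle_data(self, data):
        self.words.extend(data.lower().split())

def crawl(start_url: str, max_pages: int = 10) -> None:
    """Breadth-first crawl: fetch a page, index its words, follow its links."""
    seen: set[str] = set()
    frontier = [start_url]
    while frontier and len(seen) < max_pages:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # unreachable or removed pages never enter the index
        parser = PageParser(url)
        parser.feed(html)
        for word in parser.words:
            index.setdefault(word, set()).add(url)
        frontier.extend(parser.links)

crawl("https://example.com/")
print(index.get("example", set()))  # pages on which the word "example" appears
```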

Removing a link from Google and similar search engines is not difficult. Google provides a webmaster tool to remove outdated content, personal information, or content with legal issues. Figure 2 provides an example of how data can be removed from Google.

FIGURE 2 REDACTED IMAGE.

Figure 2. Remove content from Google. Adapted from https://www.google.com/webmasters/tools/removals. Copyright 2020 by Google.

The second method to remove a searchable entry from Google and many similar search engines is to remove the source directly. For example, if a user wanted to remove a FamilyTreeNow entry, that user must find their profile in order to remove it. FamilyTreeNow and similar platforms may offer an opt-out procedure but may not be required to do so within the U.S. When the opt-out procedure is followed, the platform may specify a certain number of hours or days before the removal request is approved. Once approved, research showed that FamilyTreeNow and similar platforms will remove the entry. When the entry is removed at the source, the search engine detects the link as removed and drops it from its database, rendering the result unsearchable. The removed page will no longer show up in results; however, the user's PII can still resurface in the future.
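The mechanism by which a search engine notices a source removal can be sketched the same way: on a re-crawl, a URL that now returns HTTP 404 (Not Found) or 410 (Gone) is dropped from the index. The check below is a minimal illustration in Python; the URLs are hypothetical, and real engines use additional signals such as sitemaps and explicit removal requests.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

def still_published(url: str) -> bool:
    """Re-fetch a previously indexed URL; 404/410 means the source page is gone."""
    try:
        with urlopen(url, timeout=5):
            return True
    except HTTPError as err:
        # 410 Gone and 404 Not Found both signal that the profile no longer exists.
        return err.code not in (404, 410)
    except OSError:
        return False  # host unreachable; treated as removed in this sketch

# Hypothetical entries a search engine would re-check on its next crawl.
indexed_urls = ["https://example.com/", "https://example.com/removed-profile"]
indexed_urls = [u for u in indexed_urls if still_published(u)]
print(indexed_urls)  # URLs that failed the check drop out of search results
```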

Even after removal, a search engine can index a user's data again. For example, if a platform like FamilyTreeNow honors an opt-out request and removes a profile, new PII can later appear on the platform and be indexed by the search engine. Platforms like FamilyTreeNow and Whitepages can rediscover PII and display it under a new profile, and the user would then have to find the new profile and follow the opt-out procedure again.

Summary

To summarize, the U.S. does not have a blanket privacy law prohibiting the mining and sale of PII for profit. Organizations in certain U.S. sectors, such as financial companies and healthcare organizations, must abide by standards such as the FFIEC guidelines and the HIPAA Security and Privacy Rules, which restrict how PII may be used and disclosed. Companies that engage in data mining may sell completed data packages or purchase mined data packages to learn more about consumers. For example, a data mining company may mine specific PII about a certain group of users and assemble it into a complete data package; another company may then purchase that package to target advertisements toward that specific group. The U.S. does not have a law prohibiting such behavior.

There are recommended publications and guidelines for organizations in certain industries to follow. For example, banking entities must maintain compliance with the FFIEC guidelines, while hospitals, doctors' offices, and medical billing companies must maintain HIPAA compliance, which includes encryption and other industry-recommended security controls. Many organizations and public directory platforms are not required by federal law to implement security controls or to display warning banners and privacy policies to their customers or users. Many popular search engines and some public directory platforms provide methods to remove data from their databases. Removing PII from the internet can be challenging but is not impossible.

Conclusion

The purpose of this research was to describe the process of data mining, provide examples of U.S. organizations that engage in data mining practices, provide methods to remove PII from the internet, and propose privacy preservation techniques to enhance security when handling PII. Throughout this study, the research highlighted consumer unawareness of privacy practices related to PII, as well as the general lack of legislation regarding PII in the private sector and data mining practices. The research examined legislation currently in place for the government, financial, and health sectors, including HIPAA and the FFIEC guidelines. Additionally, this research detailed the data mining process: raw data harvest, data integration, data collection, data transformation, pattern analysis, pattern organization, and knowledge presentation. Lastly, organizations that engage in heavy data mining practices were discussed, including social media websites and privately owned databases such as Whitepages and FamilyTreeNow.

To summarize, consumer PII is rampant on the internet, and most users are unaware of the legislation, or lack thereof, that currently governs private sector organizations. The research found that, while many users may be aware that their data can be leaked unintentionally, most are not educated with regard to data mining processes. Consumers purchasing items online within the U.S. do not understand the full implications of having their PII collected and shared, which results in a high risk of PII being stolen and often leads to identity theft (Turow, Hennessy & Draper, 2018). To remedy this, it is important not only for the private sector to be federally regulated but also for consumers to become more educated about the risks and implications.

For PII to have a chance of safety on the internet, comprehensive legislation must first be created. The research found that, as it stands, there are loose guidelines for the private sector but no consequences if those guidelines are ignored. As an example, private sector website databases such as Whitepages and FamilyTreeNow hold PII on thousands of users without their consent, displaying any information available to build a complete profile (Waddell, 2017). Anyone with an internet connection can look up any user in the world and come away with a wealth of information to use for their own purposes. Per FamilyTreeNow's privacy policy, much of the information displayed on the platform comes from purchased data. While legal limits such as the Fair Credit Reporting Act restrict how PII found on the internet can be used when evaluating someone for a job, they do not prevent the data from being used for malicious purposes (Fair Credit Reporting Act of 1970).

Meanwhile, the research found that even legislation such as HIPAA, and guidance such as the FFIEC guidelines, does not prevent all disclosure of PII to third parties and, under certain circumstances, could allow accidental data disclosure in the form of data loss (Iguchi et al., 2018). The U.S. government is permitted to use PII for research purposes, and that data could be prone to unintentional loss. As the internet has evolved, legislation must evolve along with it. Users and consumers are willing to share ever more PII to make their lives easier, whether to receive newsletters, services, or products, or simply to create a social media profile and connect with friends and loved ones.

The research concluded that some U.S. companies provide opt-out or removal methods to remove sensitive PII from their platforms. Public directory services such as Whitepages do provide removal methods, but a removed user's PII may reappear on the platform in the future. Whitepages and similar public directory platforms can remove a user profile but do not guarantee that the user's PII will be barred from the platform going forward.

Research showed that search engines such as Google, Bing, and Baidu provide methods to remove outdated content, illegal content, PII, or content that is pending legal action. The primary removal method operates at the search engine itself, where a user removes the searchable result. The secondary method removes the content directly at the source, such as Whitepages or FamilyTreeNow: the user deletes their profile from the platform, and the profile can no longer be indexed by the search engine and displayed openly.

References

AlKhatib, B., & Basheer, R. (2019). Crawling the dark web: A conceptual perspective, challenges and implementation. Journal of Digital Information Management, 17(2), 51. doi:10.6025/jdim/2019/17/2/51-60

Amran, A., Zaaba, Z., & Mahinderjit Singh, M. (2018). Habituation effects in computer security warning. Information Security Journal: A Global Perspective, 27(4), 192-204. doi:10.1080/19393555.2018.1505008

Bennett, K. W., & Robertson, J. (2019). Security in the cloud: Understanding your responsibility. doi:10.1117/12.2521821

Boerman, S., Kruikemeier, S., & Zuiderveen Borgesius, F. (2017). Online behavioral advertising: A literature review and research agenda. Journal of Advertising, 46(3), 363-376. doi:10.1080/00913367.2017.1339368

Bossetta, M. (2018). The digital architectures of social media: Comparing political campaigning on Facebook, Twitter, Instagram, and Snapchat in the 2016 U.S. election. Journalism & Mass Communication Quarterly, 95(2), 471-496. doi:10.1177/1077699018763307

Bertino, E., & Ferrari, E. (2017). Big data security and privacy. Studies in Big Data, 425-439. doi:10.1007/978-3-319-61893-7_25

Bing. (2020). Bing content removal tool. Retrieved from Bing: https://www.bing.com/webmaster/help/bing-content-removal-tool-cb6c294d

U.S. Census Bureau. (2020). Census social media graphics. Retrieved from U.S. Census Bureau: https://www.census.gov/partners/2020-materials/social-media-graphics.html

Bush, D. (2016). How data breaches lead to fraud. Network Security, 2016(7), 11-13. doi:10.1016/s1353-4858(16)30069-1

Cappel, J., & Shah, V. (2017). A case study in personal privacy. Issues in Information Systems, 18(3), 62-68. Retrieved from http://www.iacis.org/iis/2017/3_iis_2017_62-68.pdf

Chen, H., Beaudoin, C., & Hong, T. (2016). Securing online privacy: An empirical test on Internet scam victimization, online privacy concerns, and privacy protection behaviors. Computers in Human Behavior, 70, 291-302. doi:10.1016/j.chb.2017.01.003

Council of Economic Advisers. (2018). The cost of malicious cyber activity to the U.S. economy. Washington, DC. Retrieved from White House: https://www.whitehouse.gov/wp-content/uploads/2018/03/The-Cost-of-Malicious-Cyber-Activity-to-the-U.S._Economy.pdf

Crain, M. (2016). The limits of transparency: Data brokers and commodification. SAGE Journals, 20(1), 88-104. doi:10.1177/1461444816657096

Chatzimpyrros, M., Solomos, K., & Ioannidis, S. (2020). You shall not register! Detecting privacy leaks across registration forms. Computer Security, 91-104. doi:10.1007/978-3-030-42051-2_7

"Customer Information and Privacy." (2020). Retrieved from American Bar Association: https://www.americanbar.org/groups/business_law/migrated/safeselling/privacy

Ducasse, J. (2017). Cyber kill chain methodology, 1(1), 1. Retrieved from Gujarat Research Society: http://gujaratresearchsociety.in/index.php/JGRS/article/view/1340/2180

Facebook World Stats and Penetration in the World. (2020). Retrieved from Internet World Statistics: https://www.internetworldstats.com/facebook.htm

Fair Credit Reporting Act, 15 U.S.C. § 1681

Federal Financial Institutions Examination Council (FFIEC). (2020). Retrieved from FFIEC IT Standards: https://ithandbook.ffiec.gov/it-booklets/operations/risk-mitigation-and-control-implementation/policies,-standards,-and-procedures/standards.aspx

Federal Trade Commission. (2020). Protecting personal information: A guide for business. Retrieved from Federal Trade Commission: https://www.ftc.gov/tips-advice/business-center/guidance/protecting-personal-information-guide-business

FTC: Your online guide to legal information and legal services in Pennsylvania. (2020). Retrieved from Palawhelp: https://www.palawhelp.org/resource/background-checks-what-job-applicants-and-employees-should-know?ref=A7fD3

FamilyTreeNow. Privacy Policy. (2020). Retrieved from FamilyTreeNow: https://www.familytreenow.com/privacy

Funkhouser, K., Malloy, M., Alp, E., Poon, P., & Barford, P. (2018). Device graphing by example. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. doi:10.1145/3219819.3219852

Goodwin, P., & Smith, A. (2020). The State of IT Resilience. Retrieved 30 March 2020, from https://www.zerto.com/wp-content/uploads/2018/08/State-of-IT-Resilience-2018-IDC.pdf

Google. (2020). Remove information from Google. Retrieved from Google Support: https://support.google.com/webmasters/answer/6332384?hl=en

General Services Administration (GSA). (2020). GSA rules of behavior for handling Personally Identifiable Information (PII) (CIO 2180.2). Retrieved from General Services Administration: https://www.gsa.gov/directive/gsa-rules-of-behavior-for-handling-personally-identifiable-information-(pii)-

Guide to protecting the confidentiality of Personally Identifiable Information (PII) (SP 800-122). Information Technology Laboratory. (2010). Retrieved from National Institute of Standards and Technology: https://csrc.nist.gov/publications/detail/sp/800-122/final

Guerrini, C., Robinson, J., Petersen, D., & McGuire, A. (2018). Should police have access to genetic genealogy databases? Capturing the Golden State Killer and other criminals using a controversial new forensic technique. PLOS Biology, 16(10), e2006906. doi:10.1371/journal.pbio.2006906

Han, J., Kamber, M., & Pei, J. (2012). Data mining: Concepts and techniques. ISBN: 0123814804, 9780123814807

Hassan, N. (2019). Gathering evidence from OSINT sources. Digital Forensics Basics, 311-322. doi:10.1007/978-1-4842-3838-7_10

HIPAA Privacy Rule, 45 C.F.R. § 160 (2002)

HIPAA Security Rule, 45 C.F.R. § 160 (2002)

Jackson, J. (2017). Content marketing: Powerful tips and tricks for success in business. Retrieved from CreateSpace: https://dl.acm.org/doi/book/10.5555/3159123

Kosinski, M., Wang, Y., Lakkaraju, H., & Leskovec, J. (2016). Mining big data to extract patterns and predict real-life outcomes. Psychological Methods, 21(4), 493-506. doi:10.1037/met0000105

Kumar, V., & Reinartz, W. (2018). Data mining. Springer Texts in Business And Economics, 135-155. doi:10.1007/978-3-662-55381-7_7

Lang, M. (2017). FamilyTreeNow shreds privacy, but founder remains elusive. Retrieved from San Francisco Chronicle: https://www.sfchronicle.com/business/article/New-FamilyTreeNow-website-shreds-our-privacy-10910345.php

McCallister, E., Grance, T., & Scarfone, K. (2020). Guide to protecting the confidentiality of Personally Identifiable Information (PII). Retrieved from National Institute of Standards and Technology: https://www.dla.mil/Portals/104/Documents/GeneralCounsel/FOIA/Privacy/NIST%20800-122%20Guide%20to%20Protecting%20Confidentiality%20of%20PII.pdf

Miller, R. (2017). The future of data integration. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining – KDD '17. doi:10.1145/3097983.3105809

NIST Cybersecurity Framework. (2018). Retrieved from National Institute of Standards and Technology (NIST): https://www.nist.gov/cyberframework/framework

Nix, E. (2018). The world's first web site. Retrieved from History: https://www.history.com/news/the-worlds-first-web-site

NJ Business Magazine. Survey reveals Americans are more afraid of identity theft than murder. Retrieved from NJ Business Magazine: https://njbmagazine.com/njb-news-now/survey-reveals-americans-are-more-afraid-of-identity-theft-than-murder/

4iQ. (2019). Identity protection & data breach survey. Retrieved from 4iQ: https://4iq.com/identity-protection-data-breach-survey

Iguchi, M., Uematsu, T., & Fujii, T. (2018). The anatomy of the HIPAA Privacy Rule: A risk-based approach as a remedy for privacy-preserving data sharing. Advances in Information and Computer Security, 174-189. doi:10.1007/978-3-319-97916-8_12

Internet World Statistics. (2020). Internet growth statistics. Retrieved from Internet World Statistics: https://www.internetworldstats.com/stats.htm.

International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC). (2020). Retrieved from ISO/IEC: https://www.iso.org/isoiec-27001-information-security.html

Privacy Act, Office of Privacy and Open Government, U.S. Department of Commerce. (2016). Retrieved from U.S. Department of Commerce: https://www.osec.doc.gov/opog/privacyact/privacyact.html

Phillips, C. (2018). The golden state killer investigation and the nascent field of forensic genealogy. Forensic Science International: Genetics, 36, 186-188. doi:10.1016/j.fsigen.2018.07.010

Reynolds, C. (2020). Enterprise encryption use below 30% despite data breaches. Computer Business Review. Retrieved from https://www.cbronline.com/news/enterprise-encryption-use

Salim, S. (2020). Here are the world's 20 most visited websites in July 2019, according to Alexa ranking. Retrieved from Digital Information World: https://www.digitalinformationworld.com/2019/07/here-are-worlds-20-most-visited-website.html

Shahani, A. (2018). Facebook admits data-mining firm got access to millions of users' personal information. NPR. Retrieved 29 March 2020, from https://www.npr.org/2018/03/19/595018770/facebook-admits-data-mining-firm-got-access-to-millions-of-users-personal-inform

Sobers, R. (2019). 110 must-know cybersecurity statistics for 2020. Varonis. Retrieved from https://www.varonis.com/blog/cybersecurity-statistics

Solove, D. (2016). A brief history of information privacy law. George Washington University Law School, (215).

Statista. (2020). Search engine market share worldwide 2019. Retrieved from Statista: https://www.statista.com/statistics/216573/wordwide-market-share-of-search-engines/

Stagner, L., Ryan, D., Aranda, S., & Turner, M. (2020). Explorations in data mining (1st ed.). Construction Management Association of America. Retrieved from https://www.cmaanet.org/sites/default/files/2018-04/Explorations%20in%20Data%20Minng.pdf

Sutar, S. (2017). Intelligent data mining technique of social media for improving health care. 2017 International Conference on Intelligent Computing and Control Systems (ICICCS). IEEE. doi:10.1109/iccons.2017.8250690

Torra, V., & Navarro-Arribas, G. (2016). Big data privacy and anonymization. Privacy and Identity Management. Facing up to Next Steps, 15-26. doi:10.1007/978-3-319-55783-0_2

Turow, J., Hennessy, M., & Draper, N. (2018). Persistent misperceptions: Americans’ misplaced confidence in privacy policies, 2003–2015. Journal of Broadcasting & Electronic Media, 62(3), 461-478. doi: 10.1080/08838151.2018.1451867

U.S. Department of Energy. (2020). PII. Retrieved from U.S. Department of Energy, Office of Scientific and Technical Information Program: https://www.osti.gov/stip/pii

Verizon. (2019). 2019 Data Breach Investigations Report. Verizon. Retrieved from https://enterprise.verizon.com/en-gb/resources/reports/dbir

Vimala Roselin, J., & Nasira, G. M. (2020). Secure sensitive data sharing on big data platform. International Journal of Innovative Research in Advanced Engineering (IJIRAE), 5(3). Retrieved from http://www.journalcra.com/article/sensitive-data-protection-big-data-using-encryption-algorithm

Waddell, K. (2017). How FamilyTreeNow makes stalking easy. The Atlantic. Retrieved from https://www.theatlantic.com/technology/archive/2017/01/the-webs-many-search-engines-for-your-personal-information/513323

Willson, M., & Leaver, T. (2015). Zynga's FarmVille, social games, and the ethics of big data mining. Communication Research and Practice, 1(2), 147-158. doi:10.1080/22041451.2015.1048039

Yershov v. Gannett Satellite Information Network, Inc., d/b/a USA Today (2016)

Zhang, D. (2018). Big data security and privacy protection. Proceedings of the 8th International Conference on Management and Computer Science (ICMCS 2018). doi:10.2991/icmcs-18.2018.56

Zhang, Q. (2018). Information system security situation assessment based on data mining. International Journal of Science, 5(8), 122-128. Retrieved from http://www.ijscience.org/download/IJS-5-8-122-128.pdf

Author profile
Cybersecurity Engineer

Jordan is a Cybersecurity Engineer who has consulted for organizations in numerous sectors within the United States, including finance, education, manufacturing, and the public sector.