A Closer Look at Data Brokers’ Sources of Data

One of my more popular blog posts is “A Look at the Different Types of Data Brokers.” As a complement to that blog post, I am posting an excerpt from my book Containing Big Tech (which you can pre-order today!) that goes into detail on the sources of data that feed into data brokers’ massive databases of our personal information. I hope you enjoy this excerpt, which is very topical given the proposed California Delete Act (SB 362) and the proposed federal DELETE Act.

Where Does the Data Come From?

The entire data broker industry is built on a lack of transparency. It starts with the fact that users of websites, mobile apps, and internet-connected devices and services are mainly unaware of how their data is collected, how it is used, and whether it is subsequently sold or shared with data brokers.[1]

Furthermore, although data brokers collect information about consumers from government, commercial, and publicly available sources, as well as from web and mobile tracking, they also use other data brokers as significant sources of data. Per the Federal Trade Commission (FTC), the result is that it is “virtually impossible” to trace how a data broker obtained your data.[2]

And finally, as the FTC has further noted, this lack of transparency extends to the fact that consumers are fairly oblivious to the existence of data brokers as they don’t directly interact with them. And even if you identify a data broker who may have your data, they are generally reluctant to reveal their data sources. But let’s crack open the window a little and look at their data sources.[3]

Government Sources

Data brokers collect a lot of government-generated data. For instance, at the federal level, the US Census provides demographics at the city block level, including age, ethnicity, income, education level, and occupations. The Social Security Administration provides its Death Master File, which lists names and dates of death. The US Postal Service produces data regarding address changes. And there is also data from federal bankruptcy proceedings. All of these are consumed by data brokers.[4]

There is also a lot of data available from state and local governments, including property records, court filings, criminal convictions, professional (e.g., hairdresser) and recreational (e.g., fishing) licenses, marriage licenses, divorce records, birth certificates, bankruptcy records, voter registration information, and vehicle registration records. And some of this is directly purchased from government agencies. For example, Vice reported in 2019 that various state departments of motor vehicles were selling names, addresses, and car registration information to credit reporting agencies and firms engaged in background checks.[5]

To give you a feel for the vast amount of data collected from government sources, LexisNexis claims to hold data from 1.5 billion bankruptcy records, 6.5 billion personal property records, and 6.6 billion motor vehicle registrations. It also says it has data from over 37 billion public records involving criminal investigations. Another data broker, CoreLogic, says it has data on “more than 99.99% of all properties in the United States,” including over 1 billion property records.[6]

Commercial Sources

Data brokers’ commercial data sources include purchase history, warranty registration, credit information, and loyalty card data. For example, one data broker, Nielsen, claims it collects 80% of all credit card transactions, 30% of all debit card transactions, purchase history across over 90 million households, and purchase data from 18,000 retailers. Equifax advertises it has paystub data on more than half the US workforce. In 2016, Spotify announced a deal with advertising and marketing service company WPP that gave WPP access to the “unique listening preferences and behaviors” of Spotify users, thereby letting advertisers get a better sense of consumers’ moods and activities. Datalogix, a data broker firm that Oracle acquired, claimed that it collected data representing over $1 trillion in consumer spending.[7]

Publicly Available Sources

Data brokers collect from publicly available social media profiles, forum posts, media reports, business listings, and telephone books. Some of this data is only available in hard-copy format, and data brokers will scan these records into a digital format for uploading. But in most cases, data brokers utilize “web crawlers” or “scraper bots” that parse through web pages to extract data. Think of them as a search engine’s web crawler but solely focused on the hunt for your personal information, with a significant focus on public profiles and postings from social media sites. For example, it was reported in 2017 that data broker Oracle aggregated 3 billion profiles from 15 million websites while collecting over 700 million messages from social media networks, blogs, and consumer review sites daily.[8]

Web Tracking

The most significant source of personal data that data brokers collect is online tracking. Nearly all websites, mobile apps, and IoT device providers actively share behavioral data with other companies. According to the Electronic Frontier Foundation (EFF), the average web page shares data with “dozens” of third parties, as does the average mobile app (including location data when the app is not even in use), and this sharing is “largely invisible” to the average consumer.[9]

On websites, tracking is facilitated through cookies. A cookie is a small file that stores user data that identifies a specific visitor to a domain (i.e., website) when you visit that website from a browser. “First-party” cookies are tied to a particular domain and will remember your past activity on that site. For example, they may record that you had an item in a shopping cart, whether you had previously logged in (so you don’t have to log in again), and other settings that personalize the experience on the website. In effect, first-party cookies can help improve your direct relationship with the business’s website.[10]
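
To make the first-party case concrete, here is a minimal sketch using Python's standard `http.cookies` module. The domain `shop.example`, the cookie name `cart_id`, and both helper functions are hypothetical and chosen only for illustration; real sites set many more attributes.

```python
from http.cookies import SimpleCookie


def build_set_cookie(cart_id):
    """Build the value of a Set-Cookie header a (hypothetical) shop might send."""
    cookie = SimpleCookie()
    cookie["cart_id"] = cart_id
    cookie["cart_id"]["domain"] = "shop.example"   # only sent back to this site
    cookie["cart_id"]["path"] = "/"
    cookie["cart_id"]["max-age"] = 7 * 24 * 3600   # remember the cart for a week
    return cookie.output(header="").strip()


def read_cart_id(cookie_header):
    """Parse the Cookie header the browser sends back on a later visit."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("cart_id")
    return morsel.value if morsel is not None else None


set_cookie_value = build_set_cookie("cart-42")
returning_cart = read_cart_id("cart_id=cart-42; theme=dark")
print(set_cookie_value)
print(returning_cart)
```

Because the cookie is scoped to the site's own domain, it is only returned to that site, which is what makes it "first-party."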

But in many cases, businesses also configure their websites to have “third-party cookies.” Third-party cookies are used by online ad networks that act as aggregators of ad inventory from website publishers. Ad networks facilitate the sale of this inventory to advertisers. To enable this, ad networks provide software code that publishers embed into their websites to facilitate ad campaign delivery and tracking. So, each time a user visits a website with this code, a cookie is downloaded that tracks the user’s online activity and behavior across other websites that also have this code. That is why, for example, you can be browsing for red shoes on website A, and then later that day, you start seeing an ad for the same shoes on website B — an online advertising tactic known as “retargeting.” [11]
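
The retargeting scenario can be sketched as a toy simulation. This is an illustration under simplifying assumptions, not how any particular ad network is implemented: the `AdNetwork` class, the `uid` cookie name, and the example domains are all hypothetical, and a plain dictionary stands in for the browser's cookie jar for the ad network's domain.

```python
import uuid
from collections import defaultdict


class AdNetwork:
    """Toy ad network whose embedded code runs on every participating site."""

    def __init__(self):
        # cookie ID -> list of (site, page) visits observed via the embedded code
        self.history = defaultdict(list)

    def serve_tag(self, cookie_jar, site, page):
        """Called whenever a page containing the network's code is loaded."""
        cookie_id = cookie_jar.get("uid")
        if cookie_id is None:
            # First sighting of this browser: set a third-party cookie.
            cookie_id = uuid.uuid4().hex
            cookie_jar["uid"] = cookie_id
        self.history[cookie_id].append((site, page))
        return cookie_id

    def retarget(self, cookie_id, current_site):
        """Return the most recent page this browser viewed on another site."""
        for site, page in reversed(self.history[cookie_id]):
            if site != current_site:
                return page
        return None


network = AdNetwork()
jar = {}  # the same browser visits both sites
network.serve_tag(jar, "website-a.example", "/red-shoes")
uid = network.serve_tag(jar, "website-b.example", "/news")
ad = network.retarget(uid, "website-b.example")
print(ad)  # the red shoes page seen on website A
```

The key point is that the same cookie ID comes back from both sites, so the network can join activity that the two publishers never shared with each other.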

In addition, a website publisher typically works with multiple ad networks, so multiple third-party cookies are downloaded when a user visits a given website. For example, if you visit The Huffington Post website, it will install over 20 tracking cookies, including those from Google, Meta, and Amazon.

So over time, enough data is collected from online activity — e.g., entering addresses and phone numbers in online forms, what is purchased, etc. — that the cookies often can be linked to real people. Moreover, this online activity can be sold to or shared with data brokers.

Ad networks also participate in ad exchanges (such as Google’s DoubleClick, the dominant player in this market) that further spread access to third-party cookies and the corresponding user activity. An ad exchange is a cloud service that facilitates the buying and selling of online advertising inventory from multiple ad networks. Pricing is set through real-time bidding on the chance to serve an ad to a given user, and the user is identified by the user’s cookie. The ad exchanges share the third-party cookies with advertisers as part of the bidding process. The advertisers, in turn, leverage data management platforms, which tie in data from data brokers to better link a cookie to an actual user, to make a better-informed bid.[12]
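
A rough sketch of that bidding step follows. It assumes a simple highest-bid-wins auction, which is a simplification (real exchanges use more elaborate auction rules), and every name in it (`Bidder`, `dmp_profiles`, `run_auction`, the cookie ID `abc123`) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Bidder:
    """An advertiser that bids more when its data platform recognizes the cookie."""
    name: str
    dmp_profiles: dict      # cookie ID -> set of interests (stand-in for DMP data)
    target_interest: str
    base_bid: float         # bid for an unknown user
    matched_bid: float      # bid when the cookie matches the target interest

    def bid(self, cookie_id):
        interests = self.dmp_profiles.get(cookie_id, set())
        if self.target_interest in interests:
            return self.matched_bid
        return self.base_bid


def run_auction(cookie_id, bidders):
    """The exchange shares the cookie ID with each bidder and picks the top bid."""
    bids = {b.name: b.bid(cookie_id) for b in bidders}
    winner = max(bids, key=bids.get)
    return winner, bids[winner]


shoe_store = Bidder("shoe_store", {"abc123": {"red shoes"}}, "red shoes", 0.10, 2.50)
car_dealer = Bidder("car_dealer", {}, "suv", 0.20, 3.00)
winner, price = run_auction("abc123", [shoe_store, car_dealer])
print(winner, price)
```

The shoe store outbids the car dealer here only because its profile data links the cookie to a known interest, which is why advertisers value richer data about each cookie.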

Mobile Tracking

Unsurprisingly, tracking also occurs with mobile apps. Instead of cookies, the identifier is a Mobile Advertising ID (MAID), which Apple or Google (depending on whether you have an iPhone or an Android phone) creates and assigns to each phone. And the library of code provided by third parties, such as ad networks, takes the form of Software Development Kits (SDKs) embedded into the mobile app. So, whenever a user opens and uses an app that includes the SDK, the mobile app makes requests to the third party’s servers. On the web, browsers can distinguish between first-party and third-party cookies (and you can configure a browser to block third-party cookies). In mobile apps, by contrast, if you grant permissions to the app, such as access to your camera or location, then any third-party code embedded in the app gets those same permissions. And because the request back to the third party’s servers contains the Advertising ID, the user can now be profiled across multiple apps.[13]
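
The cross-app profiling described above can be sketched in a few lines. This is a toy model of the third party's server side, assuming the same SDK is embedded in two unrelated apps; the function `sdk_report`, the app names, and the MAID value are all made up for illustration.

```python
from collections import defaultdict

# Server-side store kept by the (hypothetical) SDK vendor, keyed by advertising ID.
profiles = defaultdict(lambda: {"apps": set(), "events": []})


def sdk_report(maid, app_name, event):
    """Stand-in for the request an embedded SDK sends to the third party's servers."""
    profile = profiles[maid]
    profile["apps"].add(app_name)
    profile["events"].append((app_name, event))
    return profile


# One phone, one advertising ID, two unrelated apps that embed the same SDK.
maid = "00000000-aaaa-bbbb-cccc-000000000001"  # made-up identifier
sdk_report(maid, "weather_app", {"location": (40.0, -75.0)})
sdk_report(maid, "fitness_app", {"workout": "run"})

print(sorted(profiles[maid]["apps"]))
```

Neither app shares data with the other, yet the SDK vendor ends up with a single profile covering both because each request carries the same advertising ID.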

Mobile apps can transmit sensitive personal information such as health status. For example, researchers discovered that healthcare apps such as the Drugs.com Medication Guide sent data to over 100 outside entities, including device identifiers and queried terms such as “herpes,” “HIV,” “Adderall,” “diabetes,” and “pregnancy.” Unfortunately, consumers have little recourse as HIPAA does not apply to mobile apps unaffiliated with your doctor, hospital, or insurance carrier.[14]

In addition, mobile apps often send their location data to backend servers on the internet. This can be used to pinpoint people’s precise movements. Not surprisingly, location data is of great interest to advertisers. For example, according to The Markup, Burger King ran “a promotion in which, if a customer’s phone was within 600 feet of a McDonalds, the Burger King app would let the user buy a Whopper for one cent.” Another use case of location data is measuring and analyzing foot traffic to specific stores or buildings. Some developers of mobile apps add these SDKs not to advertise inside their apps but to send users’ location data to data brokers in exchange for payments for the data.[15]
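
A proximity check like the one in the Burger King example can be approximated with the haversine great-circle formula. The sketch below is only an illustration of the idea, not the method any company actually used; the function names and the coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000   # mean Earth radius in meters
FEET_PER_METER = 3.28084


def distance_feet(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    meters = 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))
    return meters * FEET_PER_METER


def inside_geofence(phone, store, radius_feet=600):
    """True if the phone's reported location is within radius_feet of the store."""
    return distance_feet(*phone, *store) <= radius_feet


store = (40.0, -75.0)            # made-up store location (latitude, longitude)
nearby_phone = (40.001, -75.0)   # roughly 365 feet north of the store
distant_phone = (40.01, -75.0)   # well over half a mile away

print(inside_geofence(nearby_phone, store))
print(inside_geofence(distant_phone, store))
```

The same distance test, run against a stream of location reports, is also the basis for counting foot traffic to a specific store or building.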

Summary

In summary, Big Tech’s advertising ecosystem enables data brokers to collect online activity from both web and mobile, and that activity in turn serves as a data source for online advertising. But one significant ray of sunshine is that Apple’s App Tracking Transparency (ATT) has thrown a monkey wrench into this data collection on the iPhone, with 75% of Apple iOS users opting out of third-party tracking. This has not only reduced the amount of data that data brokers collect but also blocked trackers from Big Tech firms such as Meta. To work around ATT, it has been reported that data brokers are increasingly collecting location data directly from mobile app developers to avoid the “digital footprint” of relying on an SDK that Apple would detect when reviewing an app submission.[16]

If you like this content (which is a few pages from my chapter on Data Brokers), a reminder that you can pre-order my book Containing Big Tech today!



[1] NATO StratCom COE, “Data Brokers and Security,” 2021, https://stratcomcoe.org/cuploads/pfiles/data_brokers_and_security_20-01-2020.pdf.

[2] Federal Trade Commission, “Data Brokers: A Call for Transparency and Accountability,” 2014, https://www.ftc.gov/system/files/documents/reports/data-brokers-call-transparency-accountability-report-federal-trade-commission-may-2014/140527databrokerreport.pdf.

[3] Ibid.

[4] Ibid.

[5] NATO StratCom COE, “Data Brokers and Security,” 2021. Federal Trade Commission, “Data Brokers: A Call for Transparency and Accountability,” 2014. Joseph Cox, “The California DMV Is Making $50M a Year Selling Drivers’ Personal Information,” Vice, November 25, 2019, https://www.vice.com/en/article/evjekz/the-california-dmv-is-making-dollar50m-a-year-selling-drivers-personal-information. Atlas Privacy, “Does Starbucks Know If I Wet the Bed,” February 9, 2022, https://atlasprivacy.medium.com/does-starbucks-know-if-i-wet-the-bed-37a7d9a9487f.

[6] Justin Sherman, “Data Brokers and Sensitive Data on US Individuals,” Duke University, Sanford Cyber Policy Program, 2021.

[7] Justin Sherman, “Data Brokers and Sensitive Data on US Individuals,” Duke University, Sanford Cyber Policy Program, 2021, https://techpolicy.sanford.duke.edu/wp-content/uploads/sites/4/2021/08/Data-Brokers-and-Sensitive-Data-on-US-Individuals-Sherman-2021.pdf. Chris Chmura, “Your Pay Stub is Probably in the Cloud,” NBC Bay Area, May 6, 2022, https://www.nbcbayarea.com/investigations/consumer/your-pay-stub-is-probably-in-the-cloud-silicon-valley-startup-recommends-a-vault-instead/2883933/. WPP, “WPP's Data Alliance and Spotify announce global data partnership,” November 15, 2016, https://www.prnewswire.com/news-releases/wpps-data-alliance-and-spotify-announce-global-data-partnership-300362733.html. Lois Beckett, “Everything We Know About What Data Brokers Know About You,” ProPublica, June 13, 2014, https://www.propublica.org/article/everything-we-know-about-what-data-brokers-know-about-you.

[8] Federal Trade Commission, “Data Brokers: A Call for Transparency and Accountability,” 2014. Robots.net, “How Data Brokers Profit from Your Data,” February 22, 2022, https://robots.net/fintech/general-fintech/how-data-brokers-profit-from-your-data-a-guide/. Wolfie Christl, “Corporate Surveillance in Everyday Life,” Cracked Labs, June 2017, https://crackedlabs.org/en/corporate-surveillance.

[9] Wolfie Christl, “Corporate Surveillance in Everyday Life,” Cracked Labs, June 2017. Bennett Cyphers and Gennie Gebhart, “Behind the One Way Mirror: A Deep Dive into the Technology of Corporate Surveillance,” Electronic Frontier Foundation (EFF), December 2, 2019, https://www.eff.org/wp/behind-the-one-way-mirror.

[10] Molly McGuane, “First-Party Cookies vs. Third-Party Cookies (Biggest Differences),” Terakeet, February 4, 2021, https://terakeet.com/blog/first-party-cookies-vs-third-party-cookies/.

[11] Ibid.

[12] Bennett Cyphers and Gennie Gebhart, “Behind the One Way Mirror: A Deep Dive into the Technology of Corporate Surveillance,” Electronic Frontier Foundation (EFF), December 2, 2019.

[13] Ibid.

[14] Tatum Hunter and Jeremy Merrill, "Health apps share your concerns with advertisers. HIPAA can't stop it," Washington Post, September 22, 2022, https://www.washingtonpost.com/technology/2022/09/22/health-apps-privacy/.

[15] Jon Keegan and Alfred Ng, “There’s a Multibillion-Dollar Market for Your Phone’s Location Data,” The Markup, September 30, 2021. Joseph Cox, “Data Broker Is Selling Location Data of People Who Visit Abortion Clinics,” Vice, May 2, 2022.

[16] Jon Keegan and Alfred Ng, “Who is Policing the Location Data Industry?” The Markup, February 24, 2022, https://themarkup.org/ask-the-markup/2022/02/24/who-is-policing-the-location-data-industry.
