Ruthy Hope Slatis couldn’t believe what she was hearing. She’d been hired by a temp agency outside Boston for a vague job: transcribing audio files for Amazon.com Inc. For $12 an hour, she and her fellow contractors, or “data associates,” listened to snippets of random conversations and jotted down every word on their laptops. Amazon would only say the work was critical to a top-secret speech-recognition product. The clips included recordings of intimate moments inside people’s homes.
This was in fall 2014, right around the time Amazon unveiled the Echo speaker featuring Alexa, its voice-activated virtual-assistant software. Amazon pitched Alexa as a miracle of artificial intelligence in its first Echo ad, in which a family asked for and received news updates, answers to trivia questions, and help with the kids’ homework. But Slatis soon began to grasp the extent to which humans were behind the robotic magic she saw in the commercial. “Oh my God, that’s what I’m working on,” she remembers thinking. Amazon was capturing every voice command in the cloud and relying on data associates like her to train the system. Slatis first figured she’d been listening to paid testers who’d volunteered their vocal patterns in exchange for a few bucks. She realized that couldn’t be.
The recordings she and her co-workers were listening to were often intense, awkward, or intensely awkward. Lonely-sounding people confessing intimate secrets and fears: a boy expressing a desire to rape; men hitting on Alexa like a crude version of Joaquin Phoenix in Her. And as the transcription program grew along with Alexa’s popularity, so did the private information revealed in the recordings. Other contractors recall hearing kids share their home address and phone number, a man trying to order sex toys, a dinner party guest wondering aloud whether Amazon was snooping on them at that very instant. “There’s no frickin’ way they knew they were being listened to,” Slatis says. “These people didn’t agree to this.” She quit in 2016.
In the years since Slatis first felt her skin crawl, a quarter of Americans have bought “smart speaker” devices such as the Echo, Google Home (now Nest), and Apple HomePod. (A relative few even bought Facebook’s Portal, an adjacent smart video screen.) Amazon is winning the sales battle so far, reporting that more than 100 million Alexa devices have been purchased. But now a war is playing out between the world’s biggest companies to weave Alexa, Apple’s Siri, Alphabet’s Google Assistant, Microsoft’s Cortana, and Facebook’s equivalent service much deeper into people’s lives. Mics are built into phones, smartwatches, TVs, fridges, SUVs, and everything in between. Consulting firm Juniper Research Ltd. estimated that by 2023 the global annual market for smart speakers will have reached $11 billion, and there will be about 7.4 billion voice-controlled devices in the wild. That’s about one for every person on Earth.
The question is, then what? These machines are not creating audio files of your every decibel—tech companies say their smart speakers record audio only when users activate them—but they are introducing always-on mics to kitchens and bedrooms, which could inadvertently capture sounds users never intended to share. “Having microphones that listen all the time is concerning. We’ve found that users of these devices close their eyes and trust that companies are not going to do anything bad with their recorded data,” says Florian Schaub, a University of Michigan professor who studies human behavior around voice-command software. “There’s this creeping erosion of privacy that just keeps going and going. People don’t know how to protect themselves.”
Amazon declined interview requests for this story. In an emailed statement, a spokeswoman wrote, “Privacy is foundational to how every team and employee designs and develops Alexa features and Echo devices. All Alexa employees are trained on customer data handling as part of our security training.” The company and its competitors have said computers perform the vast majority of voice requests without human review.
Yet so-called smart devices inarguably depend on thousands of low-paid humans who annotate sound snippets so tech companies can upgrade their electronic ears; our faintest whispers have become one of their most valuable datasets. In 2019, Bloomberg News was first to report on the scope of the technology industry’s use of humans to review audio collected from their users without disclosures, including at Apple, Amazon, and Facebook. Few executives and engineers who spoke with Bloomberg Businessweek for this story say they anticipated that setting up vast networks of human listeners would be problematic or intrusive. To them, it was and is simply an obvious way to improve their products.
Current and former contractors such as Slatis make clear that the downsides of pervasive audio surveillance were obvious to those with much less financial upside at stake. “It never felt right,” says a voice transcriber for an Alexa rival who, like most of the contractors, signed a nondisclosure agreement and spoke on condition of anonymity for fear of reprisals. “What are they really selling to customers?”
Nerds have imagined voice commands to be the future of computing for more than a half-century. (Thank Star Trek.) But for most of that time, teaching machines to identify and respond to spoken sentences required matching audio files verbatim to transcribed text, a slow and expensive process. Early pioneers bought or built massive libraries of recordings—people reading newspapers or other prewritten material into mics. The Sisyphean nature of the projects eventually became an industry joke. In the 1990s, a former product manager on the speech team at Apple Inc. recalls, it offered each volunteer willing to record voice patterns at their lab a T-shirt emblazoned with the phrase “I Helped Apple Wreck a Nice Beach,” a computer’s garble of “recognize speech.”
Apple, which declined to comment for this story, became the first major company to flip the model in 2011, when it shipped the iPhone 4S with Siri, acquired the year before from a Pentagon-funded research spinoff. No longer did recordings have to be scripted and amassed in labs. Apple sold more than 4 million 4S phones within days, and soon began piling up an incalculable mountain of free, natural voice data. For the first few years, the company largely trusted outside speech-software specialists to use the data to improve Siri’s abilities, but Apple retook control around 2014. “The work was very tedious: After listening for 15 or 30 minutes, you’d get headaches,” Tao Ma, a former senior Siri speech scientist, says of transcribing user recordings. The in-house team farmed out much of this work to IT contractors in Europe, including Ireland-based GlobeTech.
Over the past few years, Apple has grown more aggressive in its harvesting and analysis of people’s voices, worried that Siri’s comprehension and speed were falling behind those of Alexa and Google Assistant. Apple treated Siri’s development like a verbal search engine that it had to prep to fulfill endless user queries and ramped up its dependence on audio analysis to feed the assistant’s lexicon. Temps were expected to account for the clips’ various languages, dialects, and cultural idiosyncrasies.
Former contractors describe the system as something out of the Tower of Babel or George Orwell’s 1984. At a GlobeTech office near an airport in Cork, Ireland, some say, they sat in silence at MacBooks wearing headphones, tasked with transcribing 1,300 clips a day, each of which could be a single sentence or an entire conversation. (This quota was reduced from as many as 2,500 clips, others say, to improve accuracy rates.) When a contractor clicked play on a voice recording, the computer filled a text box with the words it thought Siri “heard,” then prompted the worker to approve or correct the translation and move on. GlobeTech didn’t respond to requests for comment.
A program the workers used, called CrowdCollect, included buttons to skip recordings for a variety of reasons—accidental trigger, missing audio, wrong language—but contractors say there was no specific mechanism to report or delete offensive or inappropriate audio, such as drunk-sounding users slurring demands into the mics or people dictating sexts. Contractors who asked managers whether they could skip overly private clips were told no clips were too private. They were expected to transcribe anything that came in. Contractors often lasted only a couple of months, and training on privacy issues was minimal. One former contractor who had no qualms about the work says listening in on real-world users was “absolutely hilarious.”
In 2015, the same year Apple Chief Executive Officer Tim Cook called privacy a “fundamental human right,” Apple’s machines were processing more than a billion requests a week. By then, users could turn on a feature so they no longer had to push a button on the iPhone to activate the voice assistant; it was always listening. Deep in its user agreement legalese, Apple said voice data might be recorded and analyzed to improve Siri, but nowhere did it mention that fellow humans might listen. “I felt extremely uncomfortable overhearing people,” says one of the former contractors, especially given how often the recordings were of children.
Ten former Apple executives in the Siri division say they didn’t and still don’t see this system as a violation of privacy. These former executives say recordings were disassociated from Apple user IDs, and they assumed users understood the company was processing their audio clips, so what did it matter if humans helped with the processing? “We felt emotionally safe, that this was the right thing to do,” says John Burkey, who worked in Siri’s advanced development group until 2016. “It wasn’t spying. It was, ‘This [Siri request] doesn’t work. Let’s fix it.’ It’s the same as when an app crashes and asks if you want to send the report to Apple. This is just a voice bug.”
The difference between this system and a bug on a MacBook, of course, is that MacOS clearly asks users if they’d like to submit a report directly after a program crashes. It’s an opt-in prompt for each malfunction, as opposed to Siri’s blanket consent. Current and former contractors say most Siri requests are banal—“play a Justin Bieber song,” “where’s the nearest McDonald’s”—but they also recall hearing extremely graphic messages and lengthy racist or homophobic rants. A former data analyst who worked on Siri transcriptions for several years says workers in Cork swapped horror stories during smoke breaks. A current analyst, asked to recount the most outrageous clip to come through CrowdCollect, says it was akin to a scene from Fifty Shades of Grey.
Apple has said less than 0.2% of Siri requests undergo human analysis, and former managers dismiss the contractors’ accounts as overemphases on mere rounding errors. “ ‘Oh, I heard someone having sex’ or whatever. You also hear people farting and sneezing—there’s all kind of noise out there when you turn a microphone on,” says Tom Gruber, a Siri co-founder who led its advanced development group through 2018. “It’s not like the machine has an intention to record people making certain kinds of sounds. It’s like a statistical fluke.”
By 2019, after Apple brought Siri to products such as its wireless headphones and HomePod speaker, it was processing 15 billion voice commands a month; 0.2% of 15 billion is still 30 million potential flukes a month, or 360 million a year. The risks of inadvertent recording grew along with the use cases, says Mike Bastian, a former principal research scientist on the Siri team who left Apple in 2019. He cites the Apple Watch’s “raise to speak” feature, which automatically activates Siri when it detects a wearer’s wrist being lifted, as especially dicey. “There was a high false positive rate,” he says.
In the smart speaker business, Apple’s HomePod is estimated to account for only 5% of the U.S. market. Amazon has an estimated 70%. In 2011 CEO and massive Star Trek fan Jeff Bezos ordered a team that showed him an early voice-controlled music app to build the software into a hardware product. They produced the Echo, with its seven microphones constantly listening for a “wake word” that will trigger a fresh recording. Each clip, as with Apple’s, goes to the company’s servers, where a portion of them are then routed to one of hundreds of data associates for review.
Bezos and David Limp, Amazon’s senior vice president for devices, weren’t blind to the creep factor. They made design choices aimed at keeping Echo users from freaking out about being recorded, says an early Alexa product manager. When a user says “Alexa,” a ring of light appears around the Echo, as though the assistant were coming to life. A dedicated “personality team” scripted jokey answers to hundreds of frequently asked questions. And developers created an online portal where users could play and delete their audio clips. An Amazon spokeswoman says privacy standards were built into Alexa from the start.
The fine print grants Amazon the right to retain and experiment on its voice clips far beyond what Apple does with Siri. By default, the company retains recordings indefinitely. Amazon discloses few specifics on how this data is used, except to say its human transcriptions have proved an enormous advantage in translating Alexa into new languages around the world and expanding its response capabilities.
In 2016, Amazon created the Frequent Utterance Database, or FUD, to help Alexa add answers to common requests. Former employees who worked with FUD say there was tension between product teams eager to mine the data more aggressively and the security team charged with protecting user info, such as phone numbers that could easily identify a given customer. In 2017, Amazon introduced the camera-equipped Echo Look, which was pitched as an AI stylist that could recommend outfit pairings. Its developers considered programming the camera to switch on automatically when a user asked Alexa to make a joke, say people familiar with the matter. The idea was to record a video of the user’s face and assess whether she was laughing. Amazon ultimately shelved the idea, these people say. Amazon says Alexa doesn’t use facial recognition technology today.
The company has set up transcription farms in cities around the world, from Bucharest to Chennai. Several times, it’s held walk-in recruiting events for transcribers overseas. A speech technologist who’s spent decades developing recognition systems for tech companies says the scale of Amazon’s audio data analysis as outlined in a recruiting effort was terrifying. Amazon says it takes the “security of customers and their voice recordings seriously,” and that it needs a complete understanding of regional accents and colloquialisms to make Alexa global.
In August 2019, Microsoft acknowledged that humans help review voice data generated through its speech-recognition technology—in products including its Cortana assistant and Skype messaging app—which businesses such as BMW, HP Inc., and Humana are integrating into their own products and services. Chinese tech companies including marketplace Alibaba, search giant Baidu, and phone maker Xiaomi are churning out millions of smart speakers each quarter. Industry analysts say Google and Facebook Inc. are likewise betting audio data will greatly enhance their mammoth ad businesses. Internet browsing tells these companies a tremendous amount about people, but audio recordings could make it much easier for AI to approximate ages, genders, emotions, and even locations and interests, says Schaub, the University of Michigan professor. “People often don’t realize what their voice commands reveal,” he says. “If you’re asking about football a lot, you’re likely an NFL fan. If a baby is crying in the background, they can infer you have a family.”
Google Assistant feeds its namesake search engine with queries from a billion devices it’s available on, including Android smartphones and tablets, Nest thermostats, and Sony TVs. Google, which has hired temp workers overseas to transcribe clips to improve the system’s accuracy, has promised that reviewed voice recordings aren’t linked to any personal information. But this summer a Google contractor shared more than 1,000 user recordings with Belgian broadcaster VRT NWS. The outlet was able to figure out who some of the people in the recordings were based on things they said, to the shock of those identified. Roughly 10% of the leaked clips were also recorded without these users’ consent, because of devices erroneously detecting the activation phrase “OK, Google.”
A Google spokeswoman says, “Since hearing concerns, we have been committed to pausing this human transcription of Assistant audio while we enhance our privacy controls.” The company declined to comment on whether humans transcribe voice data collected from other Google services. A senior engineer involved with Google Assistant who left the company says people would overlook concerns about snooping if voice assistants, including Google’s, were more useful.
Facebook, where data privacy scandals have become routine, drew scoffs when it introduced Portal, a combination smart speaker and videophone, in November 2018. The company had wanted to hold off on releasing Portal until the heat from its Cambridge Analytica debacle had died down, but it wound up unveiling the device, which includes a built-in microphone and camera, soon after a different shocking data leak. Incredibly, Facebook billed the Portal as a privacy-centric project, promising that any stored mic or camera data would be kept on the device and off the cloud. Who wouldn’t want a Facebook camera tracking them around the living room as they walk and talk? Besides CEO Mark Zuckerberg, who keeps his laptop’s mic and camera covered and nonfunctional.
At one point or another, pretty much every Facebook user has heard the rumor that the company sharpens its ad targeting by secretly listening to people through the mics in their phones or other devices. When Congress called him to testify in 2018, Zuckerberg labeled that concern a “conspiracy theory.” Yet Facebook, too, has been relying on transcribed recordings to train its AI, and not just with audio from its users. In one instance, a contractor hired through Accenture Plc was instructed to use her personal Facebook account to call friends and family to create new audio, without telling them why. (She says this caused her anxiety.) A source within Facebook confirms the commands were recorded, but the company says it never instructed the actual calls to be captured, saying it’s “not something we would ever direct to be done.” Accenture referred a request for comment to Facebook.
Facebook has also relied on human transcribers for its chat app Messenger, which allows users to exchange audio clips instead of texting. The company prompted users with an option to have its AI auto-transcribe these voice messages but didn’t tell them these clips also went to contractor TaskUs Inc. for manual review. Facebook didn’t inform the TaskUs workers where the audio clips came from, so they assumed Facebook was using exactly the kind of surveillance dragnet Zuckerberg had told Congress didn’t exist. It didn’t help that TaskUs referred to its Facebook contract internally as “Prism,” the same code name used for a National Security Agency spying program revealed in 2013 by whistleblower Edward Snowden.
Along with separating voice files from user IDs the way Apple does, Facebook’s software slightly alters each person’s vocal pitch before relaying the files to contractors, says Andrew Bosworth, the vice president who oversees Facebook’s hardware division. He acknowledges that using voice command and video chat tools should require “a lot of faith in the technology distributors behind those tools” but says he trusts Google and Amazon, as well as his own company, to use voice data to improve their services rather than take advantage of sensitive information in the clips. His home in San Mateo, Calif., is sprinkled with three Portals and four other devices that use either Alexa or Google Assistant, including in his kitchen and his kids’ playroom.
Several of the big tech companies tweaked their virtual-assistant programs after a steady drip of news reports. While Google paused human transcriptions of Assistant audio, Apple began letting users delete their Siri history and opt out of sharing more, made sharing recordings optional, and hired many former contractors directly to increase its control over human listening. Facebook and Microsoft added clearer disclaimers to their privacy policies. And Amazon introduced a similar disclosure and started letting Alexa users opt out of manual reviews. “It’s a well-known thing in the industry,” Amazon’s Limp recently said about human transcription teams. “Whether it was well known among press or customers, it’s pretty clear we weren’t good enough there.”
It’s easy to fathom how an authoritarian government or unscrupulous three-letter agency could take advantage of these ubiquitous surveillance networks. The U.S. House of Representatives is considering legislation to curb automated eavesdropping by digital assistants, and a bipartisan group of senators has called for the Federal Trade Commission to investigate Amazon’s recordings of children, but all the relevant authorities are moving slowly. “Are users aware this processing is happening? If not, they need to be,” says Dale Sunderland, deputy commissioner of Ireland’s Data Protection Commission, which supervises tech companies’ compliance with European Union privacy rules and is reviewing the industry’s audio collection practices. “We want these companies to demonstrate to us how they’ve built in necessary safeguards.” A Pew Research Center survey estimated that most Americans are concerned about the data collection practices of smart speakers and similar listening devices. Still, adoption rates keep rising.
Some researchers say advances in smartphone processing power and a form of computer modeling called federated learning may eventually render this kind of eavesdropping obsolete—that the machines will get smart enough to figure out things without help from the contractors. For now, absent tougher laws or consumer backlash, the ranks of human audio reviewers will almost certainly continue growing to keep pace as listening devices proliferate.
Many former contractors say they’ve stopped using virtual assistants and unplugged their listening devices. The audio sexts were awkward and all, but some are more haunted by the idea that people are listening even to the most quotidian of conversations, like a father chatting with his son after school, or a husband and wife talking in the kitchen after work. “In my head, I would say, I shouldn’t be listening to this,” says a former contractor who spent months working on Siri transcriptions. “This is none of my business.” —With Mark Bergen, Gerrit De Vynck, Natalia Drozdiak, and Giles Turner