Hello, Computer: Inside Apple’s Voice Control

This year’s Worldwide Developers Conference was big. From dark mode in iOS 13 to the newly rechristened iPadOS to the unveiling of the born-again Mac Pro and more, Apple’s annual week-long bonanza of all things software was arguably one of the most anticipated and exciting events in recent Apple history.

Accessibility certainly contributed to the bigness as well. Every year Apple moves mountains to ensure accessibility’s presence is felt not only in the software it previews, but also in the sessions, labs, and other social gatherings in and around the San Jose Convention Center.

“One of the things that’s been really cool this year is the [accessibility] team has been firing on [all] cylinders across the board,” Sarah Herrlinger, Apple’s Director of Global Accessibility Policy & Initiatives, said to me following the keynote. “There’s something in each operating system and things for a lot of different types of use cases.”

One announcement that unquestionably garnered some of the biggest buzz during the conference was Voice Control. Available on macOS Catalina and iOS 13, Voice Control is a method of interacting with your Mac or iOS device using only your voice. A collaborative effort between Apple’s Accessibility Engineering and Siri groups, Voice Control aims to revolutionize the way users with certain physical motor conditions access their devices. At a high level, it’s very much a realization of the kind of ambient, voice-first computing dreamed up by sci-fi television stalwarts like The Jetsons and Star Trek decades ago. You talk, it responds.

And Apple could not be more excited about it.

“I Had Friggin’ Tears in My Eyes”

The excitement for Voice Control at WWDC was palpable. The sense of unbridled pride and joy I got from talking to people involved in the project was unlike anything I’d seen before. The company’s ethos to innovate and enrich people’s lives is a boilerplate talking point at every media event. But to hear engineers and executives like Herrlinger gush over Voice Control was something else: it was emotional.

Nothing captures this better than the anecdote Craig Federighi shared at the live episode of John Gruber’s podcast, The Talk Show. During the segment on Voice Control, Federighi recounted a story about an internal demo of the feature he saw from members of Apple’s accessibility team during a meeting. The demonstration went so well, he said, that he almost started to cry backstage.

“It’s one of those technologies…you see it used and not only are you amazed by it, but you realize what it could mean to so many people,” Federighi said to Gruber. “Thinking about the passion [of] members of the Accessibility team and the Siri team and everyone who pulled that together is awesome. It’s some of the most touching work we do.”

Federighi’s account completely jibes with the sentiment around WWDC. Everyone I spoke to – be it fellow reporters, attendees, or Apple employees – expressed the same level of enthusiasm for Voice Control. The consensus was clear: this is a great feature. I’ve heard the engineering and development process for Voice Control was quite the undertaking for workers in Cupertino. It took, as mentioned at the outset, a massive, cross-functional collaborative effort to pull this feature together.

In a broader sense, the emotion behind seeing Voice Control come to fruition lies not only in the technology itself, but in its reveal too.

That Apple chose to devote precious slide space, as well as a good chunk of stage time, to talk up Voice Control is highly significant. As with the decision to move the Accessibility menu to the front page of Settings in iOS 13, the symbolism is important. Apple has spent time talking about accessibility at various events over the last several years, and for the company to do so again in 2019 serves as yet another poignant reminder that it cares deeply for the disabled community. It is a big deal that Apple highlights accessibility alongside its other marquee, mainstream features at the biggest event in the Apple universe each summer.

“My success is completely determined by the technology I have available to me,” said Ian Mackay, who became a quadriplegic as the result of a cycling accident and who starred in Apple’s Voice Control video shown at WWDC. “Knowing that accessibility is important enough for Apple to highlight at a huge event like WWDC reaffirms to me that Apple is interested and engaged in furthering and enhancing these technologies that give me independence.”

A Brief History of Voice Control

The Voice Control feature we know today has lineage in Apple history. One of the banner features of the iPhone 3GS,1 released in 2009, was Voice Control.2

The differences are vast. The version that shipped ten years ago was rudimentary, replete with a robotic-sounding voice. At the time, Apple touted Voice Control for its ability to allow users “hands-free operation” of their iPhone 3GS; Phil Schiller talked up the “freedom of voice control” in the press release. The functionality was bare-bones: you could make calls, control music playback, and ask what’s playing. In the voice computing timeline, it was prehistoric technology.3 Of course, Voice Control’s launch with the iPhone 3GS in June 2009 predated Siri by over two years. Siri wouldn’t debut until October 2011, with the iPhone 4S.

The Voice Control of 2019, by contrast, is a supercomputer. Making phone calls and controlling music playback is par for the course nowadays. With this Voice Control, you quite literally tell your computer to wake up and do things like zoom in and out of photos, scroll, drag and drop content, drop a pin on a map, use emoji – even learn a new vocabulary.

When talking about emerging markets or new technologies, Tim Cook likes to say they’re in the “early innings.” Voice-first computing surely is in that category. But compare it to where it was a decade ago and the progress is astounding. The Voice Control that will ship as part of macOS Catalina and iOS 13 is light years ahead of its ancestor; it’s so much more sophisticated that it’s exciting to wonder how the rest of the voice-first game is going to play out.

Voice Control’s Target Audience

The official reason Apple created Voice Control is to provide yet another tool with which people with certain upper body disabilities can access their devices.

Voice Control shares many conceptual similarities with the longstanding Switch Control feature, first introduced six years ago with iOS 7. Both enable users who can’t physically work with a mouse or touchscreen to manipulate their devices with the same fluidity as those traditional input devices. They are clearly differentiated, however, largely in their respective interaction models. Where Switch Control relies solely on switches to navigate a UI, Voice Control ups the ante by doing so using only the sound of your voice.

There is also opportunity for Voice Control to have relevance beyond the original intended use case. It might appeal to people with RSI issues, as using your voice to control your machine can alleviate the pain and fatigue associated with a keyboard and pointing device. Likewise, others might simply find it fun to try Voice Control for the futuristic feeling of telling their computer to do stuff and watching it respond accordingly. Either way, it’s good that accessibility gets more mainstream exposure.

As Mackay told me in our interview: “I feel Sarah Herrlinger said it best when she said, ‘When you build for the margins, you actually make a better product for the masses.’ I’m really excited to see how those with and without disabilities utilize this new technology.”

How Voice Control Works

The essence of Voice Control is this: you tell the computer what to do and it does it.

Apple describes Voice Control as a “breakthrough new feature that gives you full control of your devices with just your voice.” The possibilities for what you can do are virtually endless. Chances are good that Voice Control can handle pretty much any task you might throw at your MacBook Air or iPad Pro.

There is something of a learning curve, insofar as you have to grasp what it can do and how to speak to it. By the same token, harnessing Voice Control is decidedly not like using a command line. The syntax has structure, but isn’t so rigid that it requires absolute precision. The truth is Voice Control is flexible; it is designed to be deeply customizable (more on this below). And of course, emblematic of the Apple ecosystem, the fundamentals of Voice Control work the same way across iOS and macOS.

Voice Control is also integrated system-wide,4 so it isn’t necessary to be in a particular app to invoke a command. Take writing an email, for instance. Imagine a scenario in which you’re browsing the web in Safari and suddenly remember you need to send an email to someone. With Safari running on your iPad (or iMac or iPhone), you can tell Voice Control to “Open Mail” and it’ll launch Mail. From there, you can say “Tap/Click New Message” and a compose window pops up. Complete the header fields (recipient, copies, subject) with the corresponding commands, then dive into composing your message in the body text field. When you’re finished, saying “Tap Send” sends the message.

As the axiom goes, in my testing “it just works.” To initiate Voice Control, you tell the computer to “wake up.” This command tells the system to get ready and start listening for input. Basic actions in Voice Control involve one of three trigger words: “Open,” “Tap,” and “Click.” (Obviously, you’d use whichever of the last two is appropriate for the operating system you’re on at the moment.) Other commands, such as “Double Tap,” “Scroll,” and “Swipe,” are common actions as well, depending upon context. When you’re done, saying “go to sleep” tells the computer to drop the mics and stop listening.
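
For developers, the names Voice Control listens for come in large part from the accessibility attributes apps already expose. Here’s a minimal sketch – the view controller class and label strings are hypothetical – of UIKit’s accessibilityUserInputLabels property (new in iOS 13), which lets an app offer alternate spoken names for a control:

```swift
import UIKit

// Minimal sketch: offering Voice Control alternate spoken names for a
// button. The class and label strings here are hypothetical.
final class ComposeViewController: UIViewController {
    private let sendButton = UIButton(type: .system)

    override func viewDidLoad() {
        super.viewDidLoad()
        sendButton.setTitle("Submit Response", for: .normal)
        sendButton.frame = CGRect(x: 20, y: 80, width: 220, height: 44)

        // Voice Control derives an element's spoken name from its
        // accessibility label, which for a button defaults to the
        // visible title. accessibilityUserInputLabels (iOS 13) supplies
        // alternates, so "Tap Send" or "Tap Reply" reach this button too.
        sendButton.accessibilityUserInputLabels = ["Send", "Reply", "Submit Response"]

        view.addSubview(sendButton)
    }
}
```

When an element doesn’t expose a usable name, the numbered overlay described next is the fallback.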

Using Voice Control’s grid to zoom in on a specific area in Maps.

In addition to shouting into the ether, Voice Control includes a numbered grid system which lets users call out numbers in places where they may not know a particular name. In Safari, for example, the Favorites view can show little numbers (akin to footnotes) alongside each website’s favicon. Suppose MacStories is first in your Favorites. Telling the computer to “Open number 1” will immediately bring up the MacStories homepage. However many favorites you have, there will be a corresponding number for each should you choose to enable the grid (which is optional). You can also say “show numbers” to bring them up on demand. The grid is pervasive throughout the OS, touching everything from the share sheet to the keyboard to Maps.

The grid system option lives in a submenu of Voice Control settings called Continuous Overlay, which Apple says “speeds interaction” with Voice Control. In addition to the grid system, there are also choices to show nothing, item numbers only (sans grid), or item names.

You can optionally enable persistent labels for item names or numbers.

Beyond basic actions like tapping, swiping, and clicking, Voice Control also supports a range of advanced gestures. These include long presses, zooming in and out, and drag and drop. This means Voice Control users can fully harness the power and convenience of “power user” features like 3D Touch and Haptic Touch, as well as right-click menus on the Mac. Text-editing features like cut, copy, and paste, as well as emoji selection, are also supported. Some advanced commands include “Drag that” and “Long press on [object].”

Users can configure Voice Control to create a customized experience that’s just right for them. On iOS and the Mac, going to Voice Control in Settings shows a cavalcade of options. You can enable or disable commands such as “Show Clock” and even create your own. There are numerous categories, offering commands for everything from text selection to device control (e.g. rotating your iPad) to accessibility features and more. Voice Control is remarkably deep for a 1.0.

Customization of Voice Control commands.

One notable section in Voice Control settings is what Apple calls Command Feedback. Here, you have options to play sounds and show hints while using Voice Control. In my testing, I’ve enjoyed having both enabled because they’re nice secondary cues that Voice Control is working; the hints are especially helpful whenever I get stuck or forget what to say. It’s a terrific little detail that’s visually reminiscent of the second-generation Apple Pencil’s pairing and battery indicator. My only complaint is that I wish the hint text were bigger and higher contrast.5 A small nit to pick.

Another noteworthy section is Vocabulary. Tapping the + button allows users to teach Voice Control words or phrases it wouldn’t know otherwise. This comes in particularly handy for anyone who often uses industry-specific jargon. If you’re an editor, for example, you could add common journalistic shorthands for headline (“hed”), subhead (“dek”), lead (“lede”), and paragraph (“graf”), amongst others, to make editing copy and working with colleagues easier and more efficient.

It’s worthwhile spending some time poking around in Voice Control’s settings as you play with the feature to get a sense of its capabilities. As mentioned, Voice Control has tremendous breadth and depth for a first version product; looking at it now, it’s easy to get excited for future iterations.

On a privacy-related note, in my briefings with Apple at WWDC the company was keen to emphasize that Voice Control was built to be privacy-conscious. Audio processing happens locally on the device; your voice is never sent to iCloud or another server. Apple does, however, provide an Improve Voice Control toggle that “shares activity and samples of your voice.” It is opt-in, disabled by default.

In the Trenches with Voice Control

To describe what Voice Control is and how it works is one way to convey its power and potential, but there is nothing like actively using it to see the kind of impact it has on someone’s life. For Ian Mackay, Voice Control lives up to the hype.

“When I first heard about Voice Control, my jaw dropped. It was exactly what I had been looking for,” he said.

In practice, Mackay finds Voice Control “impressively” reliable, noting that the Siri dictation engine, which Voice Control uses, is “quite accurate and, in my opinion, very intuitive.” He’s pleased Voice Control works the same way cross-platform, as using a universal dictation system in Siri “really lessens the learning curve and lets you focus your understanding on one intuitive set of commands.” This familiarity, he said, is key: it makes working with documents and other files a seamless experience as he moves between devices, because the same dictation engine drives Voice Control everywhere.

Todd Stabelfeldt, a software developer and accessibility advocate who appeared in Apple’s The Quadfather video and gave a lunchtime talk at WWDC 2017, is cautiously optimistic about Voice Control. “I thought like a software developer [when Voice Control was announced],” he said. “The time spent to create, design, write, and most importantly test! [I’m] generally excited, but as I have learned from my wife: ‘Trust but verify.’”

For his part, Stabelfeldt uses Dragon NaturallySpeaking for his daily voice control needs, but is excited for Voice Control on iOS, especially for telephone calls and text messages. “With the amount of phone calls, text messages, [and] navigating I do during the day, having Voice Control will make these tasks a little easier and should assist with less fatigue,” he said.

As with all software, however, Voice Control is not so perfect that it can’t be improved. Mackay told me Voice Control falters at times in loud, crowded settings, and on noisy, windy days outside. He’s quick to note ambient noise is a problem for all voice-recognition software, not just Apple’s implementation. “The device has to hear what you’re saying, and although the microphones are great, enough background noise can still impair its accuracy,” he said. A workaround for Mackay is to use Switch Control in places where Voice Control can be troublesome. In fact, he thinks both technologies, which function similarly, complement each other well.

“[Noisy environments] are a great example of how Voice Control and Switch Control can work beautifully in tandem,” he said. “When you are in a noisy area, or perhaps you want to send a more private message, you can use Switch Control to interact with your phone. It also can really speed up your device use by using both technologies. Users will find that some things are faster with switch and some are faster with voice.”

Stabelfeldt echoes Mackay’s sentiment about noisy environments, saying it’s “part of the problem” with using dictation software. He added the Voice Control experience would be better if Apple created “an incredible headset” to use with it.6

Considering Voice Control and Speech Delays

Benefits notwithstanding, the chief concern I had when Voice Control was announced was whether I – or any other person with a speech delay – would be able to successfully use it. In this context, success is measured by the software’s ability to decipher a non-standard speech pattern. In my case, that would be my stutter.

I’ve written and tweeted about the importance of this issue a lot as digital assistants like Siri and Amazon’s Alexa have risen in prominence. As the voice is the primary user interface for these products,7 Voice Control included, the accessibility raison d’être is accommodating users, like me, who have a speech impairment.

Speech impairments are disabilities too. The crux of the issue is these AI systems were built assuming normal fluency – a hard enough task when humans are trying to teach machines to understand other humans. Ergo, it stands to reason that a stutterer’s speech compounds things, making the job exponentially more difficult. And this is not a trivial number of people: according to the National Institute on Deafness and Other Communication Disorders, some 7.5 million people in the United States “have trouble using their voices.” We deserve to experience voice interfaces like anyone else and reap the benefits they bring as an assistive technology.

Yet for a certain segment of users – those with speech impairments – there is a real danger in voice-first systems like Siri, Alexa, and yes, Voice Control being perceived as mostly inaccessible by virtue of their inability to reliably understand when someone stutters through a query or command. Exclusion by incompetence is a lose-lose situation for the user and the platform owner.

It’s unfortunate and frustrating because it means the entire value proposition behind voice technology is lost; there’s little incentive to use it if it has trouble understanding you, regardless of the productivity gains. That’s why it’s so imperative for technology companies to solve these problems – I have covered Apple at close range for years so they’re my main focus, but to be clear, this is an industry-wide dilemma. I can confirm the Echo Dot sitting on my kitchen counter8 suffers the same setbacks in understanding me as Siri does on any of my Apple devices. To be sure, Amazon, Apple, Google, Microsoft – all the big players with all the big money – have an obligation to throw their respective weights around to ensure the voice-driven technologies of the future are accessible to everyone.

In my testing of Voice Control, done primarily on a 10.5-inch iPad Pro running the iPadOS public beta, I’m pleased to report that Voice Control has responded well (for the most part) to my speech impediment. It’s been a pleasant surprise.

Stuttering, for me, has been a fact of life for as long as I can remember. It will happen, no matter what. But in using things like my HomePod or Voice Control, I have made a concerted effort to be more conscientious of my mindset, breathing, and comfort level. These all are factors that contribute to whether I stutter more severely (e.g. when I’m anxious or nervous), and they definitely play a role in how I use technology. Thus, while testing Voice Control, I’ve constantly reminded myself to slow down and consider what I should say and how I should phrase it.

And it has worked well, all things considered. Voice Control doesn’t understand me with 100 percent accuracy, but I can’t expect it to. It does a good job, though, about 80–90 percent of the time. Whatever work Apple has done behind the scenes to improve the dictation parser is paying off; it has gotten better and better over time.

Herrlinger did tell me at WWDC that, in developing Voice Control, Apple put in considerable work to improve the dictation parser so that it’d handle different types of speech more gracefully. Of course, the adeptness should grow with time.

Overall, the progress is very heartening. No matter the psychological tricks I use on myself, the software still needs to perform at least reasonably well. That Voice Control has exceeded my expectations in terms of understanding me gives me hope there’s a brighter future in store for accessible AIs everywhere.

Voice Control and the Apple Community

During and after the WWDC keynote, my Twitter feed was awash in praise and awe of Voice Control. That it resonated with so many in the Apple community is proof that Voice Control is among the crown jewels of this year’s crop of new features.

Matthew Cassinelli, an independent writer and podcaster who worked in marketing at Workflow prior to Apple acquiring it in 2017, is excited about how Voice Control can work with Shortcuts. He believes Voice Control and the Shortcuts app “seem like a natural pairing together in iOS 13 and iPadOS,” noting that the ability to invoke commands with your voice opens up the OS (and by extension, shortcuts) in ways that weren’t possible before. He shares a clever use case: one could, he says, take advantage of Voice Control’s Vocabulary feature to build custom names for shortcuts and trigger them by voice. Although the Shortcuts app is touch-based, Cassinelli says in his testing of Voice Control that any existing shortcuts should be “ready to go” in terms of voice activation.
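
No code is required for the pairing Cassinelli describes – Voice Control drives the Shortcuts app at the UI level, by name. For developers who want an app’s actions to be easy to trigger by voice more broadly, the adjacent Siri Shortcuts donation API (iOS 12 and later) is worth noting. What follows is a hedged sketch, not part of Voice Control itself; the activity type and phrase are hypothetical:

```swift
import UIKit
import Intents

// Hedged sketch of a Siri Shortcuts donation; this is the Siri-side
// mechanism, separate from Voice Control. The activity type and phrase
// below are hypothetical.
func donateBrewTimerActivity(from viewController: UIViewController) {
    let activity = NSUserActivity(activityType: "com.example.app.startBrewTimer")
    activity.title = "Start Brew Timer"
    activity.isEligibleForPrediction = true           // surface as a shortcut
    activity.suggestedInvocationPhrase = "Brew time"  // suggested spoken trigger
    viewController.userActivity = activity            // donated while the screen is current
}
```

A shortcut donated this way carries a speakable name, which dovetails with the Vocabulary trick Cassinelli outlines.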

Beyond shortcuts, Cassinelli is effusive in his feelings about voice-controlled technology as a whole. He feels Voice Control represents a “secret little leap” in voice tech because of the way it liberates users by allowing them near-unfettered control of their computer(s) with just their voice. The autonomy Voice Control affords is exciting, because autonomy is independence. “Now anyone with an Apple device truly can just look at it, say something, and it’ll do it for you,” he said.

Cassinelli also touched on Voice Control alleviating points of friction for everyone, as well as its broader appeal. He notes Voice Control removes much of the “repetitiveness” of invoking “Hey Siri” over and over because it can do so much on its own, and Apple’s facial awareness APIs guard against unintended, spurious input.9

“I suspect a select few will take this to another level beyond its accessibility use and seek out a Voice Control-first experience, whether it be for productivity purposes or preventive ergonomic reasons,” he said. “I can see use cases where I’m using an iPad in production work and could utilize the screen truly hands-free.”

Rene Ritchie, known for his work at iMore and his burgeoning YouTube channel, Vector, told me he was “blown away” by Voice Control at WWDC. Looking at the big picture, he sees Apple trying to provide a diverse set of interfaces; touch-first may be king, but the advent of Voice Control is further proof that it isn’t the one true input method. Ritchie views Apple as wanting to “make sure all their customers can use all their devices using all the different input methods available.”

“We’ve seen similar features before from other platforms and vendors. But having [Voice Control] available on all of Apple’s devices, all at the same time, in such a thoughtful way, was really impressive,” he said.

Like Cassinelli, Ritchie envisions Voice Control being useful in his own work for the sheer coolness and convenience of it. “I do see myself using Voice Control. Aside from the razzle-dazzle Blade Runner sci-fi feels, and the accessibility gains, I think there are a lot of opportunities where voice makes a lot of sense,” he said.

Voice Control’s Bright Future

Of the six WWDCs I’ve covered since going for the first time in 2014, the 2019 edition sure felt like the biggest yet. Much of the anticipation had to do with the extreme iPad makeover and the pent-up demand for the new Mac Pro. The rumor mill predicted these things; the Apple commentariat knew they were coming.

Voice Control, on the other hand, was truly a surprise. Certainly, it was the one announcement that tugged at the heartstrings the hardest; some of the biggest applause at the keynote came right after Apple’s Voice Control intro video ended. As transformative as something like iPadOS will be for iPad power users, you didn’t hear about Craig Federighi cutting onions10 over thumb drive support in Files.

It was important enough to merit time in the limelight on stage – which, for the disabled community and accessibility at large, is always a huge statement. The importance cannot be overstated; seeing the accessibility icon on that giant screen boosts so much awareness of our marginalized and underrepresented group. It means accessibility matters. To wit, it means disabled people use technology too, in innovative and life-altering ways – like using Voice Control on their Mac or iPhone.

From an accessibility standpoint, Voice Control was clearly the star of the show. When it comes to accessibility, Apple’s marketing approach is consistent, messaging-wise, with “bigger” fish like milestone versions of iOS and hardware like the iPhone. The company loves every new innovation and is genuinely excited to get them into customers’ hands. But I’ve never seen anything like the emotion that came from discussing and demoing Voice Control this year. It still was Apple marketing, but the vibe felt very different.

It’s early days for Voice Control and it has room to grow, but it’s definitely off to a highly promising start. Going forward, I’ll be interested to see what Apple does with the feedback from people like Ian Mackay and Todd Stabelfeldt, who really push this tech to its limits every single day. In the meantime, I believe it’s not hyperbolic to say Voice Control as it stands today will be a game-changer for lots of people.


  1. The 3GS was also significant for bringing discrete accessibility features to iOS (née iPhone OS) for the first time. There were four: VoiceOver, Zoom, White-on-Black, and Mono Audio. ↩︎
  2. Apple likes recycling product names. See also: the recently-departed 12-inch MacBook. ↩︎
  3. Voice Control still exists! Users can determine the function of the side button when held down; one of the options is Voice Control. It can be set by going to Settings ⇾ Accessibility ⇾ Side Button. ↩︎
  4. On the Mac, it’s found via System Preferences ⇾ Accessibility ⇾ Voice Control. On iOS, it’s Settings ⇾ Accessibility ⇾ Voice Control. ↩︎
  5. Speaking of contrast, SF Symbols are a triumph. ↩︎
  6. Apple has sold accessibility-specific accessories, online and in their retail stores, since 2016. ↩︎
  7. In Apple’s case, Siri does offer an alternative UI in the form of Type to Siri. ↩︎
  8. I bought one to pair with the Echo Wall Clock. It’s a nice visual way to track timers when I cook, and of course, the clock itself is useful for telling time. ↩︎
  9. On iOS devices with Face ID, turning your head away disables the microphones. If someone walks into the room, your conversation won’t be misinterpreted as input. ↩︎
  10. In the second sense of the word. ↩︎
