
John Esposito curates content at DZone, while writing a dissertation on ancient Greek philosophy and raising two cats. In a previous life he was a database developer and network administrator. John is a DZone Zone Leader.

Web-Standard Speech Recognition: W3C Report and Unofficial Spec

01.06.2012

Nothing says 'truly modern technology' like speech recognition -- unless it's gesture recognition, but that requires more specialized, less common hardware (unless Kinect goes as far as Microsoft hopes).

If HTML5 is going to make the web truly modern, then it needs to go beyond 'making web development easier, faster, and better'. In particular, user experience lies squarely in HTML5's sights -- and yet, whenever I babble about some awesome new API, my non-developer friends will (reasonably) raise a skeptical eyebrow and snort, 'Until I can talk to Google, I won't be impressed.'

The W3C HTML Speech Incubator Group has taken the first steps to meeting this non-developer UX demand, aiming its ambitious eye at a truly voice-enabled web. (Incubator Groups define a need as precisely as possible; Working Groups try to meet the need as practically as possible.)

Last month the Group published its final report, which satisfied the following general requirement (specified in the group's charter):

The mission of the HTML Speech Incubator Group, part of the Incubator Activity, is to determine the feasibility of integrating speech technology in HTML5 in a way that leverages the capabilities of both speech and HTML (e.g., DOM) to provide a high-quality, browser-independent speech/multimodal experience while avoiding unnecessary standards fragmentation or overlap.


Following HTML5's 'use-cases first' approach, the report lists its desired use-cases in roughly prioritized order.

 
Reaching deeper than use-cases, the group discussed a series of technical requirements and divided them into 'strong interest', 'moderate interest', and 'mild interest'. The requirements list gets rather long, so I won't reproduce it in this article (see the appropriate section of the report), but basically:

The 'strong interest' requirements focused on making speech recognition as smart, unobtrusive, and concurrent as possible: letting web apps specify domain-specific grammars, allowing audio processing before capture completion, and returning useful recognition or non-match errors to the web app.

The 'moderate interest' requirements included transport-specific technical matters and pie-in-the-sky impracticalities: the spec must include a mandatory codec without IP issues, and recognition without a specified grammar should be possible.

The 'mild interest' requirements, it turns out, received less direct attention in part because several were more or less implied in the 'strong interest' list.
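
To make the three 'strong interest' ideas a little more concrete, here is a toy sketch in TypeScript. It is not the API proposed in the report: every name in it (`createRecognizer`, `feed`, `finish`, `RecognitionEvent`) is hypothetical, and "audio" is faked as a stream of already-transcribed words. It only illustrates the shape of the requirements: the web app supplies its own grammar, results start flowing before capture completes, and the app always hears back either a match or a non-match.

```typescript
// Hypothetical sketch: none of these names come from the W3C report or draft API.

// A "grammar" here is simply the set of phrases the web app will accept.
type Grammar = ReadonlySet<string>;

type RecognitionEvent =
  | { kind: "interim"; heard: string } // emitted while input is still arriving
  | { kind: "match"; phrase: string } // final transcript is in the grammar
  | { kind: "nomatch"; heard: string }; // final transcript is not in the grammar

function createRecognizer(
  grammar: Grammar,
  emit: (event: RecognitionEvent) => void,
) {
  let words: string[] = [];
  return {
    // Stand-in for one chunk of audio: a single already-transcribed word.
    feed(word: string): void {
      words.push(word.toLowerCase());
      emit({ kind: "interim", heard: words.join(" ") });
    },
    // Called when capture ends; always reports a match or a non-match.
    finish(): void {
      const heard = words.join(" ");
      words = [];
      if (grammar.has(heard)) {
        emit({ kind: "match", phrase: heard });
      } else {
        emit({ kind: "nomatch", heard });
      }
    },
  };
}

// Convenience wrapper: run a whole utterance and collect every event in order.
function recognize(grammar: Grammar, spokenWords: string[]): RecognitionEvent[] {
  const events: RecognitionEvent[] = [];
  const recognizer = createRecognizer(grammar, (event) => {
    events.push(event);
  });
  for (const word of spokenWords) {
    recognizer.feed(word);
  }
  recognizer.finish();
  return events;
}
```

Calling `recognize(new Set(["order pizza"]), ["Order", "pizza"])` yields two interim events followed by a final match, while an utterance outside the grammar ends in a `nomatch` event that the app can handle however it likes.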

The final report is quite long, and proposes (rudiments of) both a JavaScript API and a specialized speech protocol. The API now has its own unofficial draft in standard W3C spec format (which is a little easier to read as a separate document); the speech protocol is defined as a sub-protocol of WebSockets, chiefly because WebSockets has already done a lot of the low-level duplexing work (which several of the 'strong interest' requirements demand).
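
For readers who haven't met WebSocket sub-protocols: the client offers a list of sub-protocol names during the opening handshake (in the `Sec-WebSocket-Protocol` header), and the server picks at most one of them. The TypeScript sketch below shows only that selection step. The name `example-speech` is a placeholder I made up, not the token used in the report, and the helper functions are likewise hypothetical.

```typescript
// Placeholder sub-protocol name; the report's actual token is NOT assumed here.
const SPEECH_SUBPROTOCOL = "example-speech";

// Split the comma-separated Sec-WebSocket-Protocol header value into names,
// keeping the client's preference order and dropping empty entries.
function parseOfferedProtocols(headerValue: string): string[] {
  return headerValue
    .split(",")
    .map((token) => token.trim())
    .filter((token) => token.length > 0);
}

// Server side: pick the first offered sub-protocol that the server supports.
// Returning null means the server accepts none of the offered sub-protocols.
function selectSubprotocol(
  headerValue: string,
  supported: ReadonlySet<string>,
): string | null {
  for (const offered of parseOfferedProtocols(headerValue)) {
    if (supported.has(offered)) {
      return offered;
    }
  }
  return null;
}
```

On the browser side, the matching step would be to pass the name as the second argument of the `WebSocket` constructor and then check `socket.protocol` once the connection opens to see what the server chose.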

So take a look at the report and see what you think. The API and protocol are a long way from their respective final states, no doubt. But if it seems to you that a legitimately bi-directional voice-enabled web is a good thing, this first-month-after-report-publication is a great time to start contributing your own ideas.

 

Published at DZone with permission of its author, John Esposito.



Comments

John Esposito replied on Fri, 2012/01/06 - 12:32pm

The term 'HTML Speech Incubator Group' is kinda funny, though.
