A speech dialog system comprises a first microphone, a second microphone, a processor and a memory. The first microphone captures first audio from a first spatial zone and produces a first audio signal. The second microphone captures second audio from a second spatial zone and produces a second audio signal. The processor receives the first audio signal and the second audio signal, and the memory contains instructions that control the processor to perform operations of a speech enhancement module, an automatic speech recognition module, and a speech dialog module that performs a zone-dedicated speech dialog.

The present disclosure pertains to speech processing, and more particularly, to speech processing in an environment with several spatial zones, where the zone from which the speech originates is significant when evaluating the speech.

The approaches described in this section are approaches that could be pursued, but are not necessarily approaches that have been previously conceived or pursued. The approaches described in this section are therefore not necessarily prior art, and may not be prior art to the claims in this application.

Multi-microphone speech applications often require interaction between several components, such as a speech enhancement (SE) module, an automatic speech recognition (ASR) module, and a speech dialog (SD) module. These components are often part of a framework application that handles the interactions between them. In current systems, generally: (a) The SE performs multi-channel speech enhancement to deliver an output signal of improved quality. Multi-channel speech enhancement may include acoustic echo cancellation, noise reduction, and spatial filtering such as beamforming, speech signal separation, or cross-talk cancellation. Although the SE typically provides a single output signal for the ASR, it may be a multiple-input multiple-output system with more than one output. The output signal is usually transmitted in blocks of 16 milliseconds each. (b) The ASR attempts to detect and recognize speech utterances, e.g., a wake-up-word (WuW) or a sequence of spoken words, based on the input signal, thus yielding a recognition result. All references to “WuW” in the present document are intended to cover wake-up words as well as other speech utterances. (c) The SD may further analyze the recognition result of the ASR and trigger further actions.
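The block-wise transmission mentioned above can be illustrated with a short sketch. The 16 millisecond block duration comes from the text; the 16 kHz sample rate and the function names are assumptions made only for this illustration.

```python
from typing import Iterator, List


def samples_per_block(sample_rate_hz: int, block_ms: int = 16) -> int:
    """Number of samples in one block of block_ms milliseconds."""
    return sample_rate_hz * block_ms // 1000


def split_into_blocks(signal: List[float], block_len: int) -> Iterator[List[float]]:
    """Yield consecutive full blocks; a trailing partial block is dropped."""
    for start in range(0, len(signal) - block_len + 1, block_len):
        yield signal[start:start + block_len]


# Assumed sample rate of 16 kHz (not stated in the text).
block_len = samples_per_block(16000)
blocks = list(split_into_blocks([0.0] * 600, block_len))
```

Under the assumed sample rate, each block holds 256 samples, so a 600-sample signal yields two full blocks.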

The SE can process input signals from multiple microphones and focus on different spatial zones to capture the signals of users (e.g., speakers) by means of spatial filtering. The SE can produce a spatially focused output signal for selective listening, i.e., suppression of speech signals that interfere from other spatial zones. The SE can also combine signals from multiple spatial zones to produce an output signal for broad listening.
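The difference between selective and broad listening can be sketched as follows. This is a deliberately simplified illustration, not the SE described here: it assumes that spatial filtering has already produced one signal per zone, and the function name `mix_output` and its parameters are hypothetical.

```python
from typing import Dict, List, Optional


def mix_output(zone_signals: Dict[int, List[float]],
               focus_zone: Optional[int] = None) -> List[float]:
    """Return one output block from per-zone signals of equal length.

    focus_zone=None  -> broad listening: sum all zones sample by sample.
    focus_zone=k     -> selective listening: pass zone k, suppress the rest.
    """
    if focus_zone is not None:
        return list(zone_signals[focus_zone])
    n = len(next(iter(zone_signals.values())))
    return [sum(sig[i] for sig in zone_signals.values()) for i in range(n)]


# Hypothetical per-zone blocks (assumed outputs of a spatial filter).
zones = {1: [0.5, 0.25], 2: [0.25, 0.5]}
broad = mix_output(zones)                    # all zones combined
selective = mix_output(zones, focus_zone=1)  # only zone 1
```

A practical SE would suppress interfering zones by spatial filtering of the raw microphone signals rather than by simply discarding pre-separated signals.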

In systems with multiple microphones, a WuW or other speech utterance may trigger a zone-dedicated speech dialog with a user. While the system is listening for the WuW, it is spatially open to all spatial zones, i.e., broad listening. After a WuW has been detected, the system may proceed with a speech dialog, in which case it is beneficial to allow only one spatial zone and to apply a spatial focus to that zone for further interaction with a user in that zone. During this phase, the system can ignore the other spatial zones, i.e., selective listening.

In some systems, multiple microphones cover multiple spatial zones, e.g., seats in an automobile, for interaction with users located in those zones. It may be desirable to conduct a speech dialog with only one of the users, for example to control the air conditioner or the seat heater for that user's seat. However, the information about which spatial zone is showing speech activity is available in the SE, but not in the ASR or the SD. The SD needs to be aware of the relevant zone in order to conduct an appropriate zone-dedicated dialog.

Even if the system is capable of identifying the zone in which the user spoke the WuW, the SE may switch from broad listening to selective listening only after a delay. This delay could be due to the ASR’s recognition process or to other latencies. If the user continues with the dialog immediately after speaking the WuW, the transition from broad listening to selective listening might happen in the middle of the user’s speech utterance.

U.S. Pat. No. 10,229,686, entitled “Methods and apparatus for speech segmentation with multiple metadata”, describes providing metadata to an ASR engine for improved detection of the beginning of speech. However, it does not mention sending zone activity information to help the ASR engine, or any other system component apart from an SE, determine the spatial zone in which a speaker is located.

In some prior art techniques, a selective-listening method is employed in which an SE provides a single output signal. A framework application requests information from the SE about the zone to which listening should be switched. The internal processing mode of the SE can then be switched from broad listening to selective listening after a WuW has been detected. One drawback of this technique is that the framework application must control the SE’s internal configuration for the broad and selective listening modes. This interaction between the framework application and the SE adds complexity to the framework application. Another drawback is that the request to the SE for information about the zone in which the WuW was spoken might be difficult to handle because of latencies between the components of the system, and, if the components run on different computers, clock skew could be problematic.

Parallel WuW detectors can be used to increase robustness against interfering speech. In this technique, an SE always delivers multiple spatially focused outputs, each corresponding to a selective listening mode for one zone of a plurality of spatial zones. During the WuW phase, multiple instances of an ASR run in parallel, each operating on a different output signal of the SE. After a WuW has been recognized, the framework application selects one of the SE output signals to start a speech dialog. One drawback of this approach is the high central processing unit (CPU) load caused by the numerous ASR instances running in parallel.

A technical problem addressed by the present disclosure is that an ASR can recognize speech utterances but cannot detect the spatial zone in which a speech utterance was spoken, and therefore cannot distinguish between desired and interfering speech components. The technical solution to this problem provided by the present disclosure is that the SE sends spatial zone activity information along with an audio signal to the ASR, for further processing and for distinguishing between the activities of different spatial zones.
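One way to picture zone activity information travelling with the audio is sketched below. The data layout (`SEBlock`) and the majority vote used to derive a zone decision are assumptions chosen for illustration; the text does not prescribe how the ASR combines the per-block information.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SEBlock:
    """One block sent from the SE to the ASR (hypothetical layout)."""
    samples: List[float]        # processed audio for this block
    active_zone: Optional[int]  # zone activity information; None = no speech


def zone_decision(blocks: List[SEBlock], start: int, end: int) -> Optional[int]:
    """Pick the zone most often active in blocks[start:end].

    start/end delimit the blocks spanned by the recognized utterance.
    Majority voting is an assumption, not the disclosed method.
    """
    votes = Counter(b.active_zone for b in blocks[start:end]
                    if b.active_zone is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical stream: silence, an utterance mostly from zone 1, silence.
stream = [SEBlock([0.0], z) for z in (None, 1, 1, 2, 1, None)]
decision = zone_decision(stream, 1, 5)
```

Because the metadata arrives with the audio blocks, the ASR can look up the zone for exactly the blocks it recognized, without a separate request to the SE.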

Another technical problem addressed by the present disclosure is that a seamless transition between broad listening and selective listening is not possible because of latencies in the detection of an utterance or of a zone. A technical solution to this problem provided by the present disclosure is to provide multiple audio streams, including selective and broad listening, from the SE to the ASR, and to buffer them so as to be able to “look back in time” and resume recognition in a relevant zone.
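The "look back in time" idea can be sketched with a bounded buffer that stores every stream for each block. The class name `LookBackBuffer`, the stream names, and the buffer size are hypothetical; the sketch only shows how buffered selective audio could be replayed from the block where the WuW ended.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Block = List[float]


class LookBackBuffer:
    """Keeps the most recent blocks of all SE output streams."""

    def __init__(self, max_blocks: int) -> None:
        # Each entry: (block index, {stream name: audio block}).
        self._entries: Deque[Tuple[int, Dict[str, Block]]] = deque(maxlen=max_blocks)
        self._next_index = 0

    def push(self, streams: Dict[str, Block]) -> int:
        """Store one block per stream and return its block index."""
        index = self._next_index
        self._entries.append((index, streams))
        self._next_index += 1
        return index

    def replay(self, stream: str, from_index: int) -> List[Block]:
        """Return buffered blocks of one stream, starting at from_index."""
        return [s[stream] for i, s in self._entries if i >= from_index]


buf = LookBackBuffer(max_blocks=4)
for k in range(5):
    buf.push({"broad": [float(k)], "zone1": [k + 0.25], "zone2": [k + 0.5]})

# Suppose the WuW ended at block 2 and was attributed to zone 2:
# resume recognition on the selective stream from block 3 onward.
resumed = buf.replay("zone2", from_index=3)
```

The buffer length bounds how much detection latency can be bridged; blocks older than `max_blocks` are discarded and cannot be replayed.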

The present disclosure provides a speech dialog system that includes a first microphone, a second microphone, a processor and a memory. The first microphone captures first audio from a first spatial zone and produces a first audio signal. The second microphone captures second audio from a second spatial zone and produces a second audio signal. The processor receives the first audio signal and the second audio signal, and the memory contains instructions that control the processor to perform operations of: (a) a speech enhancement (SE) module that detects, from the first audio signal and the second audio signal, speech activity in at least one of the first spatial zone or the second spatial zone, thus yielding processed audio; and determines from which of the first zone or the second zone the processed audio originated, thus yielding zone activity information; (b) an automatic speech recognition (ASR) module that recognizes an utterance in the processed audio, thus yielding a recognized utterance; and, based on the zone activity information, produces a zone decision that identifies from which of the first zone or the second zone the recognized utterance originated; and (c) a speech dialog (SD) module that performs a zone-dedicated speech dialog in response to the recognized utterance and the zone decision.

Additionally, the SD module, based on the recognized utterance and the zone decision, decides from which of the first zone or the second zone to obtain additional audio, thus yielding a routing decision. The ASR module, based on the routing decision, obtains the additional audio from either the first zone or the second zone and recognizes an additional utterance in the additional audio.
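The routing decision described above can be sketched as a small function. The rule used here (route to the decided zone when the recognized utterance is a wake-up word) and the example wake-up word "hello car" are assumptions for illustration only; an actual SD module may apply different criteria.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RecognitionResult:
    """Output of the ASR module (hypothetical layout)."""
    utterance: str       # recognized utterance
    zone: Optional[int]  # zone decision; None if no zone could be decided


def routing_decision(result: RecognitionResult,
                     wake_word: str = "hello car") -> Optional[int]:
    """Return the zone from which to obtain additional audio.

    None means no zone is selected, i.e., keep broad listening.
    The wake-word rule is an assumed policy, not the disclosed one.
    """
    if result.utterance.strip().lower() == wake_word and result.zone is not None:
        return result.zone
    return None


route = routing_decision(RecognitionResult("Hello Car", zone=2))
```

With such a decision in hand, the ASR module would read the follow-up audio from the selected zone's stream, while the other zone is ignored for the remainder of the dialog.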

