Imagine that you’re planning for the upcoming fiscal year and suddenly remember that a key budget element that impacts your department was mentioned in an all-hands meeting a week or so ago.

You search the agenda documents but don’t see anything that matches your memory, and the executives haven’t yet released the full budget figures for the next quarter, so there’s no way to look up that detail.

One option is to email the comptroller’s office and ask the team to give you the budget information, but you don’t remember enough detail about the particular point the speaker or speakers made in the all-hands meeting. What do you do?

A few years ago, searching through hours of all-hands meeting video would’ve been your only option. But thanks to computer vision, machine learning, and robust metadata extraction, today’s enterprise video platforms (EVPs) are able to help you find what you need to know to appear knowledgeable when you ask for additional details.

Technologies that perform facial recognition and speech-to-text indexing have been around for at least 2 decades. In fact, a research project in Europe that I was part of around 2004 posited the very scenario mentioned above, but the tools were complex and didn’t easily tie into EVP solutions. Still, even those tools could process up to 32,000 pieces of metadata per frame.

Today’s solutions offer more accurate facial recognition and slightly better speech-to-text transcription, the latter in part derivative of what we’d done on both our research project and other projects that focused on virtual beam-forming microphones for use in corporate environments. Yet the true power of these solutions isn’t in the discrete tools but rather the holistic approach to searching indexed content.


First, if the presenter was properly recorded (meaning the recording used a good lavaliere microphone and didn’t combine that microphone with audience mics in the final recording), then recent advances in speech-to-text processing should yield adequate results to find the keywords you’re looking for.

The same may hold true for several speakers, provided they don’t talk over one another. Some solutions even offer a way to differentiate speakers. While this is usually a fairly basic distinction (e.g., Speaker 1, Speaker 2), it still lets searchers filter a search by a particular speaker.

What if you don’t remember who the speaker was, or the audio is too unintelligible for a speech-to-text transcription engine to use? Several solutions offer the ability to find videos based on an image of the speaker.

We’re all familiar with facial recognition, which has become more prevalent thanks to Facebook and facial-matching technologies like those built into the Photos app on Apple’s iOS products.

Video facial recognition, though, is more of a black art. After all, video is at least 24 still images per second, and sometimes up to 60 images per second. The amount of information contained in a single still image, or frame, is staggering, which is why most facial recognition systems for single still images require several seconds to process each frame.
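To make the speaker filter concrete, here is a minimal sketch in Python. It assumes a transcript that has already been split into segments carrying a generic speaker label; the `Segment` structure, the `search_transcript` helper, and the sample sentences are all hypothetical and are not taken from any particular EVP.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float   # seconds from the start of the video
    speaker: str   # generic label such as "Speaker 1" (assumed format)
    text: str      # speech-to-text output for this segment


def search_transcript(segments, keyword, speaker=None):
    """Return segments containing keyword, optionally limited to one speaker."""
    needle = keyword.lower()
    return [
        s for s in segments
        if needle in s.text.lower() and (speaker is None or s.speaker == speaker)
    ]


# Hypothetical transcript data, for illustration only.
transcript = [
    Segment(12.0, "Speaker 1", "Welcome to the all-hands meeting."),
    Segment(95.5, "Speaker 2", "The travel budget will change next fiscal year."),
    Segment(140.0, "Speaker 1", "Questions about the budget go to the comptroller."),
]

hits = search_transcript(transcript, "budget", speaker="Speaker 2")
for h in hits:
    print(f"{h.start:.1f}s {h.speaker}: {h.text}")
```

Even with labels as plain as these, the filter cuts the results to the one speaker you half remember, and the start time tells you where to begin watching.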

Add to this the way that interframe compression works, where codecs like H.264 use full frames, or I-frames, coupled with differential frames like P- or B-frames that don’t store the entire image, and the complexity of decoding and indexing each individual frame rises significantly.

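In addition, whether it’s a still image or a single frame of video, the complexity also increases if there are several people in the shot. In short, the processing required to handle just the facial recognition for just a few seconds of video is staggering.

On top of that, a professionally edited video will often cut back and forth among several presenters, the audience, and graphics (e.g., websites or PowerPoint slides).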
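A quick back-of-the-envelope calculation shows the scale. The sketch below assumes the worst case, in which every frame is treated as a separate still image, and it assumes 3 seconds of processing per frame as a stand-in for "several seconds"; both the figure and the `brute_force_seconds` helper are illustrative assumptions, not measurements.

```python
def brute_force_seconds(clip_seconds, fps, seconds_per_frame):
    """Processing time if every frame is analyzed as a separate still image."""
    return clip_seconds * fps * seconds_per_frame


# Assumed cost of 3.0 seconds per frame; real systems will differ.
for fps in (24, 60):
    cost = brute_force_seconds(clip_seconds=10, fps=fps, seconds_per_frame=3.0)
    print(f"{fps} fps: {cost / 60:.0f} minutes for a 10-second clip")
```

Under those assumptions, 10 seconds of video costs 12 minutes of processing at 24 frames per second and 30 minutes at 60, before accounting for multiple faces in a shot.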

So not only does the facial recognition portion of an EVP solution need to identify when a presenter appears on screen, but also when that person disappears and then reappears again within a given threshold of time.

In other words, facial recognition needs both a tolerance threshold and an aggregation function, so users can search for a person and receive results that generalize sections of a video in which a particular presenter appears.

Given all of the above, the good news is that this kind of multi-face, multiframe facial recognition might actually help solve the problem of finding the right video clip to help with our budget problem. If you suddenly remember that it was two co-presenters who were talking about the new budget and how it impacts your department, is it possible for an EVP to search for more than one person at the same time?
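One way to picture the threshold and the aggregation working together is the sketch below. It assumes the recognition step has already produced a list of timestamps at which a given presenter was detected; the `aggregate_appearances` function, the 5-second tolerance, and the sample timestamps are hypothetical choices for illustration, not any vendor's implementation.

```python
def aggregate_appearances(timestamps, max_gap=5.0):
    """Merge per-frame detection times (seconds) into (start, end) sections.

    Two detections belong to the same section when they are no more than
    max_gap seconds apart, so a brief cutaway does not split the result.
    """
    sections = []
    for t in sorted(timestamps):
        if sections and t - sections[-1][1] <= max_gap:
            sections[-1][1] = t        # extend the current section
        else:
            sections.append([t, t])    # gap too large: start a new section
    return [tuple(s) for s in sections]


# Hypothetical detection times for one presenter, in seconds.
detections = [10.0, 11.0, 12.0, 15.0, 40.0, 41.0, 43.5, 90.0]
sections = aggregate_appearances(detections, max_gap=5.0)
print(sections)  # [(10.0, 15.0), (40.0, 43.5), (90.0, 90.0)]
```

A larger tolerance yields fewer, longer sections that survive cuts to slides or the audience; a smaller one yields more precise but more fragmented results.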


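One that does offer this capability, in a basic form, is the new Microsoft Stream service. Designed to replace the legacy Office 365 Video service, Stream makes it possible to select more than one person to search for, at least for on-demand content.

To do so, Stream offers features like audio transcriptions and face detection as a way to find relevant content.

Beyond that, Stream also offers the ability to search for text that appears in a video, even specific words or people shown on screen, whether in a single video or across all of your company’s videos.

According to the Stream website, built-in machine learning intelligence also drives accessibility features, so that everyone can participate according to their needs.

Microsoft Stream offers audio transcriptions and face detection as ways to find relevant content for an extra $2 per month beyond the $3 per month, per-user base fee.

Stream is available to Office 365 subscribers, but it isn’t limited to subscribers. For those without Office 365, pricing is per user, per month. The basic service, for $3 per month per user, offers a way to aggregate, organize, and search videos. For an additional $2 per user per month, Stream offers two features key to our scenario: search “using deep search based on in-content signals like speech to text,” and search using face detection and audio transcripts.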
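As a rough illustration of what a search for two people involves (a generic sketch, not a description of how Stream implements it), suppose each presenter’s on-screen appearances are already available as time ranges. Finding the parts of a video where both appear is then an interval intersection. The `intersect_sections` function and the sample ranges are hypothetical.

```python
def intersect_sections(a, b):
    """Return the time ranges where sections from both lists overlap.

    a and b are sorted lists of non-overlapping (start, end) tuples in seconds.
    """
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever section ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical appearance ranges for two co-presenters, in seconds.
presenter_a = [(0.0, 120.0), (300.0, 420.0)]
presenter_b = [(90.0, 200.0), (310.0, 350.0), (400.0, 500.0)]
both = intersect_sections(presenter_a, presenter_b)
print(both)  # [(90.0, 120.0), (310.0, 350.0), (400.0, 420.0)]
```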


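Implementing search in live video is possible, but the processing requirements mentioned above make it impractical for most on-premises solutions. The growth areas in making live video searchable will most likely come from cloud-based EVP solutions.

One possible approach, if your web-based unified communications tool offers traditional videoconferencing functionality, is to add an endpoint with these indexing features to the call. This allows the endpoint to begin recording the meeting, in much the same way as a traditional network digital video recorder (NDVR), and then begin processing and indexing video frames at close to real-time speed.

In reality, this currently takes two to three times the actual length of the videoconference, but algorithm optimizations and increasingly powerful processors may bring this down closer to real time in the next 2 years. Also expect to see this type of feature added to web-only services like Zoom and Skype for Business in the near term.

Likewise, portable production and video capture solutions are growing in popularity. Mike Savello, VP of sales at LiveU, says that live enterprise video is becoming a bigger part of the company’s target market.

“In fact, we are now targeting Fortune 500 companies that regularly hold internal global events,” Savello says, noting these events might span a range from CEO addresses and quarterly updates to new product or service announcements.

In the past, Savello says, these types of corporate events might “involve a company renting a production truck and a satellite truck, which can be very expensive.”

Although usually associated with “live” capture for sports and news, cellular-connected solutions like LiveU are finding a home in the enterprise, especially among Fortune 500 companies that regularly hold internal global events.

“We can offer ‘at-home’ production,” Savello says, “by backhauling each camera over cellular or other IP connectivity to a central production platform. That means you no longer need a production truck or a satellite truck on-site.”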
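To put the two-to-three-times figure in concrete terms, the short sketch below converts a meeting length into an estimated indexing window. The `indexing_time_range` helper is hypothetical, and the multipliers simply restate the range given above.

```python
def indexing_time_range(meeting_minutes, low=2.0, high=3.0):
    """Estimated (min, max) indexing time in minutes for a recorded meeting."""
    return meeting_minutes * low, meeting_minutes * high


fastest, slowest = indexing_time_range(60)
print(f"A 60-minute meeting: {fastest:.0f} to {slowest:.0f} minutes to index")
```

So a one-hour all-hands meeting would be fully searchable somewhere between two and three hours after indexing starts.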


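One pitch we’ve heard over the past year is that enterprise content can be monetized. The pitch is usually framed as a contrast, with companies noting that regular social platforms should not be considered as equivalent to EVPs, for several reasons, including their inability to monetize content.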

