If you’re interested in the kinds of challenges OkCupid’s backend team works on, we’re hiring!

As a dating app, an integral part of the experience is recommending potential matches based on the myriad preferences you’ve set and the preferences your potential matches have set. As you can imagine, there are many incentives to optimize this part of the experience, as it is the very first step everyone starts at before getting to a match, a conversation, and beyond.

Your set preferences, however, aren’t the only factors in how we recommend potential matches to you (or recommend you to other potential matches). If we simply showed all the users that met your criteria without any sort of ranking, the end result would be far fewer matches. For example, if we didn’t try to incorporate a user’s recent activity into the results, there would be a much higher chance that you spend your time interacting with someone who hasn’t used the app recently. That certainly doesn’t set users up for success! Beyond just the preferences you and others set, we leverage numerous algorithms and factors to recommend the users we think you should see.

When serving recommendations, we need to serve the best results at that point in time and allow you to continuously see more recommendations as you like or pass on your potential matches. In other apps, where the content itself may not change often or where such timeliness is less critical, this could be done through offline systems, regenerating those recommendations every so often. For example, when using Spotify’s “Discover Weekly” feature you can enjoy a set of recommended tracks, but that set is frozen until the next week. In the case of OkCupid, we allow users to endlessly view their recommendations in real time. The “content” that we recommend -- our users -- is highly dynamic in nature (e.g. a user can join, change their preferences, profile details, or location, or deactivate at any time), and this can change to whom and how they should be recommended, so we want to make sure that the potential matches you see are some of the best recommendations available to you at that point in time.

To tap into the various ranking algorithms while being able to continuously serve recommendations in real-time, we need to make use of a search engine that is constantly kept up-to-date with user data and provides the capability to filter and rank potential candidates.

The problems with the existing matching system

OkCupid has been utilizing a custom in-house matching system for years. We won’t go into full detail on that matching system, but at a high level, imagine a map-reduce framework over shards of the user space, with each shard holding in memory some portion of relevant user data that is used in processing various filters and sorts on the fly. Searches fan out to all shards and ultimately the results are merged to return the top k candidates; a minimal sketch of this scatter-gather shape follows below.
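To make that shape concrete, here’s a minimal sketch (illustrative only, not the actual system) of the scatter-gather pattern in Kotlin:

// Each shard holds a slice of the user base in memory and can compute
// its own top k for a query; the coordinator merges those local results
// into a global top k.
data class Candidate(val userId: Long, val score: Double)

interface Shard {
    fun topK(filters: Map<String, Any>, k: Int): List<Candidate>
}

fun search(shards: List<Shard>, filters: Map<String, Any>, k: Int): List<Candidate> =
    shards
        .flatMap { it.topK(filters, k) }   // fan out to every shard
        .sortedByDescending { it.score }   // merge the per-shard top k lists
        .take(k)                           // keep the global top k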

This custom-built matching system had served the team well, so why did we decide to change it now? With the team growing and plans to support more recommendation-driven projects over the coming years, we knew we needed to revamp this system. One of the biggest pain points was in development: schema updates, like adding a new piece of data about a user (e.g. a user’s preferred gender tags), required hundreds to thousands of lines of boilerplate code, and deployment required careful coordination to ensure all parts of the system were deployed in the right order. Simply trying to add a new way to filter the user set or a new way to rank results required half a day of an engineer's time to manually deploy to every shard in production and to remain apprised of issues that might come up; rollbacks weren’t much faster. More importantly, it was becoming difficult to operate and scale the system, since shards and replicas were manually allocated and distributed across a fleet of bare-metal machines. Early in 2019, as load on the match system increased, we needed more search capacity, so we added another replica set by manually placing service instances across multiple machines -- a multi-week effort between the backend and operations teams. At this time we also started to notice performance bottlenecks in the in-house-built service discovery system, message queue, and so on. While these components had previously served the company well, we were reaching a point in load at which we were uncertain whether any one of these subsystems could itself scale. We had goals to move more of our workload into a cloud environment, and shifting the matching system, itself a laborious task, would also require bringing along all of these other subsystem components.

Today at OkCupid, many of these subsystems are served by more robust, cloud-friendly OSS alternatives, and the team has adopted various new technologies over the past two years to great success. We won’t cover those efforts in this blog post; instead we’ll focus on how we addressed the problems above by migrating to a more developer-friendly and scalable search engine: Vespa.

It’s a match! Why OkCupid matched with Vespa

Historically OkCupid has been a small team, and we knew early on that tackling the core of a search engine ourselves would be extremely difficult and complicated, so we looked at open source options that could support our use cases. The two big contenders were Elasticsearch and Vespa.

Elasticsearch

This is a popular option with a large community, documentation, and support. There are numerous features and it’s even used by Tinder. In terms of development experience, new schema fields can be added with PUT mappings, queries can be done through structured REST calls, there is some support for query-time ranking, the ability to write custom plugins, and more. In terms of scaling and maintenance, you only need to determine the number of shards and the system handles the distribution of replicas for you. Scaling, however, requires rebuilding another index with a higher shard count.

One of the biggest reasons we opted against Elasticsearch was the lack of true in-memory partial updates. This was very important for our use case, since the documents we would be indexing -- our users -- need to be updated very frequently through likes/passes, messaging, and so on. These documents are highly dynamic in nature compared to content like ads or images, which are mostly static objects with attributes that change infrequently, so the inefficient read-and-write cycles on update were a major performance concern for us.

Vespa

Vespa was open sourced just a few years ago, claiming to support storing, searching, ranking, and organizing big data at user serving time. Vespa supports:

  • high feed performance through true in-memory partial updates without the need to re-index the entire document (reportedly up to 40-50k updates per second per node); see the example after this list
  • a flexible ranking framework that allows processing at query time
  • direct integration with machine-learning models (e.g. TensorFlow) in ranking
  • queries expressed through an expressive YQL (Yahoo Query Language) in REST calls
  • the ability to customize logic via Java components
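For instance, a partial update through Vespa’s /document/v1 REST API touches only the field in question, with no re-indexing of the rest of the document. A minimal sketch (the document id and namespace here are illustrative, matching the user schema used later in this post):

PUT http://localhost:8080/document/v1/user/user/docid/777
{
    "fields": {
        "lastOnline": { "assign": 1592486978 }
    }
}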

In terms of scaling and maintenance, you never have to think about shards again: you configure the layout of content nodes, and Vespa automatically handles splitting the document set into buckets, replicating, and distributing the data. Additionally, data is automatically recovered and redistributed whenever nodes are added or removed. Scaling simply means updating the configuration to add nodes and letting Vespa redistribute the data automatically; a sketch of what that configuration looks like follows.
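For illustration, a content cluster in services.xml is just a list of nodes plus a redundancy target, with no shard arithmetic anywhere (node counts and aliases here are hypothetical):

<content id="user" version="1.0">
    <redundancy>2</redundancy>
    <documents>
        <document type="user" mode="index" />
    </documents>
    <nodes>
        <node hostalias="node0" distribution-key="0" />
        <node hostalias="node1" distribution-key="1" />
        <node hostalias="node2" distribution-key="2" />
    </nodes>
</content>

Scaling out means adding another node line and redeploying the application package.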

Overall, Vespa seemed to support our use cases best. OkCupid holds many different pieces of information about users to help them find their best matches: in terms of just filters and sorts, there are over 100 of them! We will always be adding more filters and sorts, so being able to support that workflow was important. When it comes to writes and queries, Vespa was the most analogous to our existing matching system; that is, our matching system also required fast in-memory partial updates and real-time processing at query time for ranking. Vespa also has a much more flexible and straightforward ranking framework; the ability to express queries in YQL, as opposed to the awkward structure of Elasticsearch queries, was just another nice bonus. When it came to scaling and maintenance, Vespa’s automatic data distribution capabilities were very appealing given our relatively small team size. All in all, it seemed Vespa would give us a better shot at supporting our use cases and performance requirements, while being easier to maintain than Elasticsearch.

Elasticsearch is more widely known, and we could learn from Tinder’s use of it, but either option would require lots of upfront research and investigation. Vespa was already serving many production use cases, like Zedge, Flickr serving billions of images, and the Yahoo Gemini Ads platform with over one hundred thousand requests per second to serve ads to a billion monthly active users. This gave us confidence that it was a battle-tested, performant, and reliable option; in fact, the origins of Vespa have been around for longer than Elasticsearch has.

Additionally, the Vespa team has been very involved and helpful. Vespa was originally built to serve ads and content pages, and as far as we know it hadn’t been used on a dating platform before. Our initial usage of Vespa struggled because ours was such a unique use case, but the Vespa team has been super responsive, quickly optimizing the system and helping us handle the several issues that arose.

How Vespa works and what a search looks like at OkCupid

Architecture

Before we dive into our Vespa use case, here’s a quick overview of how Vespa works. Vespa is a collection of numerous services, but each Docker container can be configured to fulfill the role of an admin/config node, a stateless Java container node, and/or a stateful C++ content node. An application package containing configuration, components, ML models, etc. can be deployed via the State API to the config cluster, which handles applying the changes to the container and content clusters. Feed requests and queries all go over HTTP through the stateless Java container (which allows for custom processing) before landing in the content cluster, where document storage and query execution take place. For the most part, deploying a new application package takes only a few seconds, and Vespa handles making those changes live in the container and content clusters, so you rarely have to restart anything.
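For reference, an application package at the time was roughly a directory like the following (layout abbreviated; newer Vespa versions use schemas/ in place of searchdefinitions/):

app/
    services.xml            # topology: config, container, and content clusters
    hosts.xml               # host aliases referenced by services.xml
    searchdefinitions/
        user.sd             # a schema definition like the one shown below
    components/
        match-searcher.jar  # custom Java/Kotlin components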

What does a search look like?

The documents we maintain in our Vespa cluster contain a myriad of attributes about a given user. The schema definition defines the fields of the document type, as well as rank profiles that contain collections of applicable ranking expressions. Suppose we have a schema definition representing a user like so:

search user {

    document user {

        field userId type long {
            indexing: summary | attribute
            attribute: fast-search
            rank: filter
        }

        field latLong type position {
            indexing: attribute
        }

        # UNIX timestamp
        field lastOnline type long {
            indexing: attribute
            attribute: fast-search
        }

        # Holds the users this user document has liked,
        # with the corresponding weights being the UNIX
        # timestamps of when those likes happened
        field likedUserSet type weightedset<long> {
            indexing: attribute
            attribute: fast-search
        }

    }

    rank-profile myRankProfile inherits default {

        rank-properties {
            query(lastOnlineWeight): 0
            query(incomingLikeWeight): 0
        }

        function lastOnlineScore() {
            expression: query(lastOnlineWeight) * freshness(lastOnline)
        }

        function incomingLikeTimestamp() {
            expression: rawScore(likedUserSet)
        }

        function hasLikedMe() {
            expression: if (incomingLikeTimestamp > 0, 1, 0)
        }

        function incomingLikeScore() {
            expression: query(incomingLikeWeight) * hasLikedMe
        }

        first-phase {
            expression: lastOnlineScore + incomingLikeScore
        }

        summary-features {
            lastOnlineScore
            incomingLikeScore
        }

    }

}

The indexing: attribute designation indicates that these fields should be kept in memory, giving us the best write and read performance on them.

Suppose we populated the cluster with such user documents. We could then search, filtering and ranking on any of the fields above. For example, we can make a POST request to the default search handler at http://localhost:8080/search/ to find users, other than our own user 777, within 50 miles of our location, who have been online since timestamp 1592486978, ranked by most recent activity, keeping the top two candidates. Let’s also select summaryfeatures to help us see the contribution of each ranking expression in our rank profile:

{ "yql": "select userId, summaryfeatures from user where lastOnline > 1592486978 and !(userId contains \"777\") limit 2;", "ranking": { "profile": "myRankProfile", "features": { "query(lastOnlineWeight)": "50" } }, "pos": { "radius": "50mi", "ll": "N40o44'22;W74o0'2", "attribute": "latLong" }, "presentation": { "summary": "default" } }

We might get a result like:

{
    "root": {
        "id": "toplevel",
        "relevance": 1.0,
        "fields": {
            "totalCount": 317
        },
        "coverage": {
            "coverage": 100,
            "documents": 958,
            "full": true,
            "nodes": 1,
            "results": 1,
            "resultsFull": 1
        },
        "children": [
            {
                "id": "index:user/0/bde9bd654f1d5ae17fd9abc3",
                "relevance": 48.99315843621399,
                "source": "user",
                "fields": {
                    "userId": -5800469520557156329,
                    "summaryfeatures": {
                        "rankingExpression(incomingLikeScore)": 0.0,
                        "rankingExpression(lastOnlineScore)": 48.99315843621399,
                        "vespa.summaryFeatures.cached": 0.0
                    }
                }
            },
            {
                "id": "index:user/0/e8aa37df0832905c3fa1dbbd",
                "relevance": 48.99041280864198,
                "source": "user",
                "fields": {
                    "userId": 6888497210242094612,
                    "summaryfeatures": {
                        "rankingExpression(incomingLikeScore)": 0.0,
                        "rankingExpression(lastOnlineScore)": 48.99041280864198,
                        "vespa.summaryFeatures.cached": 0.0
                    }
                }
            }
        ]
    }
}

After filtering the matching hits, the first-phase ranking expressions are evaluated to rank the hits. The relevance returned is the overall score resulting from the first-phase ranking function of the rank-profile we specified in our query, i.e. "ranking": { "profile": "myRankProfile" }. In the ranking features we specified the query feature query(lastOnlineWeight) as 50, which is then referenced in the only ranking expression we use: lastOnlineScore. It makes use of the built-in rank feature freshness, which is a number close to 1 if the timestamp in the attribute is recent compared to the current timestamp. So far so good, nothing too tricky here.
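As a quick sanity check on the numbers above: since freshness returns a value in [0, 1], the top hit’s relevance of 48.99315843621399 is simply query(lastOnlineWeight) * freshness(lastOnline) = 50 * ~0.9799, i.e. a user whose lastOnline timestamp is very recent.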

Unlike static content, a user can have a bearing on whether they should be shown to you at all. For example, they could have liked you! We can index a weighted set field likedUserSet on each user document, holding as its keys the userIds of the users they have liked and as its values the timestamps of when those likes happened. Filtering on those who have liked you would then be straightforward (e.g. adding a likedUserSet contains \"777\" clause to the YQL), but how do we incorporate that weighted set information during ranking? How do we boost a user who has liked our user in the results?

In the previous results, the ranking expression incomingLikeScore was 0 for both hits. User 6888497210242094612 had actually liked user 777, but this wasn’t accessible in ranking, even if we had provided "query(incomingLikeWeight)": 50. We can make use of the rank() function in YQL (the first, and only the first, argument of the rank() function determines whether a document is a match, but all arguments are used in calculating the rank score) together with a dotProduct in our YQL rank clause to store and retrieve raw scores (in this case, the timestamp of when the user liked us), like so:

{
    "yql": "select userId, summaryfeatures from user where !(userId contains \"777\") and rank(lastOnline > 1592486978, dotProduct(likedUserSet, {\"777\":1})) limit 2;",
    "ranking": {
        "profile": "myRankProfile",
        "features": {
            "query(lastOnlineWeight)": "50",
            "query(incomingLikeWeight)": "50"
        }
    },
    "pos": {
        "radius": "50mi",
        "ll": "N40o44'22;W74o0'2",
        "attribute": "latLong"
    },
    "presentation": {
        "summary": "default"
    }
}
{
    "root": {
        "id": "toplevel",
        "relevance": 1.0,
        "fields": {
            "totalCount": 317
        },
        "coverage": {
            "coverage": 100,
            "documents": 958,
            "full": true,
            "nodes": 1,
            "results": 1,
            "resultsFull": 1
        },
        "children": [
            {
                "id": "index:user/0/e8aa37df0832905c3fa1dbbd",
                "relevance": 98.97595807613169,
                "source": "user",
                "fields": {
                    "userId": 6888497210242094612,
                    "summaryfeatures": {
                        "rankingExpression(incomingLikeScore)": 50.0,
                        "rankingExpression(lastOnlineScore)": 48.97595807613169,
                        "vespa.summaryFeatures.cached": 0.0
                    }
                }
            },
            {
                "id": "index:user/0/bde9bd654f1d5ae17fd9abc3",
                "relevance": 48.9787037037037,
                "source": "user",
                "fields": {
                    "userId": -5800469520557156329,
                    "summaryfeatures": {
                        "rankingExpression(incomingLikeScore)": 0.0,
                        "rankingExpression(lastOnlineScore)": 48.9787037037037,
                        "vespa.summaryFeatures.cached": 0.0
                    }
                }
            }
        ]
    }
}

Now user 6888497210242094612 has been boosted to the top, since they liked our user and their incomingLikeScore nets the full value. Of course, we actually have the timestamp of when they liked us, so we could utilize it in more sophisticated expressions, but we’ll keep it simple for now.
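As one hedged sketch of what “more sophisticated” could mean (illustrative, not our production ranking), the rank profile could decay the boost by how long ago the like happened, using the built-in now rank feature:

function recentIncomingLikeScore() {
    # decays the boost exponentially with the age of the like
    # (e-folding time of one day); evaluates to 0 whenever hasLikedMe is 0
    expression: query(incomingLikeWeight) * hasLikedMe * exp((incomingLikeTimestamp - now) / 86400)
}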

This demonstrates the mechanics of filtering and ranking results through the ranking framework. The ranking framework provides a flexible way of applying ranking expressions (which are mostly just math) to hits at query time.

Customizing the middleware Java layer

What if we wanted to support a different path and make that dotProduct clause an implicit part of every query? That’s where the customizable Java container layer comes in: we can write a custom Searcher component. This lets us process arbitrary parameters, rewrite the query, and process results in certain ways. Here’s an example in Kotlin:

// Imports from Vespa's container/search APIs (added here for completeness)
import com.yahoo.component.chain.dependencies.After
import com.yahoo.prelude.query.DotProductItem
import com.yahoo.search.Query
import com.yahoo.search.Result
import com.yahoo.search.Searcher
import com.yahoo.search.searchchain.Execution
import com.yahoo.search.searchchain.PhaseNames

@After(PhaseNames.TRANSFORMED_QUERY)
class MatchSearcher : Searcher() {

    companion object {
        // HTTP query parameter
        val USERID_QUERY_PARAM = "userid"

        val ATTRIBUTE_FIELD_LIKED_USER_SET = "likedUserSet"
    }

    override fun search(query: Query, execution: Execution): Result {
        val userId = query.properties().getString(USERID_QUERY_PARAM)?.toLong()

        // Add the dotProduct clause
        if (userId != null) {
            val rankItem = query.model.queryTree.getRankItem()
            val likedUserSetClause = DotProductItem(ATTRIBUTE_FIELD_LIKED_USER_SET)
            likedUserSetClause.addToken(userId, 1)
            rankItem.addItem(likedUserSetClause)
        }

        // Execute the query
        query.trace("YQL after is: ${query.yqlRepresentation()}", 2)
        return execution.search(query)
    }
}

Then in our services.xml file we can configure this component like so:

...
<search>
    <chain id="default" inherits="vespa">
        <searcher id="MatchSearcher" bundle="match-searcher" />
    </chain>
</search>
<handler id="default" bundle="match-searcher">
    <binding>http://*:8080/match</binding>
</handler>
...

Then we simply build and deploy the application package again, and now when we issue our query, minus the explicit dotProduct clause, to the custom handler at http://localhost:8080/match?userid=777:

{
    "yql": "select userId, summaryfeatures from user where !(userId contains \"777\") and rank(lastOnline > 1592486978) limit 2;",
    "ranking": {
        "profile": "myRankProfile",
        "features": {
            "query(lastOnlineWeight)": "50",
            "query(incomingLikeWeight)": "50"
        }
    },
    "pos": {
        "radius": "50mi",
        "ll": "N40o44'22;W74o0'2",
        "attribute": "latLong"
    },
    "presentation": {
        "summary": "default"
    }
}

We get the same results back as before! Note that in the Kotlin code example we added a trace to print out the YQL representation after our modification, and if we set tracelevel=2 in the URL params, the response also shows:

...
{
    "message": "YQL after is: select userId, summaryfeatures from user where rank(lastOnline > 1592486978, dotProduct(likedUserSet, {\"777\": 1})) AND !(userId contains \"777\") limit 2;"
},
...

The middleware Java container layer is a powerful way to add custom logic handling via Searchers, or to customize the rendering of results via Renderers. We customize our Searchers to handle cases like the above and other aspects we want to make implicit in our searches. For example, one product concept we support is the idea of “mutual fit”: you may be searching for users with certain preference criteria (like age range and distance), but you must also fit the search criteria of the candidates. To support this use case in our Searcher component, we can fetch the searching user’s own document to provide some of its attributes in the subsequent fan-out query for filtering and ranking (a hedged sketch follows below). The ranking framework and the custom middleware layer together give us a flexible way to support our many use cases. We’ve only covered a few aspects in these examples, but there’s extensive documentation available here.
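Here’s a minimal sketch of that mutual-fit idea. To keep it self-contained, it assumes the caller passes the searching user’s age as a hypothetical myage parameter (our real component looks attributes up from the user’s own document instead) and assumes hypothetical seekingAgeMin/seekingAgeMax attribute fields on candidate documents:

import com.yahoo.component.chain.dependencies.After
import com.yahoo.prelude.query.AndItem
import com.yahoo.prelude.query.RangeItem
import com.yahoo.search.Query
import com.yahoo.search.Result
import com.yahoo.search.Searcher
import com.yahoo.search.searchchain.Execution
import com.yahoo.search.searchchain.PhaseNames

@After(PhaseNames.TRANSFORMED_QUERY)
class MutualFitSearcher : Searcher() {
    override fun search(query: Query, execution: Execution): Result {
        // Hypothetical parameter; in production this would come from the
        // searching user's own document
        val myAge = query.properties().getString("myage")?.toLong()
            ?: return execution.search(query)

        // AND the existing query tree with "the candidate is seeking my age":
        // seekingAgeMin <= myAge <= seekingAgeMax
        val mutualFit = AndItem()
        mutualFit.addItem(query.model.queryTree.root)
        mutualFit.addItem(RangeItem(0L, myAge, "seekingAgeMin"))
        mutualFit.addItem(RangeItem(myAge, Long.MAX_VALUE, "seekingAgeMax"))
        query.model.queryTree.setRoot(mutualFit)

        return execution.search(query)
    }
}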

How we built out and productionized the Vespa cluster

In the spring of 2019 we started hashing out the plans to build this new system. At this point we also reached out to the Vespa team and began consulting them regularly about our use cases. Our operations team estimated and built out the initial cluster setup, and the backend team began documenting, designing, and prototyping our various use cases in Vespa.

Early prototyping phases

At OkCupid, our backend systems are written in Golang and C++. In order to write custom logic components for Vespa, as well as to achieve high feed rates using the Java Vespa HTTP feed client API, we had to become familiar with the JVM environment; we ended up utilizing Kotlin for our custom Vespa components as well as for our feeding pipeline.

From there, it was a matter of porting over application logic that had accumulated over the years and uncovering what was possible in Vespa, consulting the Vespa team when necessary. Much of our matching system logic is in C++, so we also added logic to translate our current data model of filters and sorts into the equivalent YQL queries that we issue over REST to the Vespa cluster. Early on we also made sure to establish a good pipeline for repopulating the cluster with a full set of documents; prototyping would involve many changes to determine the correct field types to utilize, inadvertently requiring refeeds of the documents.

Monitoring and load testing

As we built out our Vespa search cluster, we needed to make sure of two things: that it could handle anticipated search and write traffic and that the recommendations served by this system were comparable in quality to the existing matching system.

Before load testing we added Prometheus metrics everywhere. The vespa-exporter provides a wealth of statistics, and Vespa itself exposes a small set of additional metrics as well. From these we created various Grafana dashboards tracking queries per second, latency, resource usage by the Vespa processes, and so on. We also ran vespa-fbench to test out query performance, and with the help of the Vespa team determined that, due to our relatively high static query cost, a grouped layout would provide us higher throughput. In a flat layout, adding more nodes would mainly only cut down on the dynamic query cost (i.e. the portion of the query cost that depends on the number of documents indexed). A grouped layout means that each configured group of nodes contains the full document set, so a single group can serve a query on its own. Due to our high static query cost, while keeping our node count the same, we increased our throughput much more by going from a flat layout of effectively one group to three groups. Lastly, we also performed live shadow traffic testing after we had gained confidence in the static benchmarks.
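To make the grouped layout concrete, here’s roughly what the content cluster definition looks like with three groups (node counts per group abbreviated to one for brevity); with the redundancy matching the group count, each group holds a full copy of the corpus and can answer a query by itself:

<content id="user" version="1.0">
    <redundancy>3</redundancy>
    <documents>
        <document type="user" mode="index" />
    </documents>
    <group>
        <distribution partitions="1|1|*" />
        <group name="group0" distribution-key="0">
            <node hostalias="node0" distribution-key="0" />
        </group>
        <group name="group1" distribution-key="1">
            <node hostalias="node1" distribution-key="1" />
        </group>
        <group name="group2" distribution-key="2">
            <node hostalias="node2" distribution-key="2" />
        </group>
    </group>
</content>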

Performance tuning

One of the biggest hurdles we faced, however, was feed performance. Early on we had trouble handling even 1,000 QPS of updates. We had heavily utilized weighted set fields, but these weren’t performant at first. Luckily the Vespa team promptly helped fix these issues, as well as others around data distribution. The Vespa team has since added extensive documentation on feed sizing, much of which we employ to some degree: using integer fields in large weighted sets where possible, allowing batching by setting a visibility-delay, favoring few conditional updates and having those rely on attribute (i.e. in-memory) fields, and reducing client round-trips by compressing and batching the operations in our feeding pipeline. Now the pipeline comfortably handles 3k QPS of updates at steady state, and our modest cluster has been observed handling 11k QPS of updates when there’s a backlog of operations for whatever reason.
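As one illustrative example of those techniques, a conditional (test-and-set) update against an attribute field via /document/v1 looks roughly like this (the condition would be URL-encoded in practice, and the ids/values are hypothetical):

PUT http://localhost:8080/document/v1/user/user/docid/777?condition=user.lastOnline < 1592486978
{
    "fields": {
        "lastOnline": { "assign": 1592486978 }
    }
}

Because lastOnline is an attribute, the condition can be evaluated against the in-memory value, which is the cheap path the feed sizing documentation recommends.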

Recommendation quality

After we were confident that the cluster could handle the load, we needed to validate that the quality of the recommendations was just as good as, if not better than, that of the existing system. It wasn’t possible to perfectly replicate all existing behavior, yet any minor deviation in how the ranking was implemented would have outsized effects on the general quality of the recommendations and on the overall ecosystem. For this we applied our experimentation system: a number of test groups received recommendations through Vespa, while a control group continued to use the existing matching system. We analyzed several defensive business metrics, iterating and fixing issues until the Vespa groups’ results were as good as, if not better than, the control group’s. Once we were confident in the results served by Vespa, we simply routed recommendation queries to the Vespa cluster. We were able to swap all the search traffic over to the Vespa cluster without a hitch!

System diagram

In the end, the simplified architecture of the new system looks like this:
[Architecture diagram: OKC-Vespa-architecture-redux, a simplified overview of the new Vespa-backed matching system]

How Vespa is doing now and what’s next

Let’s compare the state of the matching system, now powered by Vespa, with our legacy system:

  • Schema updates
    • Before: a calendar week spent on hundreds of lines of code changes, with a carefully coordinated deployment across multiple subsystems
    • After: a couple of hours to add a simple field to the schema definition and deploy the application package
  • Adding a new ranking
    • Before: half a day spent on deployment
    • After: ranking expressions are also just an update to the schema definition and can be deployed to the live system. That means it takes only a few seconds to take effect!
  • Scaling and maintenance
    • Before: a multi-week effort to manually distribute shards and stand up production service runfiles for high availability
    • After: simply add new nodes to the configuration file, and Vespa automatically distributes the data to meet the desired redundancy level. Most of our operations require no manual intervention and no restarting of stateful nodes

Overall, the development and maintenance aspects of the Vespa cluster have been a boon for OkCupid’s product roadmap. Since late January 2020 we have served all of our recommendations through our Vespa cluster. We have also added dozens of new fields, ranking expressions, and use cases to support major product releases this year, like Stacks. And unlike with our previous matching system, we now use machine-learning models at query time.

What’s next?

For us, one of Vespa’s biggest selling points is its direct support for tensors in ranking and for integration with models trained in frameworks like TensorFlow. This capability is one of the major features we hope to continue to leverage in the coming months. We’re already making use of tensors for certain use cases, and we’re excited to soon look into integrating more machine-learning models that we hope will better predict outcomes and matches for our users.
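As a sketch of what that integration looks like (the model directory and rerank depth here are illustrative, not our production setup), a TensorFlow SavedModel placed under the application package’s models/ directory can be evaluated as a second-phase ranking expression:

rank-profile mlRankProfile inherits myRankProfile {
    second-phase {
        # re-rank the top hits from the first phase with the imported model
        rerank-count: 200
        expression: sum(tensorflow("match_model/saved"))
    }
}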

Additionally, Vespa recently released support for high-dimensional approximate nearest neighbor indexes that are fully realtime, concurrently searchable, and dynamically updatable. We look forward to exploring other use cases for realtime nearest neighbor search.
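As a hypothetical sketch (the embedding field name and dimensions are illustrative), this would mean adding a tensor field with an HNSW index to the schema:

field profileEmbedding type tensor<float>(x[128]) {
    indexing: attribute | index
    index {
        hnsw
    }
}

and then querying with the nearestNeighbor operator, passing the query-side tensor as a ranking feature:

"yql": "select userId from user where [{\"targetHits\": 100}] nearestNeighbor(profileEmbedding, queryEmbedding);"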

OkCupid x Vespa. Ship it!

Many folks have heard of or worked with Elasticsearch, but the community around Vespa isn’t nearly as big. We believe many other applications built on Elasticsearch today would be better served by Vespa. Vespa has been a great match for OkCupid’s use cases, and we’re glad we made this investment. This new architecture has allowed us to move and deliver new features much faster. As a relatively small team, it’s also great not having to worry about operational complexity, and we’re now much better prepared to scale out our search capacity. We certainly couldn’t have made the progress we did over the past year without it. For more on Vespa’s capabilities and technical features, be sure to check out Search and recommendation with Vespa.ai by @jobergum.

We made the first move in liking and sending the Vespa team a message. They messaged us back and it was a match! We could not have done this without the help of the Vespa team. Special thanks to @jobergum and @geirst for providing guidance around queries and ranking, and a super special shoutout to @kkraune and @vekterli for all the support. The level of support and effort the team provided was truly awesome, from digging deep into our use cases to diagnosing performance issues to shipping enhancements in the Vespa engine on short notice. @vekterli even flew out to our New York office to work with us directly for a week to make sure our integration and use cases could be met. Thank you so much to the team at Vespa!

In closing, we’ve only touched on a few aspects of our work with Vespa, but none of it would have been possible without the tremendous work of our backend and operations teams over the past year. We faced many unique challenges in bridging the gap between our existing systems and a more modern tech stack, but those are blog posts for another time.

If you’re interested in the kinds of challenges OkCupid’s backend team works on, we’re hiring!