Join Query in Solr

Solr has two broad kinds of Join: the regular Join and the Block Join.

This page covers the regular Join first.

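Working with Join is what made me realize that Solr's so-called cross-node query has a problem.

A Solr query runs against each replica (internally, a Core) on each node. Every query evaluation has to complete entirely inside one core, then entirely on one node, and only after that are the results merged.

As a result, if the data is spread across several shards, or some shard has no replica on the current node, the query either fails or returns fewer results than it should.

In our production environment Join was a hard requirement, so we were forced to configure the collection with a single shard and a replica count >= the number of Solr nodes.

With that configuration, the cloud mode that Solr advertises is effectively gone.

So Solr's computation model suits work that can truly be partitioned; for computations with dependencies between documents, it cannot deliver a real cloud mode.

I don't know whether Elasticsearch can do this.

Unless a test below says otherwise, it uses a single shard; multi-shard cases are called out explicitly.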

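Preparation

Let's use test results to see what Join actually does.

Starting SolrCloud

Solr in standalone mode is the simpler case, so this page uses only SolrCloud as the example.

First I bring up a SolrCloud with Docker (my local host machine runs Windows).

To keep things simple while still showing how multi-shard setups differ, we start 1 ZooKeeper node and 2 Solr nodes.

Run on the command line: docker-compose.exe -f solr_cloud.yml up -d

The content of solr_cloud.yml is as follows: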

version: '3.7'

services:
  solr1:
    image: solr:8.5.0
    container_name: solr1
    ports:
      - "8981:8983"
    environment:
      - ZK_HOST=zoo1:2181
    networks:
      - solr
    depends_on:
      - zoo1

  solr2:
    image: solr:8.5.0
    container_name: solr2
    ports:
      - "8982:8983"
    environment:
      - ZK_HOST=zoo1:2181
    networks:
      - solr
    depends_on:
      - zoo1

  zoo1:
    image: zookeeper:3.5
    container_name: zoo1
    restart: always
    hostname: zoo1
    ports:
      - 2181:2181
    environment:
      ZOO_MY_ID: 1
      ZOO_SERVERS: server.1=0.0.0.0:2888:3888;2181
      ZOO_4LW_COMMANDS_WHITELIST: mntr,conf,ruok
    networks:
      - solr

networks:
  solr:

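Creating the collection

Once SolrCloud is up, create a collection using the default configset (_default).

Send the HTTP command with a tool such as Postman (every HTTP command below was sent this way, but the same commands can also be run from the Solr Admin Console):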

POST http://localhost:8981/solr/admin/collections?action=CREATE&name=collection1&numShards=1&replicationFactor=1
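For clients other than Postman, the same request can be assembled in code. The following is a minimal sketch that only builds the URL and sends nothing. The helper name create_collection_url is my own and not part of any Solr client; the host and port are the ones mapped in the compose file above.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards, replication_factor):
    """Build a Collections API CREATE URL (hypothetical helper, sends nothing)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    return f"{base_url}/solr/admin/collections?{urlencode(params)}"


# Same parameters as the request above: 1 shard, 1 replica per shard.
url = create_collection_url("http://localhost:8981", "collection1", 1, 1)
print(url)
```

Changing num_shards to 2 gives the layout used for the multi-shard test further down.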

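We have now created a collection with the following properties:

name: collection1

shard num: 1

replication num: 1 (per shard)

This probably has no effect on the tests below, but for the record: what I created here is a single-shard, single-replica collection, which is not the same thing as a Solr started in standalone mode.

I have 2 Solr nodes, and an index directory was created on only one of them. In other words, when I query this collection, only the node that holds the replica receives the query request that does the real work (the other node could not handle it even if it wanted to).

Preparing the schema

Everything below uses the default dynamicField definitions, so no extra fields are created.

Among the default dynamicFields, *_i is an int, *_s is a string, and *_ss is a multi-valued string.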

Regular Join

Example 1-1: from single-valued string to single-valued string (single shard)

Preparing the index data

Run the index command:

POST http://localhost:8981/solr/collection1/update?commit=true&overwrite=true

[

{"id":"001", "single_s":"001", "other_s":"aa"},

{"id":"002", "single_s":"003", "other_s":"aa"},

{"id":"003", "single_s":"005", "other_s":"aa"},

{"id":"004", "single_s":"004", "other_s":"bb"},

{"id":"005"}

]

Test

GET http://localhost:8981/solr/collection1/select?fq={!join to=id from=single_s}other_s:aa&q=*:*&fl=id

001

003

005

Explanation:

other_s:aa returns the docs with id 001/002/003

{!join to=id from=single_s}other_s:aa translates to SQL roughly as select * from collection1 where id in (select single_s from collection1 where other_s contains 'aa'), which returns the docs with id 001/003/005

Final result: 001/003/005 are returned
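To make the semantics concrete, here is a small pure-Python simulation of what the join filter computes on a single index. It is only a sketch of the logic described above, not Solr's implementation; the join helper and the in-memory docs list are purely illustrative.

```python
def join(docs, from_field, to_field, matches):
    """Return docs whose to_field equals a from_field value of any doc accepted by matches."""
    from_values = set()
    for doc in docs:
        if matches(doc):
            value = doc.get(from_field)
            # Treat single-valued and multi-valued fields uniformly.
            values = value if isinstance(value, list) else [value]
            from_values.update(v for v in values if v is not None)
    return [doc for doc in docs if doc.get(to_field) in from_values]


# The documents indexed in Example 1-1.
docs = [
    {"id": "001", "single_s": "001", "other_s": "aa"},
    {"id": "002", "single_s": "003", "other_s": "aa"},
    {"id": "003", "single_s": "005", "other_s": "aa"},
    {"id": "004", "single_s": "004", "other_s": "bb"},
    {"id": "005"},
]

# Equivalent of fq={!join to=id from=single_s}other_s:aa
result_ids = [d["id"] for d in join(docs, "single_s", "id", lambda d: d.get("other_s") == "aa")]
print(result_ids)  # ['001', '003', '005']
```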

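Example 1-2: from single-valued string to single-valued string (multi-shard)

With the collection created with a shard num of 2, and the index data and query otherwise identical to Example 1-1, the result is: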

001

003

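005 is missing!

Note: this result is not necessarily fixed. It depends on how documents are routed to shards at index time.

The reason is that Solr evaluates the join on a single replica, i.e. the statement select * from collection1 where id in (select single_s from collection1 where other_s contains 'aa') is evaluated within a single shard.

Since there are 2 shards, fetch the doc ids held by each of them: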

GET http://localhost:8981/solr/collection1/select?q=*:*&shards=shard1&fl=id

Result:

001

005

GET http://localhost:8981/solr/collection1/select?q=*:*&shards=shard2&fl=id

Result:

002

003

004

So the condition described above, where doc1.single_s=doc2.id and doc1.other_s='aa', is evaluated separately on each of the two shards.

select * from collection1.shard1 where id in (select single_s from collection1.shard1 where other_s contains 'aa') returns only 001

select * from collection1.shard2 where id in (select single_s from collection1.shard2 where other_s contains 'aa') returns only 003

The combined result is therefore 001 and 003.

How did 005 get lost?

On shard2, other_s:aa matched the doc with id 003, whose single_s is 005. Solr then looked for a doc with id 005 on shard2, found none, and so that record was dropped.
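The loss can be reproduced with a toy model in which each shard runs the join only over its own documents and the per-shard results are merged afterwards. This is a sketch of the behavior observed above, not Solr code; local_join is a made-up helper, and the shard assignment is hard-coded to the split reported by the two shards= queries.

```python
def local_join(docs, from_field, to_field, matches):
    """Run the join over one set of docs only, the way a single core would."""
    from_values = {d[from_field] for d in docs if matches(d) and from_field in d}
    return sorted(d[to_field] for d in docs if d.get(to_field) in from_values)


def is_aa(doc):
    return doc.get("other_s") == "aa"


all_docs = {
    "001": {"id": "001", "single_s": "001", "other_s": "aa"},
    "002": {"id": "002", "single_s": "003", "other_s": "aa"},
    "003": {"id": "003", "single_s": "005", "other_s": "aa"},
    "004": {"id": "004", "single_s": "004", "other_s": "bb"},
    "005": {"id": "005"},
}

# Shard contents as observed in this test run (routing may differ elsewhere).
shards = {"shard1": ["001", "005"], "shard2": ["002", "003", "004"]}

per_shard = {
    name: local_join([all_docs[i] for i in ids], "single_s", "id", is_aa)
    for name, ids in shards.items()
}
merged = sorted(i for ids in per_shard.values() for i in ids)
single_shard = local_join(list(all_docs.values()), "single_s", "id", is_aa)

print(per_shard)                                 # {'shard1': ['001'], 'shard2': ['003']}
print(merged)                                    # ['001', '003']
print(sorted(set(single_shard) - set(merged)))   # ['005']
```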

Example 1-3: from number to number

This behaves the same as Example 1-1. The index data is:

[

{"id":"001", "single_i":11,"single_2_i":11, "other_s":"aa"},

{"id":"002", "single_i":22,"single_2_i":33, "other_s":"aa"},

{"id":"003", "single_i":33,"single_2_i":77, "other_s":"aa"},

{"id":"004", "single_i":44,"single_2_i":44, "other_s":"bb"},

{"id":"005", "single_i":55,"single_2_i":66, "other_s":"aa"},

{"id":"006", "single_i":66}

]

Query:

GET http://localhost:8981/solr/collection1/select?fq={!join to=single_i from=single_2_i}other_s:aa&q=*:*&fl=id

001

003

006
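When a query like this is sent from code rather than from Postman, the {!join ...} local params contain braces, spaces and an exclamation mark, so the fq value should be URL-encoded. Below is a minimal sketch that only builds the URL; join_select_url is a hypothetical helper of my own, and its optional from_index argument corresponds to the fromIndex local param.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def join_select_url(base_url, collection, from_field, to_field, from_query, from_index=None):
    """Build a /select URL with a join filter query (hypothetical helper, sends nothing)."""
    local_params = f"to={to_field} from={from_field}"
    if from_index is not None:
        local_params = f"fromIndex={from_index} {local_params}"
    params = {
        "q": "*:*",
        "fq": "{!join " + local_params + "}" + from_query,
        "fl": "id",
    }
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


url = join_select_url(
    "http://localhost:8981", "collection1", "single_2_i", "single_i", "other_s:aa"
)
print(url)

# Decoding the query string gives back the readable form used in this post.
decoded = parse_qs(urlsplit(url).query)
print(decoded["fq"][0])
```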

Example 1-4: from string to number

Nothing is returned, and no error is raised.

Example 1-5: from number to string

An exception is thrown.

ERROR (qtp2048537720-22) [c:collection1 s:shard1 r:core_node3 x:collection1_shard1_replica_n1] o.a.s.s.HttpSolrCall null:java.lang.IllegalStateException: unexpected docvalues type SORTED for field 'single_s' (expected one of [SORTED_NUMERIC, NUMERIC]). Re-index with correct docvalues type.

at org.apache.lucene.index.DocValues.checkField(DocValues.java:317)

at org.apache.lucene.index.DocValues.getSortedNumeric(DocValues.java:389)

at org.apache.solr.search.join.GraphPointsCollector.doSetNextReader(GraphPointsCollector.java:50)

at org.apache.lucene.search.SimpleCollector.getLeafCollector(SimpleCollector.java:33)

at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:652)

at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445)

at org.apache.solr.search.JoinQuery$JoinQueryWeight.getDocSet(JoinQParserPlugin.java:387)

at org.apache.solr.search.JoinQuery$JoinQueryWeight.scorer(JoinQParserPlugin.java:311)

Example 1-6: from multi-valued string to single-valued string (single shard)

Preparing the index data

Run the index command:

POST http://localhost:8981/solr/collection1/update?commit=true&overwrite=true

[

{"id":"001", "multi_ss":["001", "002"]},

{"id":"002", "multi_ss":["003"]},

{"id":"003", "multi_ss":["006", "005"]},

{"id":"004", "multi_ss":["005"]},

{"id":"005"}

]

Test

GET http://localhost:8981/solr/collection1/select?fq={!join to=id from=multi_ss }*:*&q=*:*&fl=id

001

002

003

005

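The where clause is the same as in 1-1; the only difference is that multi_ss holds several values. For example, the multi_ss of the doc with id 001 points at two docs, 001 and 002.

Cross-Collection/Core Join

This actually comes from the same implementation class as the regular Join: org.apache.solr.search.JoinQParserPlugin.

The parameters it supports already include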

final String fromField = qparser.getParam("from");

final String fromIndex = qparser.getParam("fromIndex");

final String toField = qparser.getParam("to");

final String v = qparser.localParams.get(QueryParsing.V);

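It is just that when fromIndex is absent, it defaults to the current core (in SolrCloud, that is one replica of a shard).

So its behavior is the same as the regular Join.

Where it differs is that it places more requirements on the from collection (the to collection has none):

It must have a single shard. (Violating this behaves much like the regular Join above: no error, but the result is wrong, with fewer docs returned than there should be.)

Its shard must have a replica on every Solr node (more precisely, on every node that hosts a shard of the to collection); otherwise the query fails with an error!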

The error message looks like this:

"error":{

"metadata":[

"error-class",

"org.apache.solr.common.SolrException",

"root-error-class",

"org.apache.solr.common.SolrException",

"error-class",

"org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException",

"root-error-class",

"org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException"

],

"msg":"Error from server at null: SolrCloud join: No active replicas for collection1 found in node 172.23.0.3:8983_solr",

"code":400

}

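Why doesn't the regular Join have the second requirement?

Because once a shard has been handed a join query, that shard is by definition already present on the current node.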
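The two requirements on the from collection can be written down as a small pre-flight check. This is only a sketch of the rules as I described them above, not something Solr exposes; check_cross_collection_join and the layout dictionaries (shard name mapped to the set of nodes holding a replica) are hypothetical.

```python
def check_cross_collection_join(from_layout, to_layout):
    """Return a list of problems; an empty list means both requirements hold.

    Each layout maps a shard name to the set of node names hosting one of its replicas.
    """
    problems = []
    if len(from_layout) != 1:
        problems.append("the from collection must have exactly one shard")
    from_nodes = set().union(*from_layout.values())
    to_nodes = set().union(*to_layout.values())
    for node in sorted(to_nodes - from_nodes):
        problems.append(f"no replica of the from collection on node {node}")
    return problems


# The to collection has one shard on each of the two nodes.
to_layout = {"shard1": {"solr1"}, "shard2": {"solr2"}}

# replicationFactor=1: the from collection lives on one node only.
print(check_cross_collection_join({"shard1": {"solr1"}}, to_layout))

# replicationFactor=2: one replica on each node, so both requirements hold.
print(check_cross_collection_join({"shard1": {"solr1", "solr2"}}, to_layout))
```

The first call reports the missing replica on solr2, which is the situation behind the "No active replicas" error shown above; the second call returns an empty list.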

Example 2-1: from multi-valued string to single-valued string (from single shard to multi-shard)

Creating new collections

Rebuild the environment and create two collections.

POST http://localhost:8981/solr/admin/collections?action=CREATE&name=collection1&numShards=1&replicationFactor=2

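We have now created a collection with the following properties:

name: collection1

shard num: 1

replication num: 2 (per shard)

Since we started exactly 2 Solr nodes, and replicas are spread evenly across nodes by default at creation time, collection1 ends up with one replica of its shard on each of the two Solr nodes.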

POST http://localhost:8981/solr/admin/collections?action=CREATE&name=collection2&numShards=2&replicationFactor=1

We have now created a new collection with the following properties:

name: collection2

shard num: 2

replication num: 1 (per shard)

We now have 2 collections:

collection1: single shard, 2 replicas/shard

collection2: multi-shard, 1 replica/shard

Preparing the index data

Run the index commands:

POST http://localhost:8981/solr/collection1/update?commit=true&overwrite=true

[

{"id":"b001", "multi_ss":["001", "002"]},

{"id":"b002", "multi_ss":["003"]},

{"id":"c003", "multi_ss":["006", "005"]}

]

POST http://localhost:8981/solr/collection2/update?commit=true&overwrite=true

[

{"id":"001"},

{"id":"002"},

{"id":"003"},

{"id":"004"},

{"id":"005"}

]

Test

GET http://localhost:8981/solr/collection2/select?fq={!join fromIndex=collection1 to=id from=multi_ss}id:b00*&q=*:*&fl=id

Translated, this Join is roughly:

select * from collection2 where id in (select multi_ss from collection1 where id like 'b00%')

001

002

003
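This result can be checked with a toy model of the cross-collection case: because the whole from collection is available on every node, each shard of collection2 can resolve the join against all of collection1, and merging the per-shard results loses nothing. The sketch below is not Solr code; cross_join is a made-up helper, and the way collection2's ids are split across its two shards is an assumption for illustration only.

```python
from fnmatch import fnmatchcase


def cross_join(from_docs, to_docs, from_field, to_field, id_pattern):
    """Collect from_field values of from_docs whose id matches, then filter to_docs."""
    from_values = set()
    for doc in from_docs:
        if fnmatchcase(doc["id"], id_pattern):
            from_values.update(doc.get(from_field, []))
    return [d[to_field] for d in to_docs if d[to_field] in from_values]


# collection1: single shard, fully present on both nodes.
collection1 = [
    {"id": "b001", "multi_ss": ["001", "002"]},
    {"id": "b002", "multi_ss": ["003"]},
    {"id": "c003", "multi_ss": ["006", "005"]},
]

# collection2: two shards; this particular split is assumed, not measured.
collection2_shards = {
    "shard1": [{"id": "001"}, {"id": "005"}],
    "shard2": [{"id": "002"}, {"id": "003"}, {"id": "004"}],
}

# Equivalent of fq={!join fromIndex=collection1 to=id from=multi_ss}id:b00*
merged = sorted(
    doc_id
    for shard_docs in collection2_shards.values()
    for doc_id in cross_join(collection1, shard_docs, "multi_ss", "id", "b00*")
)
print(merged)  # ['001', '002', '003']
```

The doc c003 does not match id:b00*, so its values 006 and 005 never enter the join, which is why 005 is absent from the result.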

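Joins across multiple collections/Cores

You are bound to ask: what about chained hops? Can I search collection1, join to collection2, and then join from collection2 to collection3?

Of course you can.

But that goes beyond what a single Join statement can support.

You need the subQuery feature.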
