apache-spark - Accessing Spark MLlib Bisecting K-means tree data

Tags: apache-spark apache-spark-mllib

Looking over the source code for Bisecting K-means, it seems that the algorithm builds an internal tree representation of the cluster assignments at each level as it progresses. Is it possible to get access to that tree? The built-in methods only give the cluster assignment at the leaves, not at the internal nodes.
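
To make concrete what "the tree" means here, below is a minimal pure-Python sketch of bisecting k-means that keeps every split as a node. It is not Spark's implementation and calls no Spark API. The names `Node`, `bisecting_kmeans`, `leaf_centers`, and `path_to_leaf` are hypothetical, and both the deterministic seeding and the rule "split the leaf with the largest cost" are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, ...]


@dataclass(eq=False)  # compare nodes by identity, not by contents
class Node:
    """One node of the cluster hierarchy (hypothetical type, not a Spark class)."""
    center: Point
    points: List[Point]
    children: List["Node"] = field(default_factory=list)


def mean(points: List[Point]) -> Point:
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cost(node: Node) -> float:
    """Sum of squared distances from the node's points to its center."""
    return sum(sq_dist(p, node.center) for p in node.points)


def two_means(points: List[Point], iters: int = 20):
    """Split points into two groups with a few Lloyd iterations."""
    # Assumed seeding: the point farthest from the mean, then the point
    # farthest from that one. Deterministic, so no random seed is needed.
    c = mean(points)
    a = max(points, key=lambda p: sq_dist(p, c))
    b = max(points, key=lambda p: sq_dist(p, a))
    left, right = [], []
    for _ in range(iters):
        left = [p for p in points if sq_dist(p, a) <= sq_dist(p, b)]
        right = [p for p in points if sq_dist(p, a) > sq_dist(p, b)]
        if not left or not right:
            break
        new_a, new_b = mean(left), mean(right)
        if (new_a, new_b) == (a, b):
            break
        a, b = new_a, new_b
    return left, right


def bisecting_kmeans(points: List[Point], k: int) -> Node:
    """Build the hierarchy top-down until there are k leaves (or no split is possible)."""
    root = Node(mean(points), list(points))
    leaves = [root]
    while len(leaves) < k:
        # Only leaves with at least two distinct points can be split.
        candidates = [n for n in leaves if len(set(n.points)) > 1]
        if not candidates:
            break
        target = max(candidates, key=cost)
        left, right = two_means(target.points)
        if not left or not right:
            break
        target.children = [Node(mean(left), left), Node(mean(right), right)]
        leaves.remove(target)
        leaves.extend(target.children)
    return root


def leaf_centers(node: Node) -> List[Point]:
    """Centers of the final clusters, i.e. what a flat model would report."""
    if not node.children:
        return [node.center]
    return [c for child in node.children for c in leaf_centers(child)]


def path_to_leaf(root: Node, point: Point) -> List[Point]:
    """Centers visited from the root down to the leaf nearest to `point`."""
    path, node = [root.center], root
    while node.children:
        node = min(node.children, key=lambda ch: sq_dist(point, ch.center))
        path.append(node.center)
    return path


if __name__ == "__main__":
    data = [(0.0,), (1.0,), (10.0,), (11.0,), (50.0,)]
    tree = bisecting_kmeans(data, k=3)
    print(leaf_centers(tree))          # leaf-level view only
    print(path_to_leaf(tree, (9.0,)))  # assignment at every level of the tree
```

`leaf_centers` corresponds to the flat, leaf-only result described above, while `path_to_leaf` is the per-level information the question is asking for.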

Answer

Following up on this: has anyone modified the Spark ML source code so that it stores and returns the hierarchical clustering tree structure?

I found a GitHub repo with an introduction to MLlib 1.6's implementation of bisecting k-means clustering: https://github.com/yu-iskw/bisecting-kmeans-blog/blob/master/blog-article.md

In the section "What's Next?", the first JIRA ticket, [SPARK-11664] "Add methods to get bisecting k-means cluster structure" (https://issues.apache.org/jira/browse/SPARK-11664), appears to be the request to expose the hierarchical cluster tree structure as a built-in feature. As of today, the ticket's status is marked as "Resolved".

However, in the latest Spark MLlib release we checked (2.4.4), documented at the links below, we did not find this tree structure, or a dendrogram, as a built-in output:

PySpark MLlib 2.4.4 official documentation:
https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.clustering.BisectingKMeans
https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.clustering.BisectingKMeansModel

Scala MLlib 2.4.4 official documentation:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeans
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeansModel

We also looked into the source code, and it does not seem to expose the hierarchical tree structure as a built-in output.

If the hierarchical clustering tree structure is not available from BisectingKMeans in Spark MLlib 2.4.4, does anyone know of a modified version of the source code that makes the tree structure available?

Thanks!
