java - How to transform Dataset<Tuple2<String,DeviceData>> to Iterator<DeviceData>

Tags: java apache-spark apache-spark-2.0 apache-spark-dataset

I have a Dataset<Tuple2<String,DeviceData>> and want to transform it into an Iterator<DeviceData>.

Below is my code, where I call collectAsList() and then build the Iterator<DeviceData> from the result.

Dataset<Tuple2<String,DeviceData>> ds = ...;
List<Tuple2<String, DeviceData>> listTuple = ds.collectAsList();

ArrayList<DeviceData> myDataList = new ArrayList<DeviceData>();
for(Tuple2<String, DeviceData> tuple : listTuple){
    myDataList.add(tuple._2());
}

Iterator<DeviceData> myitr = myDataList.iterator();

I cannot use collectAsList() because my data is huge and collecting everything on the driver will hurt performance. I looked through the Dataset API and searched online but couldn't find a solution. Can someone please guide me? A solution in Java would be great. Thanks.

EDIT:

The DeviceData class is a simple JavaBean. Here is the printSchema() output for ds.

root
 |-- value: string (nullable = true)
 |-- _2: struct (nullable = true)
 |    |-- deviceData: string (nullable = true)
 |    |-- deviceId: string (nullable = true)
 |    |-- sNo: integer (nullable = true)
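
Answer

You can extract DeviceData directly from ds with a map, instead of collecting the tuples and rebuilding a list on the driver. The result is a Dataset<DeviceData> that stays distributed.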

Java:

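import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoders;

// Dataset.map in the Java API takes a MapFunction plus an explicit Encoder
MapFunction<Tuple2<String, DeviceData>, DeviceData> mapDeviceData =
    new MapFunction<Tuple2<String, DeviceData>, DeviceData>() {
      @Override
      public DeviceData call(Tuple2<String, DeviceData> tuple) {
        return tuple._2();
      }
    };

// Extracts DeviceData from each record; Encoders.bean fits because DeviceData is a JavaBean
Dataset<DeviceData> ddDS = ds.map(mapDeviceData, Encoders.bean(DeviceData.class));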

Scala:

// Requires an implicit Encoder[DeviceData] in scope
val ddDS = ds.map(_._2) // equivalent to ds.map(row => row._2)
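
If an Iterator<DeviceData> is still needed on the driver, Spark's Dataset.toLocalIterator() returns a java.util.Iterator over the rows without first collecting them all into one list, so calling it on the mapped dataset may be enough. If you iterate the tuple dataset directly instead, the values can be unwrapped lazily rather than copied into an ArrayList. The following is only a rough, Spark-free sketch of that wrapper: SecondIterator is a hypothetical helper name, and Map.Entry stands in for Tuple2 so the snippet runs with the JDK alone.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical helper: lazily exposes the second component of each pair.
// Map.Entry stands in for scala.Tuple2 so that this compiles without Spark.
public class SecondIterator<K, V> implements Iterator<V> {
    private final Iterator<? extends Map.Entry<K, V>> source;

    public SecondIterator(Iterator<? extends Map.Entry<K, V>> source) {
        this.source = source;
    }

    @Override
    public boolean hasNext() {
        return source.hasNext();
    }

    @Override
    public V next() {
        // Unwrap one element at a time; nothing is copied into a list.
        return source.next().getValue();
    }

    public static void main(String[] args) {
        // With Spark, the source would be ds.toLocalIterator()
        // and getValue() would be tuple._2().
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<String, String>("k1", "device-1"));
        pairs.add(new SimpleEntry<String, String>("k2", "device-2"));

        Iterator<String> it = new SecondIterator<>(pairs.iterator());
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

The wrapper holds only a reference to the underlying iterator, so its memory use is bounded by whatever the source iterator itself buffers.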
