Most efficient way to access binary files on ADLS from worker node in PySpark?

Tags: python apache-spark pyspark azure-data-lake

I have deployed an Azure HDInsight cluster with rwx permissions on all directories of the Azure Data Lake Store that also serves as its storage account. On the head node, I can load image data (for example) from ADLS with a command like:

my_rdd = sc.binaryFiles('adl://{}.azuredatalakestore.net/my_file.png')

Workers do not have access to the SparkContext's binaryFiles() function. I can use the azure-datalake-store Python SDK to load the file, but this seems to be much slower, presumably because it realizes none of the benefits of the cluster's association with ADLS.
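For reference, the SDK-based load being compared looks roughly like the sketch below; the tenant, client, secret, and store names are placeholders, and each worker has to authenticate and stream the file itself instead of going through the cluster's built-in adl:// integration:

from azure.datalake.store import core, lib

# Service-principal authentication; all credential values are placeholders.
token = lib.auth(tenant_id='<tenant-id>', client_id='<client-id>', client_secret='<client-secret>')
adls = core.AzureDLFileSystem(token, store_name='<store-name>')

# Read one image's bytes directly from ADLS on the worker.
with adls.open('/my_file.png', 'rb') as f:
    data = f.read()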

Is there a faster way to load files from an associated ADLS on workers?

Further context if needed:

I am using PySpark to apply a trained deep learning model to a large collection of images. Since the model takes a long time to load, my ideal would be:

  • Send each worker a partial list of image URIs to process (by applying mapPartitions() to an RDD containing the full list)
  • Have the worker load data for one image at a time for scoring with the model
  • Return the model's results for the set of images

Since I don't know how to load the images efficiently on workers, my best bet at the moment is to partition an RDD containing the image byte data, which (I assume) is memory-inefficient and creates a bottleneck by having the head node do all of the data loading.
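To make the intended workflow concrete, a rough sketch of the mapPartitions() structure is below; load_model() and load_image_bytes() are hypothetical stand-ins for the actual model loader and the per-image ADLS read that this question is asking about, and list_of_image_uris and the partition count are likewise placeholders:

def score_partition(uris):
    model = load_model()                  # hypothetical: load the model once per partition
    for uri in uris:
        data = load_image_bytes(uri)      # hypothetical: fetch one image's bytes from ADLS
        yield (uri, model.predict(data))  # hypothetical scoring call

uri_rdd = sc.parallelize(list_of_image_uris, numSlices=32)  # partition count is arbitrary here
results = uri_rdd.mapPartitions(score_partition).collect()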

Answer

The primary storage of the HDInsight cluster is simply available as the HDFS root: because ADLS is the cluster's default file system, paths without an adl:// prefix resolve to it.

hdfs dfs -ls /user/digdug/images/
Found 3 items
-rw-r--r--   1    digdug supergroup       4957 2017-01-24 07:59 /user/digdug/images/a.png
-rw-r--r--   1    digdug supergroup       4957 2017-01-24 07:59 /user/digdug/images/b.png
-rw-r--r--   1    digdug supergroup       1945 2017-01-24 08:01 /user/digdug/images/c.png

In pyspark:

# Each element of the RDD is a (path, content) pair, where content is the file's bytes.
rdd = sc.binaryFiles("/user/digdug/images")

# For each partition, return the size in bytes of every file in it.
def f(iterator):
    sizes = []
    for path, content in iterator:
        sizes.append(len(content))
    return sizes

rdd.mapPartitions(f).collect()

outputs:

[4957, 4957, 1945]
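Because binaryFiles() distributes the reads across the executors, the same pattern can be combined with the per-partition model loading described in the question. In the rough sketch below, load_model() and model.predict() are hypothetical placeholders for the actual scoring code:

# Reads are performed by the executors, not the head node.
rdd = sc.binaryFiles("/user/digdug/images")

def score_partition(pairs):
    model = load_model()                      # hypothetical: load the model once per partition
    for path, content in pairs:
        yield (path, model.predict(content))  # content is the raw bytes of one image

results = rdd.mapPartitions(score_partition).collect()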
