Machine learning models are often composed of pipelines of transformations. While this design makes it easy to efficiently execute individual model components at training time, prediction serving has different requirements, such as low latency, high throughput, and graceful performance degradation under heavy load. Current prediction-serving systems treat models as black boxes, where prediction-time-specific optimizations are ignored in favor of ease of deployment. In this paper, we present PRETZEL, a prediction-serving system that introduces a novel white-box architecture enabling end-to-end and multi-model optimizations. Using production-like model pipelines, our experiments show that PRETZEL delivers performance improvements along several dimensions: compared to state-of-the-art approaches, PRETZEL on average reduces 99th-percentile latency by 5.5x while reducing memory footprint by 25x and increasing throughput by 4.7x.
Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g., declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g., machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g., schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.
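The "composable rules" at the heart of an extensible optimizer can be illustrated with a toy sketch: an optimizer is just a list of functions that each rewrite an expression tree, applied until a fixed point is reached. This is a minimal pure-Python illustration of the idea, not Catalyst itself, which is written in Scala and operates on Spark SQL logical plans.

```python
# Toy rule-based optimizer: expressions are tuples like ("+", a, b),
# and each rule is a function that rewrites one pattern.

def simplify_add_zero(expr):
    # ("+", x, 0) -> x
    if isinstance(expr, tuple) and expr[0] == "+" and expr[2] == 0:
        return expr[1]
    return expr

def fold_constants(expr):
    # ("+", 2, 3) -> 5
    if (isinstance(expr, tuple) and expr[0] == "+"
            and isinstance(expr[1], int) and isinstance(expr[2], int)):
        return expr[1] + expr[2]
    return expr

RULES = [simplify_add_zero, fold_constants]

def optimize(expr):
    """Apply every rule bottom-up until the tree stops changing."""
    while True:
        new = expr
        if isinstance(new, tuple):
            new = (new[0],) + tuple(optimize(a) for a in new[1:])
        for rule in RULES:
            new = rule(new)
        if new == expr:
            return expr
        expr = new

plan = ("+", ("+", 2, 3), 0)   # (2 + 3) + 0
print(optimize(plan))          # 5
```

Adding a new optimization is then a matter of appending one more function to the rule list, which is the extensibility property the abstract emphasizes.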
This report describes 18 projects that explored how commercial cloud computing services could be used for scientific computing at national laboratories. These demonstrations included deploying proprietary software in a cloud environment and leveraging established cloud-based analytics workflows to process scientific datasets. In aggregate, the projects were highly successful, and collectively they demonstrate that cloud computing can be a valuable computational resource for scientific computing at national laboratories.
TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous "parameter server" designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with a focus on training and inference on deep neural networks. Several Google services use TensorFlow in production, we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model and demonstrate the compelling performance that TensorFlow achieves for several real-world applications.
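The dataflow model the abstract describes can be sketched in a few lines: computation is a graph of nodes, edges carry values, and a shared subgraph is evaluated once. This toy evaluator is a hypothetical illustration of the execution model only, not TensorFlow's API.

```python
# Minimal dataflow-graph evaluator: a Node wraps an operation and its
# upstream inputs; evaluation memoizes shared subgraphs.

class Node:
    def __init__(self, op, *inputs):
        self.op = op          # a Python callable
        self.inputs = inputs  # upstream Node objects

def evaluate(node, cache=None):
    """Recursively evaluate a node, computing each shared node once."""
    if cache is None:
        cache = {}
    if node in cache:
        return cache[node]
    args = [evaluate(inp, cache) for inp in node.inputs]
    cache[node] = node.op(*args)
    return cache[node]

# y = (a + b) * (a + b): the shared (a + b) node is computed once.
a = Node(lambda: 3.0)
b = Node(lambda: 4.0)
s = Node(lambda x, y: x + y, a, b)
y = Node(lambda x, y: x * y, s, s)
print(evaluate(y))  # 49.0
```

The same graph structure is what lets a real system place nodes on different devices and machines: the edges make all data dependencies explicit.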
The growth of data volumes in industry and research poses tremendous opportunities, as well as tremendous computational challenges. As data sizes have outpaced the capabilities of single machines, users have needed new systems to scale out computations to multiple nodes. As a result, there has been an explosion of new cluster programming models targeting diverse computing workloads. 1,4,7,10 At first, these models were relatively specialized, with new models developed for new workloads; for example, MapReduce 4 supported batch processing, but Google also developed Dremel 13 for interactive SQL queries and Pregel 11 for iterative graph algorithms. In the open source Apache Hadoop stack, systems like Storm 1 and Impala 9 are also specialized. Even in the relational database world, the trend has been to move away from "one-size-fits-all" systems. 18 Unfortunately, most big data applications need to combine many different processing types. The very nature of "big data" is that it is diverse and messy; a typical pipeline will need MapReduce-like code for data loading, SQL-like queries, and iterative machine learning. Specialized engines can thus create both complexity and inefficiency; users must stitch together disparate systems, and some applications simply cannot be expressed efficiently in any engine. In 2009, our group at the University of California, Berkeley, started the Apache Spark project to design a unified engine for distributed data processing. Spark has a programming model similar to MapReduce but extends it with a data-sharing abstraction called "Resilient Distributed Datasets," or RDDs. 25 Using this simple extension, Spark can capture a wide range of processing workloads that previously needed separate engines, including SQL, streaming, machine learning, and graph processing 2,26,6 (see Figure 1).
These implementations use the same optimizations as specialized engines (such as column-oriented processing and incremental updates) and achieve similar performance but run as libraries over a common engine, making them easy and efficient to compose.
Key insights: A simple programming model can capture streaming, batch, and interactive workloads and enable new applications that combine them. Apache Spark applications range from finance to scientific data processing and combine libraries for SQL, machine learning, and graphs. In six years, Apache Spark has grown to 1,000 contributors and thousands of deployments.
The proliferation of massive datasets combined with the development of sophisticated analytical techniques has enabled a wide variety of novel applications such as improved product recommendations, automatic image tagging, and improved speech-driven interfaces. A major obstacle to supporting these predictive applications is the challenging and expensive process of identifying and training an appropriate predictive model. Recent efforts aiming to automate this process have focused on single-node implementations and have assumed that model training itself is a black box, limiting their usefulness for applications driven by large-scale datasets. In this work, we build upon these recent efforts and propose an architecture for automatic machine learning at scale comprising a cost-based cluster resource allocation estimator, advanced hyper-parameter tuning techniques, bandit resource allocation via runtime algorithm introspection, and physical optimization via batching and optimal resource allocation. The result is TUPAQ, a component of the MLbase system that automatically finds and trains models for a user's predictive application with comparable quality to those found using exhaustive strategies, but an order of magnitude more efficiently than the standard baseline approach. TUPAQ scales to models trained on terabytes of data across hundreds of machines.
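The bandit-style resource allocation mentioned above can be sketched with successive halving: train every candidate a little, drop the worst half, and double the budget for the survivors. This illustrates the general technique, not TUPAQ's actual algorithm; the "models" below are just numbers whose observed score becomes more accurate as the budget grows.

```python
# Toy successive-halving allocator over hypothetical model configs.
import math

def successive_halving(candidates, score, budget=8):
    """candidates: list of configs; score(config, budget) -> observed quality."""
    pool = list(candidates)
    while len(pool) > 1:
        ranked = sorted(pool, key=lambda c: score(c, budget), reverse=True)
        pool = ranked[: max(1, len(pool) // 2)]  # keep the better half
        budget *= 2                              # survivors get more resources
    return pool[0]

def noisy_score(quality, budget):
    # deterministic pseudo-noise that shrinks as the training budget grows
    return quality + math.sin(quality * 100.0) / (4.0 * budget)

best = successive_halving([0.2, 0.5, 0.9, 0.4], noisy_score)
print(best)  # 0.9: the shrinking noise never flips the ranking here
```

The appeal for large-scale tuning is that total training cost grows roughly linearly in the number of candidates, instead of every candidate receiving the full budget.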
Machine learning workflow development is a process of trial and error: developers iterate on a workflow by testing small modifications until the desired accuracy is reached. Unfortunately, existing machine learning systems focus narrowly on model training, which accounts for only a small fraction of total development time, and ignore the problem of iterative development. We propose Helix, a machine learning system that optimizes execution across iterations by intelligently caching and reusing, or recomputing, intermediates as appropriate. Helix captures a wide variety of application needs in its Scala DSL, whose succinct syntax defines a unified process for data preprocessing, model specification, and learning. We demonstrate that the reuse problem can be cast as a max-flow problem, while the caching problem is NP-hard, and we develop effective lightweight heuristics for the latter. Empirical evaluation shows that Helix is not only able to handle a wide variety of use cases in one unified workflow but is also much faster, providing run-time reductions of up to 19x over state-of-the-art systems such as DeepDive and KeystoneML on four real-world applications in natural language processing, computer vision, and the social and natural sciences.
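The per-intermediate reuse decision can be sketched greedily: load a previously materialized result when loading is cheaper than recomputing it, otherwise recompute. The node names and costs below are hypothetical, and Helix's actual formulation (a reduction to max-flow over the workflow DAG) is considerably more involved; this only conveys the cost comparison at the core of the decision.

```python
# Toy reuse planner: for each intermediate, compare the cost of
# recomputing it against the cost of loading a cached copy (None means
# no valid cached copy exists, e.g. because an input changed).

def plan_reuse(nodes):
    """nodes: dict name -> (compute_cost, load_cost or None)."""
    decisions = {}
    for name, (compute_cost, load_cost) in nodes.items():
        if load_cost is not None and load_cost < compute_cost:
            decisions[name] = "load"
        else:
            decisions[name] = "compute"
    return decisions

nodes = {
    "tokenized_text": (120.0, 5.0),   # cached and cheap to load -> reuse
    "features": (40.0, 60.0),         # cached but loading costs more -> recompute
    "model": (300.0, None),           # changed this iteration -> must recompute
}
print(plan_reuse(nodes))
```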
scikit-learn is an increasingly popular machine learning library. Written in Python, it is designed to be simple and efficient, accessible to non-experts, and reusable in various contexts. In this paper, we present and discuss our design choices for the application programming interface (API) of the project. In particular, we describe the simple and elegant interface shared by all learning and processing units in the library and then discuss its advantages in terms of composition and reusability. The paper also comments on implementation details specific to the Python ecosystem and analyzes obstacles faced by users and developers of the library.
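The shared interface discussed in the paper boils down to a small convention: estimators take hyper-parameters in the constructor, learn state in fit(X, y), and expose predict (or transform), with learned attributes suffixed by an underscore. A minimal pure-Python estimator following that convention (hypothetical, and requiring no scikit-learn):

```python
# MeanRegressor: predicts the training-set mean, optionally shifted.

class MeanRegressor:
    def __init__(self, offset=0.0):   # hyper-parameters only; no work here
        self.offset = offset

    def fit(self, X, y):              # learned attributes end in "_"
        self.mean_ = sum(y) / len(y)
        return self                   # returning self enables chaining

    def predict(self, X):
        return [self.mean_ + self.offset for _ in X]

model = MeanRegressor(offset=1.0).fit([[0], [1], [2]], [1.0, 2.0, 3.0])
print(model.predict([[5], [6]]))  # [3.0, 3.0]
```

Because every estimator obeys the same contract, generic tools (pipelines, grid search, cross-validation) can operate on any of them without knowing their internals, which is the composition benefit the abstract refers to.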
The World Wide Web has grown to be a primary source of information for millions of people. Due to the size of the Web, search engines have become the major access point for this information. However, "commercial" search engines use hidden algorithms that put the integrity of their results in doubt, collect user data that raises privacy concerns, and target the general public, thus failing to serve the needs of specific search users. Open source search, like open source operating systems, offers alternatives. The goal of the Open Source Information Retrieval Workshop (OSIR) is to bring together practitioners developing open source search technologies in the context of a premier IR research conference to share their recent advances, and to coordinate their strategy and research plans. The intent is to foster community-based development, to promote distribution of transparent Web search tools, and to strengthen the interaction with the research community in IR. A workshop about Open Source Web Information Retrieval was held last year in Compiègne, France as part of WI 2005. The focus of this workshop is broadened to the whole open source information retrieval community. We want to thank all the authors of the submitted papers, the members of the program committee, and the several reviewers whose contributions have resulted in these high quality proceedings.
ABSTRACT: There has been a resurgence of interest in index maintenance (or incremental indexing) in the academic community in the last three years. Most of this work focuses on how to build indexes as quickly as possible, given the need to run queries during the build process. This work is based on a different set of assumptions than previous work. First, we focus on latency instead of throughput.
We focus on reducing index latency (the amount of time between when a new document is available to be indexed and when it is available to be queried) and query latency (the amount of time that an incoming query must wait because of index processing). Additionally, we assume that users are unwilling to tune parameters to make the system more efficient. We show how this set of assumptions has driven the development of the Indri index maintenance strategy, and describe the details of our implementation.
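The index-latency goal can be made concrete with a toy in-memory inverted index in which a document becomes queryable the moment add_document returns, so index latency is bounded by a single in-memory update. This sketches incremental indexing in general, not Indri's actual design.

```python
# Toy incremental inverted index: term -> set of document ids.
from collections import defaultdict

class IncrementalIndex:
    def __init__(self):
        self.postings = defaultdict(set)

    def add_document(self, doc_id, text):
        # Each term's posting set is updated in place; no batch rebuild,
        # so the document is immediately visible to queries.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def query(self, term):
        return sorted(self.postings[term.lower()])

idx = IncrementalIndex()
idx.add_document(1, "open source search")
idx.add_document(2, "source code")
print(idx.query("source"))  # [1, 2]
```

A real system must additionally spill postings to disk and merge them, which is exactly where the latency/throughput trade-offs the paper studies appear.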
Dryad is a general-purpose distributed execution engine for coarse-grain data-parallel applications. A Dryad application combines computational "vertices" with communication "channels" to form a dataflow graph. Dryad runs the application by executing the vertices of this graph on a set of available computers, communicating as appropriate through files, TCP pipes, and shared-memory FIFOs. The vertices provided by the application developer are quite simple and are usually written as sequential programs with no thread creation or locking. Concurrency arises from Dryad scheduling vertices to run simultaneously on multiple computers, or on multiple CPU cores within a computer. The application can discover the size and placement of data at run time, and modify the graph as the computation progresses to make efficient use of the available resources. Dryad is designed to scale from powerful multi-core single computers, through small clusters of computers, to data centers with thousands of computers. The Dryad execution engine handles all the difficult problems of creating a large distributed, concurrent application: scheduling the use of computers and their CPUs, recovering from communication or computer failures, and transporting data between vertices.
We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.
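The coarse-grained-transformation idea can be sketched in miniature: a dataset records the lineage that produced it rather than eagerly holding results, so a lost result can be recomputed by replaying the lineage instead of being replicated. This is purely illustrative; real RDDs are partitioned and distributed across a cluster.

```python
# Toy lineage-tracking dataset: transformations build closures, and
# nothing executes until an action (collect) is called.

class ToyRDD:
    def __init__(self, compute):
        self._compute = compute       # lineage: how to (re)build the data

    @staticmethod
    def parallelize(data):
        return ToyRDD(lambda: list(data))

    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self._compute()])

    def filter(self, p):
        return ToyRDD(lambda: [x for x in self._compute() if p(x)])

    def collect(self):
        # Losing a cached result would be harmless: _compute can always
        # replay the recorded transformations from the source data.
        return self._compute()

rdd = (ToyRDD.parallelize(range(6))
       .map(lambda x: x * x)
       .filter(lambda x: x % 2 == 0))
print(rdd.collect())  # [0, 4, 16]
```

Because only the (small) transformation descriptions are logged, fault tolerance costs far less than checkpointing or replicating the data itself, which is the paper's central efficiency argument.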
Predictive models based on machine learning can be highly sensitive to data error. Training data are often combined from a variety of different sources, each susceptible to different types of inconsistencies, and as new data stream in during prediction time, the model may encounter previously unseen inconsistencies. An important class of such inconsistencies are domain value violations that occur when an attribute value is outside of an allowed domain. We explore automatically detecting and repairing such violations by leveraging the often available clean test labels to determine whether a given detection and repair combination will improve model accuracy. We present BoostClean, which automatically selects an ensemble of error detection and repair combinations using statistical boosting. BoostClean selects this ensemble from an extensible library that is pre-populated with general detection functions, including a novel detector based on the Word2Vec deep learning model, which detects errors across a diverse set of domains. Our evaluation on a collection of 12 datasets from Kaggle, the UCI repository, real-world data analyses, and production datasets shows that BoostClean can increase absolute prediction accuracy by up to 9% over the best non-ensembled alternatives. Our optimizations, including parallelism, materialization, and indexing techniques, show a 22.2× end-to-end speedup on a 16-core machine.
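The selection idea can be sketched as follows: given a library of candidate repairs, keep the one that most improves accuracy on clean held-out labels. The detector/repair functions and data below are hypothetical stand-ins, and the selection is a single greedy pick rather than BoostClean's statistical boosting over an ensemble.

```python
# Toy repair selection against held-out labels; age of -1 is a
# domain-value violation of the kind the abstract describes.

def accuracy(model, rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)

def best_repair(model, rows, labels, combos):
    """combos: list of (name, repair_fn); repair_fn maps rows to cleaned rows."""
    baseline = accuracy(model, rows, labels)
    scored = [(accuracy(model, fix(rows), labels), name) for name, fix in combos]
    top_acc, top_name = max(scored)
    return (top_name, top_acc) if top_acc > baseline else (None, baseline)

def impute_mean(rs):
    # repair negative ages with the mean of the valid ages
    valid = [r["age"] for r in rs if r["age"] >= 0]
    mean = sum(valid) / len(valid)
    return [{**r, "age": r["age"] if r["age"] >= 0 else mean} for r in rs]

rows = [{"age": 30}, {"age": -1}, {"age": 50}]
labels = [0, 1, 1]
model = lambda r: 1 if r["age"] >= 40 else 0
combos = [
    ("impute_mean", impute_mean),
    ("clamp_to_zero", lambda rs: [{**r, "age": max(r["age"], 0)} for r in rs]),
]
print(best_repair(model, rows, labels, combos))  # ('impute_mean', 1.0)
```

Note that the two repairs disagree on the corrupted row, and only the held-out labels reveal which one actually helps the downstream model, which is the core insight the abstract leverages.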
Under several emerging application scenarios, such as in smart cities, operational monitoring of large infrastructure, wearable assistance, and the Internet of Things, continuous data streams must be processed under very short delays. Several solutions, including multiple software engines, have been developed for processing unbounded data streams in a scalable and efficient manner. More recently, architectures have been proposed that use edge computing for data stream processing. This paper surveys the state of the art in stream processing engines and mechanisms for exploiting resource elasticity features of cloud computing in stream processing. Resource elasticity allows an application or service to scale out/in according to fluctuating demands. Although such features have been extensively investigated for enterprise applications, stream processing poses challenges for achieving elastic systems that can make efficient resource management decisions based on current load. Elasticity becomes even more challenging in highly distributed environments comprising edge and cloud computing resources. This work examines some of these challenges and discusses solutions proposed in the literature to address them.
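The scale-out/in decision at the heart of resource elasticity can be sketched as a simple threshold policy with a hysteresis band to avoid oscillation. The target and the policy are illustrative assumptions, not taken from any engine the survey covers.

```python
# Toy elasticity policy: choose a replica count for the observed load.
import math

def rescale(replicas, load, target_per_replica=100.0):
    """Return the new replica count for the observed total load."""
    needed = max(1, math.ceil(load / target_per_replica))
    if needed > replicas:
        return needed            # overloaded: scale out to meet demand
    if needed < replicas - 1:
        return needed + 1        # underloaded: scale in, keeping one spare
    return replicas              # within the hysteresis band: do nothing

print(rescale(2, 350.0))  # 4
print(rescale(4, 90.0))   # 2
```

Real stream processors face the harder parts the survey discusses: migrating operator state to new replicas, and making these decisions across heterogeneous edge and cloud resources.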
Many popular systems, most notably Google's TensorFlow, have been implemented from the ground up to support machine learning tasks. We consider how a small set of changes to a modern relational database management system (RDBMS) can make it suitable for distributed learning computations. The changes include better support for recursion, and for optimizing and executing very large computation plans. We also show that using an RDBMS as a machine learning platform has key advantages. In particular, learning based on a database management system allows trivial scaling to large datasets, and in particular to large models, where different computational units operate on different parts of a model that may be too large to fit in RAM.
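A limited form of the recursion the authors call for already exists in many RDBMSs as recursive common table expressions. The small sqlite3 example below, which iterates x ← x / 2 entirely inside SQL, hints at why recursion matters for iterative learning computations; it is an illustration of recursive SQL in general, not of the authors' system.

```python
# Iteration expressed relationally via a recursive CTE in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
rows = conn.execute("""
    WITH RECURSIVE steps(n, x) AS (
        SELECT 0, 64.0                 -- initial value
        UNION ALL
        SELECT n + 1, x / 2.0          -- one "iteration" per recursive row
        FROM steps WHERE n < 5
    )
    SELECT n, x FROM steps
""").fetchall()
print(rows[-1])  # (5, 2.0)
```

Each recursive row plays the role of one loop iteration; a gradient-descent-style update would replace the halving with an expression over model and data tables.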
In this paper, we present MLaut (Machine Learning AUtomation Toolbox) for the Python data science ecosystem. MLaut automates large-scale evaluation and benchmarking of machine learning algorithms on a large number of datasets. MLaut provides a high-level workflow interface for machine learning algorithms, implements a local backend for databases of dataset collections, trained algorithms, and experimental results, and provides easy-to-use interfaces to the scikit-learn and keras modeling libraries. Experiments can be set up in a few lines of code with default settings, while remaining fully customizable at the level of hyper-parameter tuning, pipeline composition, or deep learning architecture. As a principal test case for MLaut, we conducted a large-scale supervised classification study to benchmark the performance of a number of machine learning algorithms; to the best of our knowledge, this is also the first larger-scale study on standard supervised learning datasets to include deep learning algorithms. While confirming many earlier findings in the literature, we found (within the limitations of our study) that deep neural networks do not perform well on basic supervised learning, i.e., outside the more specialized image-, audio-, or text-based tasks.
MapReduce and similar systems significantly ease the task of writing data-parallel code. However, many real-world computations require a pipeline of MapReduces, and programming and managing such pipelines can be difficult. We present FlumeJava, a Java library that makes it easy to develop, test, and run efficient data-parallel pipelines. At the core of the FlumeJava library are a couple of classes that represent immutable parallel collections, each supporting a modest number of operations for processing them in parallel. Parallel collections and their operations present a simple, high-level, uniform abstraction over different data representations and execution strategies. To enable parallel operations to run efficiently, FlumeJava defers their evaluation, instead internally constructing an execution plan dataflow graph. When the final results of the parallel operations are eventually needed, FlumeJava first optimizes the execution plan, and then executes the optimized operations on appropriate underlying primitives (e.g., MapReduces). The combination of high-level abstractions for parallel data and computation, deferred evaluation and optimization, and efficient parallel primitives yields an easy-to-use system that approaches the efficiency of hand-optimized pipelines. FlumeJava is in active use by hundreds of pipeline developers within Google.
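The deferred-evaluation trick can be sketched in a few lines, in Python rather than Java for brevity: operations accumulate into a plan instead of executing, and the plan is "optimized" (here, adjacent maps are fused into a single pass) before anything runs. A conceptual sketch only, not FlumeJava's API.

```python
# Toy deferred parallel collection with map fusion at run() time.

class DeferredCollection:
    def __init__(self, data, plan=()):
        self._data = data
        self._plan = plan               # tuple of ("map", f) steps

    def parallel_do(self, f):
        # No work happens here; we only extend the execution plan.
        return DeferredCollection(self._data, self._plan + (("map", f),))

    def run(self):
        # "Optimize": fuse the chain of maps into a single pass over the data.
        fns = [f for _, f in self._plan]
        def fused(x):
            for f in fns:
                x = f(x)
            return x
        return [fused(x) for x in self._data]

out = (DeferredCollection([1, 2, 3])
       .parallel_do(lambda x: x + 1)
       .parallel_do(lambda x: x * 10)
       .run())
print(out)  # [20, 30, 40]
```

In the real system the same deferral is what allows several logical operations to be collapsed into a single MapReduce, which is where the efficiency of hand-optimized pipelines comes from.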
Artificial intelligence (AI) has the opportunity to revolutionize the way the United States Department of Defense (DoD) and the Intelligence Community (IC) address the challenges of evolving threats, data deluge, and rapid courses of action. Developing an end-to-end AI system requires the parallel development of different pieces that must work together to provide capabilities usable by decision makers, warfighters, and analysts. These pieces include data collection, data conditioning, algorithms, computing, robust AI, and human-machine teaming. While much of the popular press today surrounds advances in algorithms and computing, most modern AI systems leverage advances across many different fields. Further, while certain components may not be as visible to end users as others, our experience shows that each of these interrelated components plays an important role in the success or failure of an AI system. This article is meant to highlight many of the technologies involved in an end-to-end AI system. Its goal is to provide readers with an overview of terminology, technical details, and recent highlights from academia, industry, and government. Where possible, we point to relevant resources for further reading and understanding.
Big data: A survey