Fast Forward Labs Registration and Sign Up Information | fastforwardlabs

Fast Forward Labs Dispatches


Updates from FFL on new papers, articles, and exciting developments


Introducing Anomagram - An Interactive Visualization of Autoencoders, Built with TensorFlow.js

by Victor

Across many business use cases that generate data, it is frequently desirable to automatically identify data samples that deviate from "normal." In many cases, these deviations are indicative of issues that need to be addressed. For example, an abnormally high cash withdrawal from a previously unseen location may be indicative of fraud. An abnormally high CPU temperature may be indicative of impending hardware failure.

The task of finding these anomalies is broadly referred to as Anomaly Detection, and many excellent approaches have been proposed (clustering-based approaches, nearest neighbors, density estimation, etc.). However, as data become high dimensional, with complex patterns, existing approaches (linear models which mostly focus on univariate data) can be unwieldy to apply. For such problems, deep learning can help.

In a recent post on Medium, I introduced Anomagram, an interactive visualization of how autoencoders can be applied to the task of anomaly detection. Anomagram was created as both a learning tool and a prototype of what an ML product interface could look like. The interface is built with TensorFlow.js and allows install-free experimentation in the browser.

The first part of the interface introduces important concepts (autoencoders, data transformations, thresholds, etc.) paired with appropriate interactive visualizations. The second part (pictured below) is geared towards more technical users and allows you to design, train, and evaluate an autoencoder model entirely in the browser.

Train a Model Module: Anomagram provides a direct manipulation interface that allows the user to specify a model (add/remove layers and units within layers), modify model parameters (training steps, batch size, learning rate, regularizer, optimizer, etc.), modify training/test data parameters (data size, data composition), train the model, and evaluate model performance (visualizations of accuracy, precision, recall, false positives, false negatives, ROC, etc.) as each parameter is changed. The task is anomaly detection within ECG signal data.
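To make the evaluation step concrete, here is a minimal Python sketch of how a threshold on reconstruction error turns into the metrics listed above. It is not Anomagram's code (which runs in TensorFlow.js); the error values, labels, threshold, and the names `ThresholdReport` and `evaluate_threshold` are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ThresholdReport:
    """Confusion-matrix counts for one choice of threshold."""
    true_pos: int
    false_pos: int
    true_neg: int
    false_neg: int

    @property
    def accuracy(self) -> float:
        total = self.true_pos + self.false_pos + self.true_neg + self.false_neg
        return (self.true_pos + self.true_neg) / total if total else 0.0

    @property
    def precision(self) -> float:
        flagged = self.true_pos + self.false_pos
        return self.true_pos / flagged if flagged else 0.0

    @property
    def recall(self) -> float:
        actual = self.true_pos + self.false_neg
        return self.true_pos / actual if actual else 0.0


def evaluate_threshold(errors, is_anomaly, threshold):
    """Flag a sample as anomalous when its reconstruction error exceeds the threshold."""
    tp = fp = tn = fn = 0
    for err, truth in zip(errors, is_anomaly):
        flagged = err > threshold
        if flagged and truth:
            tp += 1
        elif flagged and not truth:
            fp += 1
        elif not flagged and truth:
            fn += 1
        else:
            tn += 1
    return ThresholdReport(tp, fp, tn, fn)


# Hypothetical reconstruction errors from an autoencoder trained on normal data only.
errors = [0.02, 0.03, 0.05, 0.04, 0.30, 0.25, 0.06, 0.45]
labels = [False, False, False, False, True, True, True, True]

report = evaluate_threshold(errors, labels, threshold=0.10)
print(report)
print(report.accuracy, report.precision, report.recall)
```

Lowering the threshold in this toy data would catch the anomaly with error 0.06 only by flagging normal samples as well, which is the trade-off the interface lets you explore.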

If you're interested in learning more about other deep learning approaches to anomaly detection, my colleagues and I will cover additional details on this topic in our upcoming report on Deep Learning for Anomaly Detection. (Please join us for a webinar on this topic on February 13th at 10:00am PT!)

In the meantime, you can read my full article on Medium, view the full demo of Anomagram here, and find the project source code here.

A Symbiotic Relationship: Knowledge Graphs & Machine Learning

by Andrew

For the past decade, humans have unknowingly come to depend on Knowledge Graphs on a daily basis. From personalized shopping recommendations to intelligent assistants and user-friendly search results, many of these accepted (and expected) features have come to fruition through the exploitation of knowledge graphs. Despite their longstanding conceptual and practical existence, knowledge graphs were just added to the Gartner Hype Cycle for Emerging Technologies in 2018 and have continued to garner attention as an area of active research and development for their distinct ability to represent real-world relationships.

In this article, we'll take a high-level look at what knowledge graphs are and explore a few ways they interact with the field of machine learning.

What is a knowledge graph?

With all the hype comes confusion. In its simplest form, a knowledge graph is a set of data points linked by relations that describe a real-world domain. A cursory Google search will result in a myriad of explanations, but I believe there are a few core concepts that characterize a knowledge graph implementation.

cypher_graph_v1


It's a graph - Contrary to traditional data stores, knowledge graphs are composed not only of entities, but also connections between each entity. In a graph network, these entities are called Nodes or Vertices and are connected together via Edges or Links. Graph data structures excel at modeling one-to-many relationships.
It provides context - Knowledge graphs glean semantic meaning by design - namely, the meaning of the data is implicitly encoded in the data representation itself, making it easy to query and explore. In the example above, we can quickly interpret that Jennifer is a Person who works for a Company called Neo4j because of the inherent directional metadata structure.
It's intelligent - Knowledge graphs are built from dynamic, logical constructs - ontologies - that by default possess a framework supportive of inference. Regardless of the specific entities in the graph, the entity-to-entity connections hold fundamental meaning.
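The Jennifer example can be sketched as labeled nodes joined by a directed, named edge. This is a toy in-memory structure, not how Neo4j or any production graph store works; the class `TinyGraph` and the relation name `WORKS_FOR` are assumptions made for illustration.

```python
from collections import defaultdict


class TinyGraph:
    """A toy knowledge graph: labeled nodes plus directed, named edges."""

    def __init__(self):
        self.labels = {}                 # node name -> type label
        self._out = defaultdict(list)    # subject -> [(predicate, object)]

    def add_node(self, name, label):
        self.labels[name] = label

    def add_edge(self, subject, predicate, obj):
        self._out[subject].append((predicate, obj))

    def objects(self, subject, predicate):
        """Follow edges forward: what does `subject` point to via `predicate`?"""
        return [o for p, o in self._out.get(subject, []) if p == predicate]

    def subjects(self, predicate, obj):
        """Follow edges backward: who points to `obj` via `predicate`?"""
        return [s for s, edges in self._out.items()
                for p, o in edges if p == predicate and o == obj]

    def describe(self, subject, predicate, obj):
        """Render one fact with the type of each entity attached."""
        return (f"{subject} ({self.labels[subject]}) "
                f"-[{predicate}]-> {obj} ({self.labels[obj]})")


kg = TinyGraph()
kg.add_node("Jennifer", "Person")
kg.add_node("Neo4j", "Company")
kg.add_edge("Jennifer", "WORKS_FOR", "Neo4j")  # relation name is an assumption

print(kg.objects("Jennifer", "WORKS_FOR"))
print(kg.subjects("WORKS_FOR", "Neo4j"))
print(kg.describe("Jennifer", "WORKS_FOR", "Neo4j"))
```

Because the edge carries both a direction and a name, the same stored fact answers "where does Jennifer work?" and "who works at Neo4j?" without any extra schema.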

A Familiar Example

A concrete and relatable application of knowledge graph technology is demonstrated by Google's Knowledge Graph, which was launched in 2012 and has become a relied-upon feature for all Google search users.

When searching for a specific person, Google provides users a side panel that contains relevant information surrounding the entity/subject in the query. This quick insight is made possible by Google's Knowledge Graph - a pre-populated knowledge base of connected facts relating people, places, and things. Because the graph structure effectively represents this type of data by design, the facts seen above can be easily called upon to provide contextual insight.

Intersection of Machine Learning and Knowledge Graphs

Now that we have established a baseline intuition of what knowledge graphs are, let's take a look at ways machine learning and knowledge graphs support each other.

Getting knowledge into a knowledge graph

Because knowledge graphs preserve relational information (and are therefore more complex than traditional data representations), the data they take in demands a more refined state. Specifically, the edges between nodes must be established and then wrangled into a complementary form before populating a graph.

Let's imagine a hand-crafted graph describing characteristics of Sir Alex Ferguson as seen above. Defining these entities and relationships is a simple endeavor for anyone knowledgeable about the English Premier League (EPL), and organizing the connections upfront allows the graph to be efficiently queried later on. But what happens if we want to create subgraphs for every manager in the EPL? Or every soccer manager in the world? Or every professional sports manager that ever existed?

Identifying all of these relationships by hand is not scalable. This is where machine learning and Natural Language Processing (NLP) offer intelligent solutions to automatically curate raw data into usable facts. The general techniques involved include sentence segmentation, part-of-speech tagging, dependency parsing, word sense disambiguation, entity extraction, entity resolution, and entity linking applied to corpora of both structured and unstructured data.

The simplified example above is intended to highlight the general NLP process on a single sentence. In practice, organizations use more advanced, patented systems built on these underlying techniques to automatically extract information, resolve conflicting entities, and populate millions of entities into production knowledge graphs.

Getting richer knowledge out of a knowledge graph

"Increasingly we're learning that you can make better predictions about people by getting all the information from their friends and their friends' friends than you can from the information you have about the person themselves."

- James Fowler, Connected

The quote above poses a justified, but unconventional approach to predictive modeling. Traditional machine learning focuses on modeling tabular data that inherently cannot represent all of the cascading relationships found within networks and knowledge graphs. This often means data scientists are left trying to abstract, simplify, and even leave out predictive relationships baked into a knowledge graph's structure. But what if features of every node in a knowledge graph could be derived from the context of all the nodes and edges around them?

There are a few different methods for making use of connected features in machine learning, but a main area of attention is Knowledge Graph Embeddings (KGE). The goal of KGEs is to learn a fixed vector space representation of any given node in a graph based on its nearby connections. Drawing a quick parallel to the Word2Vec algorithm (and concept of word embeddings) - where we learn a fixed vector representation for every word in a corpus based on nearby words - helps to frame the concept of KGEs. Specifically, the Node2Vec model expands upon ideas from Word2Vec by first randomly traversing subgraphs for each node in a network to build a large number of sequences [sentences]. Once we have a body of graph sequences [corpus], we can utilize Word2Vec methodology as it applies to text sequences to produce graph node embeddings.
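The "sentences from a graph" step can be sketched in a few lines of Python. Note the simplification: Node2Vec biases its walks with return and in-out parameters, whereas this sketch picks the next neighbor uniformly at random (closer to the DeepWalk approach). The small graph and the function name `random_walks` are hypothetical.

```python
import random


def random_walks(adjacency, walk_length, walks_per_node, seed=0):
    """Generate node sequences ("sentences") by walking the graph at random.

    Assumption: neighbors are chosen uniformly; Node2Vec proper uses biased walks.
    """
    rng = random.Random(seed)
    walks = []
    for start in adjacency:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency.get(walk[-1], [])
                if not neighbors:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Hypothetical toy graph, stored as an adjacency list.
graph = {
    "Alex Ferguson": ["Manchester United", "Scotland"],
    "Manchester United": ["Alex Ferguson", "Premier League"],
    "Premier League": ["Manchester United"],
    "Scotland": ["Alex Ferguson"],
}

walks = random_walks(graph, walk_length=5, walks_per_node=2)
for walk in walks:
    print(" -> ".join(walk))
```

Each walk plays the role of a sentence and each node the role of a word, so the resulting list can be handed to any Word2Vec implementation that accepts tokenized sentences.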

Ultimately, by learning embedding representations from the full context of a knowledge graph, we can extract rich features to be used in downstream tasks. A few uses are:

Link prediction - Can we find nodes that are likely connected or are about to be connected? (For example, a graph of products and customers connected by orders could be used to predict (and thus recommend) which new products should likely be connected to which new customers.)
Supervised modeling - Embeddings can be fed as input features to supervised models for classification tasks.
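As a rough illustration of link prediction, the sketch below ranks candidate products for a customer by the cosine similarity of their embeddings. The two-dimensional vectors are invented, and cosine similarity is only one possible scoring choice; many KGE methods learn their own scoring function.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidate_links(embeddings, source, existing, top_k=3):
    """Score every node not yet linked to `source` and return the best matches."""
    scored = []
    for node, vector in embeddings.items():
        if node == source or node in existing:
            continue
        scored.append((cosine(embeddings[source], vector), node))
    scored.sort(reverse=True)
    return [(node, round(score, 3)) for score, node in scored[:top_k]]


# Hypothetical embeddings; in practice these would be learned from the graph.
embeddings = {
    "customer_a": [1.0, 0.0],
    "product_x": [0.9, 0.1],   # already ordered by customer_a
    "product_y": [0.0, 1.0],
    "product_z": [0.7, 0.7],
}

ranked = rank_candidate_links(embeddings, "customer_a", existing={"product_x"})
print(ranked)
```

The highest-scoring unlinked node is the recommendation candidate; the same vectors could instead be passed as features to a supervised classifier.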


Final Thoughts

Knowledge graphs are an effective tool for modeling interconnected, real-world scenarios while retaining contextual details that are not easily captured with traditional data structures. In this article, we explored two examples that demonstrate the symbiotic relationship between knowledge graphs and machine learning, which only scratches the surface of the intersection between the two technologies. Additional concepts - like Graph Neural Networks and ML driven Entity Resolution - stand as exciting areas of research and application.

Upcoming Events

Victor Dibia and Nisha Muktewar will be hosting a webinar on Deep Learning for Anomaly Detection on February 13th, in conjunction with the launch of our newest research report. Register today!
Victor and Nisha will also be presenting on Deep Learning for Anomaly Detection at the Strata Data Conference in San Jose on March 18th.

© 2020 Cloudera, Inc. All rights reserved.

395 Page Mill Rd., 3rd Floor, Palo Alto, CA 94306

Cloudera and the Cloudera logo are trademarks or registered trademarks of Cloudera, Inc.
in the USA and other countries. All other trademarks are the property of their respective
companies. Information is subject to change without notice.

Terms & Conditions | Privacy Policy and Data Policy

To unsubscribe from future emails or to update your email preferences, click here.


Updates from Cloudera Fast Forward on new research, prototypes, and exciting developments


Welcome to the July edition of Cloudera Fast Forward's monthly newsletter. This month, we have new research, new blog posts, new recommended reading, and we're very excited to invite you to our next research webinar tomorrow - Deep Learning for Automated Question Answering!

New research: NLP for Question Answering

We're rounding out our blog series on Question Answering with these three recent posts.

Evaluating QA: the Retriever & the Full QA System

This post focuses on a vital component of a modern Information Retrieval-based (IR) QA system: the Retriever. Specifically, we introduce Elasticsearch as a powerful and efficient IR tool that can be used to scour through large corpora and retrieve relevant documents. We explain how to implement and evaluate a Retriever in the context of Question Answering and demonstrate its impact on an IR QA system.
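One common way to evaluate a Retriever is to check how often a relevant document appears in its top k results. The sketch below computes that with hard-coded rankings standing in for search results; the document IDs, the questions, and the choice of this particular metric are assumptions rather than details taken from the post.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of questions whose top-k retrieved documents include a relevant one."""
    if not retrieved:
        return 0.0
    hits = 0
    for question, ranked_docs in retrieved.items():
        if set(ranked_docs[:k]) & relevant[question]:
            hits += 1
    return hits / len(retrieved)


# Hypothetical ranked results, as a search engine might return them per question.
retrieved = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d2", "d9", "d4"],
    "q3": ["d5", "d6", "d8"],
}
# Hypothetical ground truth: documents known to contain each answer.
relevant = {
    "q1": {"d1"},
    "q2": {"d4"},
    "q3": {"d0"},
}

for k in (1, 2, 3):
    print(k, recall_at_k(retrieved, relevant, k))
```

A Reader can only answer from what the Retriever hands it, so this number acts as a ceiling on the accuracy of the full QA system.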

How to Maximize Retriever Performance on a More Natural Dataset

Implementing question answering for real-world use cases is a bit more nuanced than evaluating system performance against a toy dataset. In this post, we explore several challenges faced by the Retriever when applying IR-QA to a more realistic dataset, as well as a few practical approaches for overcoming them.

Beyond SQuAD: How to Apply a Transformer QA Model to Your Data

Finally, we discuss how a Reader trained on SQuAD2.0 might perform on different datasets, particularly on highly specialized data - such as a collection of legal contracts, financial reports, or technical manuals. In this post we perform experiments designed to highlight how to adapt Transformer models to specialized domains and provide guidelines for practical applications.

If you're just catching up to our Question Answering series, check out the first three posts:

Intro to Automated Question Answering

Building a QA System with BERT on Wikipedia

Evaluating QA: Metrics, Predictions, and the Null Response

New on the blog

How to Explain HuggingFace BERT for Question Answering NLP Models with TF 2.0

Recently, our team at Fast Forward Labs has been exploring state-of-the-art models for Question Answering and has used the rather excellent HuggingFace transformers library. As we applied BERT for QA models (BERTQA) to datasets outside of Wikipedia (e.g., legal documents), we observed a variety of results. Naturally, one of the things we have been exploring is methods to better understand why the model provides certain responses, and especially when it fails. This post focuses on the following questions:

What are some approaches for explaining a BERT based model?
Why are Gradients a good approach?
How to implement Gradient explanations for BERT in TensorFlow 2.0?
Some example results and visualizations!

The post comes along with an interactive Colab notebook which you can try out! - Victor

Events

Webinar: Deep Learning for Automated Question Answering

July 30, 2020 10:00am PT | 1:00pm ET

What if you could ask your email client, "Who sent me the link with the latest financial report?" Automated question answering is a human-machine interaction to extract information from data using natural language. This general capability can take many forms, but one of the most exciting developments has been question answering from unstructured text data, including the massive amounts of information contained in emails, social media posts, blogs, log files, financial statements - and the list goes on. Thanks to a series of advances in deep learning techniques in the past two years, question answering capabilities have grown rapidly, and while still emerging, it's the perfect time to examine how this technology works, when it works well, and where it might still fall short.

In this webinar, we'll cover

General architecture of modern QA systems
Deep learning techniques for QA
Guidance on applying these techniques to a practical use case

We'll also do a live demonstration of a QA prototype that we built. You won't want to miss it, so we hope to see you there!



Upcoming Interpretability Webinar!

We are pleased to announce that we will soon be releasing an updated edition of our report on Interpretability. In conjunction with the report release, please join us on April 9th for a webinar entitled: Opening the ML Black Box: Deploying Interpretable Models to Business Users. You can register here!

Bias in Knowledge Graphs - Part 1

by Keita


Introduction

This is the first part of a series to review Bias in Knowledge Graphs (KG). We aim to describe methods of identifying bias, measuring its impact, and mitigating that impact. For this part, we'll give a broad overview of this topic.

Motivation

Knowledge graphs, graphs with built-in ontologies, create unique opportunities for data analytics, machine learning, and data mining. They do this by enhancing data with the power of connections and human knowledge. Microsoft, Google, and Facebook actively use knowledge graphs in their products, and the interest from large and medium enterprises is accelerating. Andrew Reed gives a great overview of knowledge graphs in a previous article.

How are knowledge graphs used? Often they are deployed in the backend of an application, for example, supporting search results or responses from conversational AI. In other cases, knowledge graphs are used more directly to grow a knowledge base by finding or validating new information.

As the usage of this technology ramps up, bias in these systems becomes a problem that can contaminate results, degrading the user experience or driving bad decisions. In the last 1-2 years, interest has grown in identifying and removing bias.

Here are some hypothetical cases where bias in knowledge graphs could raise issues:

Conversational AI: Catherine, a college junior, interacts with a 'career bot', a conversational AI agent that offers job advice to graduating students. A knowledge graph based on the university's record of successful alumni underpins the AI agent. Catherine is a pre-med major with aspirations to become a surgeon. In the school's records, most successful surgeons are male. The conversational AI steers Catherine towards medical fields where there are historically more women.


Search: John is using a search engine to research vaccines. He is a layman with no deep knowledge of this area. The search results include hyperlinks and a sidebar of information and links generated from a large structured data source (based on "Wiki-Encyclopedia"). Wiki-Encyclopedia's article has been curated and updated by many people who have strong - but false - notions about the side-effects and efficacy of vaccines. As a result, when John reviews the search results and sidebar, he comes away with flawed - not well informed - notions about vaccines.

Knowledge Base Building: A hospital is building and expanding a knowledge graph. Part of this process involves algorithmically accepting or rejecting new 'facts' to add to the knowledge graph. If the foundational data is itself biased, it could lead to the machine rejecting legitimate facts that go against the bias of the foundational data.

Types of Bias

In general, our work is focused on bias that results in "systematic errors of judgment and decision making" by the consumers of KG & ML applications*.

Bias is a broad topic with many context-dependent definitions. Data scientists and statisticians are concerned with bias that is more technical and measurable, while less technical stakeholders may have their own definitions and standards for identifying when bias occurs.

Within the machine learning community, several types of bias have been identified and studied. (Mehrabi et al. define 23 types of bias relevant to machine learning in a recent paper.)

Bias Along the ML/Analytical Pipeline

Aside from the types of bias, there are also places in the stages of an analytical or machine learning pipeline where bias can be identified.

Data. Structured and unstructured data form the raw materials for building knowledge graphs. This data can be crowd-sourced, as with Wikipedia and Amazon's Mechanical Turk, or it can be gathered and curated privately, as with a private corporation's records and transactions.

If data was generated by people with a prevalent opinion (self-selection bias) or from a majority of people of a certain cultural perspective (sometimes called representational or population bias), this can impact the downstream results. An example of self-selection bias is when customers who have strong motivations write service reviews. These may not reflect the majority of customers, but if a knowledge graph is built on top of such data, it may learn a distorted view of customer sentiment.

Semantic/Ontology. An ontology is a framework of meaning that supports the input data and their relationships. Such frameworks are constructed top-down or bottom-up, and can be manually designed or formed algorithmically. If built by a team of experts, conscious and representational bias can impact the structure of the ontology. If built by machine, bias in the underlying data can bleed into the ontology.

An example can be found in geographical ontologies. Anthropocentric biases lead designers to overemphasize human-centric locations versus natural ones. The Place branch of the DBpedia ontology (as of 2015) contained "dozens or even hundreds of classes for various sub-classes of restaurants, bars, and music venues, but only a handful of classes for natural features such as rivers" [Janowicz].

Knowledge Graph Embeddings. Embeddings are lower-dimensional representations that enable more efficient processing of knowledge graph data, which is normally in a high-dimensional, hard-to-wrangle form. It has recently been shown that social biases in knowledge graphs can get passed on to their respective embeddings [Fisher].

Inferential. Inference refers to when a query, machine learning algorithm, or fact-learning algorithm learns from a knowledge graph, or its embeddings. An oft-mentioned example is that of an inferential algorithm learning that only men can be the US President, because historically that has been the only case.

References

J. Fisher, Measuring Social Bias in Knowledge Graph Embeddings, Dec 2019.

K. Janowicz et al., Debiasing Knowledge Graphs: Why Female Presidents are not like Female Popes, Oct 2018.

N. Mehrabi et al., A Survey of Bias and Fairness in Machine Learning, Sept 2019.

Notes

*Drawing from the definition in the K. Janowicz reference.

Our Recent Research

NLP for Question Answering

What if you could ask your email client, "Who sent me the link with the latest financial report?" Automated question answering is a human-machine interaction to extract information from data using natural language. This general capability can take many forms, but one of the most exciting developments has been question answering from unstructured text data, including the massive amounts of information contained in emails, social media posts, blogs, log files, financial statements - and the list goes on. Thanks to a series of advances in deep learning techniques in the past two years, question answering capabilities have grown rapidly, and while still emerging, it's the perfect time to examine how this technology works, when it works well, and where it might still fall short.

Blog: NLP for Question Answering
Prototype: Neural QA
Webinar: Deep Learning for Automated Question Answering

Causality for Machine Learning

Machine learning allows us to detect subtle correlations in large data sets, and use those correlations to make accurate predictions. However, these subtle correlations are often spurious - they exist only in a particular dataset - and the resultant model performs poorly, or gives unexpected results in the real world. Moreover, reasoning based on spurious correlations is dangerous. Business decisions should be based on things that are true, not things that are true only in a limited dataset. The trouble, of course, is identifying what is spurious and what is not. In this report, we explain how combining causal inference with machine learning can help us address these problems.

Report: Causality for Machine Learning
Prototype: Scene
Webinar: Causality for Machine Learning

Interpretability

Machine learning (ML) techniques like deep learning can deliver transformative business outcomes, yet the black-box nature of these approaches creates barriers of understanding that can slow adoption to a halt. ML model interpretability, or the ability to explain why and how a model makes a prediction, can enable enterprises to quickly understand predictive outcomes and confidently make decisions that optimize for future business results.

Report: Interpretability
Prototype: Refractor
Webinar: Opening the Machine Learning Black Box: Deploying Interpretable Models to Business Users

Events

Conference: Deep Learning for Anomaly Detection

Nisha Muktewar will be speaking about our research on Deep Learning for Anomaly Detection at Open Data Science Conference Europe on September 17th.



Welcome to the October edition of Cloudera Fast Forward's monthly newsletter. We're happy to share some of our latest research, and invite you to our next webinar, which is tomorrow!

New research release!

Structural Time Series


Time series data is ubiquitous, and forecasting has a long history. Generalized additive models give us a simple, flexible and interpretable means for modeling some kinds of time series, especially where there is seasonality. We look at the benefits and trade-offs of taking a curve-fitting approach to time series, and demonstrate its use via Facebook's Prophet library on a demand forecasting problem.
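The curve-fitting idea can be illustrated with a hand-rolled additive model: a trend plus a seasonal term built from a Fourier series. This is a sketch of the general shape of such models, not Prophet's API; Prophet fits these components from data, whereas the intercept, slope, period, and coefficients below are invented, and the trend here is a plain straight line.

```python
import math


def seasonal(t, period, coeffs):
    """Fourier-series seasonality: sum of a_n*cos + b_n*sin at harmonics of the period."""
    total = 0.0
    for n, (a, b) in enumerate(coeffs, start=1):
        angle = 2.0 * math.pi * n * t / period
        total += a * math.cos(angle) + b * math.sin(angle)
    return total


def forecast(t, intercept, slope, period, coeffs):
    """Additive model: linear trend plus periodic seasonal component."""
    return intercept + slope * t + seasonal(t, period, coeffs)


# Hypothetical parameters: weekly seasonality (period of 7 days), two harmonics.
weekly_coeffs = [(0.0, 5.0), (1.5, -0.5)]

for day in range(0, 15, 7):
    print(day, round(forecast(day, 100.0, 2.0, 7.0, weekly_coeffs), 3))
```

Because each component is a separate term, the trend and the seasonal pattern can be inspected on their own, which is where the interpretability of this family of models comes from.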

Our report, Structural Time Series, is freely available online, and accompanied by code applying the techniques discussed to forecasting electricity demand in California.

Research preview: Semantic Image Search

Within this research cycle, we will be revisiting the topic of semantic search on image data. We explore two critical requirements for semantic search at scale - strategies for creating semantic representations of images (supervised, unsupervised, and semi-supervised methods) and methods for fast approximate nearest neighbor search (e.g., FAISS). We will also be releasing an update to the well-used ConvNet Playground App, and a set of scripts and tutorials for implementing semantic image search on the Cloudera Machine Learning platform.

Events

Our first ever Research Roundup webinar is tomorrow! Join us to hear about our two recent releases: meta-learning, and structural time series. If you can't make it, catch up on-demand later!

© 2020 Cloudera, Inc. All rights reserved.

5470 Great America Pkwy, Santa Clara, CA 95054



Welcome to the December edition of Cloudera Fast Forward's monthly newsletter. We have a bumper pack of releases for the holiday season: a new research release, the open sourcing of three previous reports, and, as usual, our team's recommended reading for the month.

New research release!

Few-Shot Text Classification


Text classification is a ubiquitous capability with a wealth of use cases including sentiment analysis, topic assignment, document identification, article recommendation, and more. But collecting enough annotated examples to train traditional classifiers can be quite costly. Instead, we take a look at a classic technique that can be used to perform text classification with few or even zero training examples! We're talking about text embeddings, of course. New advances have significantly increased the quality of document embeddings, and in our newest writing on Few-Shot Text Classification this cycle, we cover:

how to use them for topic classification,
best practices for using them,
and potential limitations.
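One way to picture embedding-based classification with only a handful of examples: average the embeddings of the few labeled documents per topic, then assign a new document to the nearest average. The three-dimensional vectors below are invented stand-ins for real document embeddings, and this nearest-prototype scheme is an illustrative assumption, not necessarily the method in the report.

```python
import math


def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(doc_vector, prototypes):
    """Return the label whose prototype is most similar to the document."""
    return max(prototypes, key=lambda label: cosine(doc_vector, prototypes[label]))


# Hypothetical embeddings for two labeled examples per topic.
examples = {
    "sports": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "technology": [[0.1, 0.1, 0.9], [0.0, 0.3, 0.8]],
}
prototypes = {label: mean_vector(vectors) for label, vectors in examples.items()}

prediction = classify([0.2, 0.1, 0.7], prototypes)
print(prediction)
```

With zero labeled examples, the same `classify` call could be given embeddings of the label names themselves in place of the averaged prototypes.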

Follow the links in the report to find code snippets so you can try it for yourself, and build your own demo so you can see the method in action!

Federated Learning open source


Two years ago we wrote a research report about Federated Learning. We're pleased to make the report available to everyone, for free. You can read it online here: Federated Learning.

In the time since, it has only grown in relevance. Numerous startups have cropped up (and some disappeared by acquisition) with Federated Learning as their core technology. Google continues to promote the technology, including for non-machine learning use cases, as in Federated Analytics: Collaborative Data Science without Data Collection. This year saw (what we believe to be) the first conferences with a heavy focus on federated learning, The Federated Learning Conference and the OpenMined Privacy Conference, as well as dedicated workshops at high-profile machine learning conferences like ICML and NeurIPS.

OpenMined continues to build a strong community around private machine learning, creating courses and open source tools to lower the barrier-to-entry to federated learning and related privacy enhancing techniques. Alongside those, TensorFlow Federated, IBM's federated learning library and flower.dev are extending the tooling ecosystem.

Federated Learning is no panacea. In a privacy setting, decentralized data simply presents a different attack surface to centralized data. Not all applications require or benefit from federation. However, it is an important tool in the private machine learning toolkit.

Deep Learning for Image Analysis

To accompany last month's research on Semantic Image Search (check out the associated blog post Representation Learning 101 for Software Engineers), we're opening up some more previous reports:

Deep Learning for Image Analysis is an oldie, having been released back in 2015, but still provides an introduction for the uninitiated.
Our more recent release on the same, Deep Learning for Image Analysis: 2019 edition, substantially expands the first report and covers some practical considerations, like trading off accuracy and latency, and interpreting model predictions. July 2019 is a long time ago in computer vision research, and while the benchmarks may have improved, the underlying concepts discussed are still relevant.

Enterprise Grade ML

by Shioulin

At Cloudera Fast Forward, one of the mechanisms we use to tightly couple machine learning research with application is through application development projects for both internal and external clients. The problems we tackle in these projects are wide ranging and cut across various industries; the end goal is a production system that translates data into business impact.

What is Enterprise Grade Machine Learning?

Enterprise grade ML, a term mentioned in a paper put forth by Microsoft, refers to ML applications where there is a high level of scrutiny for data handling, model fairness, user privacy, and debuggability. While toy problems that data scientists solve on laptops using a CSV dataset could be intellectually challenging, they are not enterprise grade machine learning problems.

The current state of Enterprise Grade ML

In many of our projects, the most difficult portion is understanding the business problem and defining a mathematical version that can be solved with the data that is available. Sometimes this mathematical version is not what the business stakeholders imagined it to be - this version might only partially solve the original business problem due to data realities. Very often, the business problem is broken down into smaller subproblems. The outputs of these subproblems then feed into a thin layer of business logic/rules to arrive at a final model output.

Once the problem is clearly defined, and data is flowing properly into the modeling environment, building a model is rather straightforward. When model building becomes convoluted, it can be taken as an indicator of an incorrect problem formulation. There are various ways to approach model building (feature creation, model selection, experimentation) ranging from fully custom approaches to highly automated processes. We are partial to the old-school Python-leveraging-packages approach but can envision the usefulness of AutoML if a data scientist has strong intuition about the business problem and solid understanding of the dataset.

In deployment (via containers or Spark applications, for example), governance becomes paramount, especially in regulated environments. Data lineage, data versioning, model versioning, model explainability, and model monitoring are all front and center.

Today, we very often need to stitch together ad hoc tools to accomplish all the above. What does the future look like? A recent paper outlines a 10-year prediction for enterprise-grade ML. Along the lines of Software 2.0, the authors view ML models as software derived from data. Most of us in the ML space would agree with this view, and would also acknowledge that even though ML is software, in today's practice we don't yet (always) adopt known best practices in software development.

Future state for Enterprise-Grade ML

The authors look to the future from three perspectives: i) model development/training, ii) model scoring, and iii) model management/governance.

Reference architecture for canonical data science lifecycle (Flock) [src](https://arxiv.org/abs/1909.00084)

On model development/training, they believe training and development work will move to the cloud, either private or public. This is consistent with our observations.

On governance, the authors believe that all data, including deployed models (to be thought of as derived data) and inferences made using them will need to be robustly governed. This is something we attempt to do in our current projects - capture code that trained the model, training data that went into it, model inference results - albeit in an ad hoc/brittle way, depending on existing architecture.
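As a rough illustration of the kind of capture described above, here is a minimal sketch of a lineage record that fingerprints the training code and training data and stores inference results next to them. The function names, field names, and sample values are all hypothetical; this is not the tooling used in our projects or anything proposed in the paper.

```python
import hashlib
import json


def fingerprint(payload: bytes) -> str:
    """Return a SHA-256 hex digest that identifies the payload."""
    return hashlib.sha256(payload).hexdigest()


def lineage_record(training_code: str, training_rows: list, predictions: list) -> dict:
    """Bundle hashes of the code and data with the inference results (hypothetical schema)."""
    # Serialize rows deterministically so the same data always yields the same hash.
    data_bytes = json.dumps(training_rows, sort_keys=True).encode("utf-8")
    return {
        "code_sha256": fingerprint(training_code.encode("utf-8")),
        "data_sha256": fingerprint(data_bytes),
        "n_training_rows": len(training_rows),
        "predictions": predictions,
    }


# Placeholder inputs standing in for a real training script, dataset, and scoring log.
record = lineage_record(
    training_code="model = fit(X, y)  # placeholder training script",
    training_rows=[{"x": 1, "y": 0}, {"x": 2, "y": 1}],
    predictions=[{"input": {"x": 3}, "output": 1}],
)
print(json.dumps(record, indent=2))
```

Because the hashes change whenever the code or data change, a record like this can at least show which inputs produced a given set of predictions, even when the surrounding architecture offers nothing better.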

The most interesting viewpoint (to me) is their perspective on model scoring. Because machine learning models are software artifacts derived from data, the dual nature of software artifacts and derived data suggests that the boundary between the data world and the modeling world will be fuzzy. The authors believe that inference pipelines will be close to data, and inference on data stored in a database management system should be done as an extension of the query runtime. In other words, models should be represented as first-class data types in a database management system. To investigate this, they "integrated ONNX Runtime (a performance-focused inference engine for ONNX) within SQL server and developed an in-database cross-optimizer between SQL and ML to enable optimizations across hybrid relational and ML expressions." Early results indicate that running inference inside the database management system is very promising.
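To make the idea of scoring as an extension of the query runtime a little more concrete, here is a toy sketch using SQLite's user-defined functions from Python's standard library. It is not the authors' ONNX Runtime and SQL Server integration, and it involves no cross-optimization; the table, the column names, and the hand-written linear "model" are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical stand-in for a trained model: a hand-written linear scorer.
WEIGHTS = {"amount": 0.002, "new_location": 1.5}
BIAS = -3.0


def score(amount, new_location):
    """Return a raw risk score for one row; positive means 'suspicious'."""
    return BIAS + WEIGHTS["amount"] * amount + WEIGHTS["new_location"] * new_location


conn = sqlite3.connect(":memory:")
# Register the scorer so SQL queries can call it like a built-in function.
conn.create_function("score", 2, score)

conn.execute(
    "CREATE TABLE withdrawals (id INTEGER PRIMARY KEY, amount REAL, new_location INTEGER)"
)
conn.executemany(
    "INSERT INTO withdrawals (amount, new_location) VALUES (?, ?)",
    [(40.0, 0), (2500.0, 1), (900.0, 0)],
)

# Inference happens inside the query: filtering and scoring share one statement.
flagged = conn.execute(
    "SELECT id, score(amount, new_location) AS risk "
    "FROM withdrawals WHERE score(amount, new_location) > 0 ORDER BY id"
).fetchall()
print(flagged)
```

The point of the sketch is only that the rows never leave the database to be scored; a real system of the kind the paper describes would also let the optimizer reason about the model and the relational operators together.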

As ML adoption quickens within enterprises and ML drives many business decisions, the attention will shift to effects of these models. To reach a state where ML models are defensible (privacy, security, interpretability, speed) without much technical debt, the DB community and the ML community will both shape the future of these ML end-to-end pipelines.

Research Updates

Interpretability

Towards the beginning of April, we hosted a webinar on interpretability entitled Opening the ML Black Box: Deploying Interpretable Models to Business Users. You can catch the replay here!

We also re-released our previous research on Interpretability (with a few updates) to the public.

Causality for Machine Learning

Machine learning allows us to detect subtle correlations in large data sets, and use those correlations to make accurate predictions. However, these subtle correlations are often spurious - they exist only in a particular dataset - and the resultant model performs poorly, or gives unexpected results in the real world.

Reasoning based on spurious correlations is dangerous. Business decisions should be based on things that are true, not things that are true only in a limited dataset. The trouble, of course, is identifying what is spurious and what is not.

Join us on May 28th at 10:00am PT / 1:00pm ET for a webinar on Causality for Machine Learning. During the webinar, we'll explain how combining causal inference with machine learning can help us address these problems. We'll cover:

when you should think about causality and lessons to apply in your data science practice
the latest research at the intersection of machine learning and causality
how causal thinking helps us write models that generalize to new circumstances, including an example of the causal approach applied to a computer vision problem

We'll also discuss the ethical implications of causality. We look forward to seeing you there!

NLP for Question Answering

Typically, our applied research culminates in a series of comprehensive reports provided to our customers on a quarterly basis, along with a live webinar demonstrating the prototypes we build in conjunction with that research. But times they are a-changin' and we're experimenting with new formats for distributing our content! This time, instead of waiting until the prototype is finished and the report is polished, we thought it would be fun to invite you to join us while we build.

We've launched a blog to host this endeavor at qa.fastforwardlabs.com. Learn more here, and follow us on Twitter for updates on when new content is posted!

Cloudera Machine Learning

Enabling Production MLOps at Scale - a Technical Preview

For enterprises, getting machine learning (ML) models to production and scale has been a significant challenge. Today, only an estimated 12% of ML models make it to production. To tackle this challenge, Cloudera has released Cloudera Machine Learning (CML) MLOps - a comprehensive and secure production ML platform, built on a 100% open-source standard and fully integrated with Cloudera Data Platform. CML breaks the walls to production and enables end-to-end ML workflows at scale.

Join this webinar on May 6th at 10:00am PT / 1:00pm ET to:

Learn how CML's MLOps functionality eliminates the model 'black box' and drives secure, transparent ML workflows from data to experimentation to production at scale.
Experience CML's robust and flexible model monitoring service for both technical metrics (latency, throughput, etc.) and mathematical/functional monitoring, including first-class prediction tracking, metric stores, and a Python SDK.
See how CML's unique model cataloging and model lineage capabilities eliminate silos and lead to better, faster results.

You can register here.

© Cloudera, Inc. All rights reserved.

395 Page Mill Rd., 3rd Floor, Palo Alto, CA 94306

Cloudera and the Cloudera logo are trademarks or registered trademarks of Cloudera, Inc.
in the USA and other countries. All other trademarks are the property of their respective
companies. Information is subject to change without notice.

Terms & Conditions | Privacy Policy and Data Policy

To unsubscribe from future emails or to update your email preferences, click here.

Updates from Cloudera Fast Forward on new research, prototypes, and exciting developments

Welcome to the September edition of Cloudera Fast Forward's monthly newsletter. This month, we're happy to share some new research! This is the first example of a new research format we're experimenting with: more focussed and more frequent. Let us know what you think!

New research release today!

Meta-Learning

Meta-Learning report cover

In contrast to how humans learn, deep learning algorithms need vast amounts of data and compute and may yet struggle to generalize. Humans are successful in adapting quickly because they leverage their knowledge acquired from prior experience when faced with new problems. In this report we will explain how meta-learning can leverage previous knowledge acquired from data to solve novel tasks quickly and more efficiently during test time.

Our report, Meta-Learning, is freely available online and is accompanied by code that applies the technique to an image dataset.

Research preview: Structural time series

Our next research release is coming to your screens in October, and will examine the application of generalized additive models to time series problems using the excellent Prophet package.

Events

Our first ever Research Roundup webinar is in preparation for late October. We'll cover both the just-released Meta-Learning research and our upcoming Structural Time Series report. Watch our social accounts and your email for more info soon!

© 2020 Cloudera, Inc. All rights reserved.

5470 Great America Pkwy, Santa Clara, CA 95054

Welcome to the June edition of Cloudera Fast Forward's monthly newsletter. This month, alongside our regular recommended reading, we have two exciting research announcements!

New research: NLP for Question Answering

Here in the Fast Forward lab, we're always asking ourselves a lot of questions. Now we're asking BERT a lot of questions too! Our current research focus is question answering systems. In place of a report with all our learnings at the end of our research, we're inviting you to follow along as we explore building a question answering system using modern neural architectures. We just released our third blog in the series, and you can check out each of them below:

Intro to Automated Question Answering

This introductory post discusses what QA is and isn't, where this technology is being employed, and what techniques are used to accomplish this natural language task.

Building a QA System with BERT on Wikipedia

Follow along with this post to build a working Information Retrieval-based QA system, with BERT as the document reader and Wikipedia's search engine as the document retriever. This is a fun toy model that hints at potential real-world use cases.

Evaluating QA: Metrics, Predictions, and the Null Response

In this post, we look at how to assess the quality of a BERT-like model for Question Answering. We cover what metrics are used to quantify quality, how to evaluate your model using the Hugging Face framework, and the importance of the "null response" - questions that don't have answers - for both improved performance and more realistic QA output.
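For readers who want a feel for what such an evaluation computes, here is a minimal sketch of two metrics commonly reported for extractive QA, exact match and token-level F1, with the null response represented as an empty string. The choice of metrics and the simplified normalization (lowercasing and whitespace splitting) are assumptions for illustration; this is not the Hugging Face evaluation code the post walks through.

```python
from collections import Counter


def normalize(text):
    """Lowercase and split on whitespace (a simplified normalization)."""
    return text.lower().split()


def exact_match(prediction, truth):
    """Return 1 if the normalized prediction equals the normalized truth, else 0."""
    return int(normalize(prediction) == normalize(truth))


def f1(prediction, truth):
    """Token-level F1 between a predicted answer and the true answer."""
    pred, gold = normalize(prediction), normalize(truth)
    # Null response: if either side is empty, score 1.0 only when both are empty.
    if not pred or not gold:
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris", "paris"))  # case differences are normalized away
print(f1("in Paris France", "Paris"))  # partial credit for overlapping tokens
print(f1("Paris", ""))  # answering an unanswerable question scores zero
```

Treating the empty string as a legitimate answer is what lets a model be rewarded for abstaining on questions that have no answer in the passage.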

New report: Causality for Machine Learning

Our latest research report - Causality for Machine Learning - is live, and the webinar is available on demand!

Causality is an emerging area of focus in data science practice, especially when we want to make decisions based on our models. Causality provides a framework for understanding which statistical relationships are true, and which only appear to be true in some circumstances. Our report provides guidance on when and how we need to think about causality.

Even when a problem does not require causal reasoning, we can greatly improve the robustness and generalizability of our machine learning models by taking some lessons from causality. The report outlines techniques that enable machine learning models to perform well across diverse environments, including those they were not trained on. In particular, there are applications in natural language processing and computer vision, which we demonstrate in the accompanying prototype, Scene.

Data Name	Data Type	Options
Company	Text Box
First name	Text Box
Last name	Text Box
Title	Text Box
Email	Text Box
Middle name	Text Box
	checklist
	checklist	Yes, I would like to be contacted by Cloudera for newsletters, promotions, events and marketing activities. Please read our privacy and data policy.
	checklist	Yes, I consent to my information being shared with Cloudera's solution partners to offer related products and services. Please read our privacy and data policy.

Fast Forward Labs Sign Up Information

Email Address

Your Name

Your Address

Post-Registration Data

Validation

Membership Emails

Fast Forward Labs Dispatches

Introducing Anomagram - An Interactive Visualization of Autoencoders, Built with Tensorflow.js

A Symbiotic Relationship: Knowledge Graphs & Machine Learning

What is a knowledge graph?

A Familiar Example

Intersection of Machine Learning and Knowledge Graphs

Getting knowledge into a knowledge graph

Getting richer knowledge out of a knowledge graph

Final Thoughts

Recommended Reading

How front-end development can improve Artificial Intelligence.

Anti-Overfitting Techniques

ORL: Reinforcement Learning Benchmarks for Online Stochastic Optimization Problems

NLP's Clever Hans Moment has Arrived

Why are so many AI systems named after Muppets?

Work from Everest Pipkin's Data Gardens Class

Manifold

code2seq

Reformer: The Efficient Transformer

Upcoming Events

New research: NLP for Question Answering

New on the blog

Recommended reading

Events

Webinar: Deep Learning for Automated Question Answering

Upcoming Interpretability Webinar!

Bias in Knowledge Graphs - Part 1

image credit: Mediamodifier from Pixabay

Introduction

Motivation

image credit: bongkarn thanyakij from Pexels

Types of Bias

Bias Along the ML/Analytical Pipeline

Next Article

References

Notes

Recommended Reading

A Review of Neural Approaches to the Question Answering Task

Reliance on Metrics is a Fundamental Challenge for AI

A Primer in BERTology: What we know about how BERT works

DermGAN: Synthetic Generation of Clinical Skin Images with Pathology

(Podcast) Future of Programming: Orca with Devine Lu Linvega

Program design in the UNIX environment

Data Discovery Tools at Spotify

This is how AI bias really happens-and why it's so hard to fix

fastpages

Transformers are Graph Neural Networks

This neural net knows what smells good

Image source: xkcd.com

Our Recent Research

NLP for Question Answering

Causality for Machine Learning

Interpretability

Recommended reading

Events

Conference: Deep Learning for Anomaly Detection

New research release!

Structural Time Series

Research preview: Semantic Image Search

Recommended reading

Events

New research release!

Few-Shot Text Classification

Federated Learning open source

Deep Learning for Image Analysis

Recommended reading

Enterprise Grade ML

What is Enterprise Grade Machine Learning?

The current state of Enterprise Grade ML

Future state for Enterprise-Grade ML

Recommended Reading