Below is a sample of the emails you can expect to receive when signed up to Fast Forward Labs.
Across many business use cases that generate data, it is frequently desirable to automatically identify data samples that deviate from “normal.” In many cases, these deviations are indicative of issues that need to be addressed. For example, an abnormally high cash withdrawal from a previously unseen location may be indicative of fraud. An abnormally high CPU temperature may be indicative of impending hardware failure.
The task of finding these anomalies is broadly referred to as Anomaly Detection, and many excellent approaches have been proposed (clustering-based approaches, nearest neighbors, density estimation, etc.). However, as data become high dimensional, with complex patterns, many existing approaches (such as linear models, which mostly focus on univariate data) can be unwieldy to apply. For such problems, deep learning can help.
In a recent post on Medium, I introduced Anomagram, an interactive visualization of how autoencoders can be applied to the task of anomaly detection. Anomagram was created as both a learning tool and a prototype of what an ML product interface could look like. The interface is built with TensorFlow.js and allows install-free experimentation in the browser.
The first part of the interface introduces important concepts (autoencoders, data transformations, thresholds, etc.) paired with appropriate interactive visualizations. The second part (pictured below) is geared towards more technical users and allows you to design, train, and evaluate an autoencoder model entirely in the browser.
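To make the threshold idea concrete, here is a minimal Python sketch (not Anomagram's actual code, which runs on TensorFlow.js). It assumes a trained autoencoder has already produced a reconstruction error for each sample; the error values, the 95th-percentile cutoff, and the function names are all illustrative assumptions.

```python
import math


def mean_squared_error(original, reconstruction):
    """Reconstruction error between an input and the autoencoder's output."""
    return sum((x - y) ** 2 for x, y in zip(original, reconstruction)) / len(original)


def percentile(values, q):
    """Return the q-th percentile (0-100) of values, with linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    if lower == upper:
        return ordered[lower]
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def flag_anomalies(errors, threshold):
    """A sample is flagged when its reconstruction error exceeds the threshold."""
    return [error > threshold for error in errors]


# Hypothetical reconstruction errors for samples known to be normal.
normal_errors = [0.010, 0.012, 0.008, 0.015, 0.011,
                 0.009, 0.013, 0.014, 0.010, 0.012]

# Assumption: treat anything above the 95th percentile of normal errors as anomalous.
threshold = percentile(normal_errors, 95)

# Hypothetical errors for new, unlabeled samples.
new_errors = [0.011, 0.090, 0.013]
flags = flag_anomalies(new_errors, threshold)
```

Moving the percentile up or down trades missed anomalies against false alarms, which is the kind of trade-off a threshold control lets you explore.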
If you’re interested in learning more about other deep learning approaches to anomaly detection, my colleagues and I will cover additional details on this topic in our upcoming report on Deep Learning for Anomaly Detection. (Please join us for a webinar on this topic on February 13th at 10:00am PT!)
In the meantime, you can read my full article on Medium, view the full demo of Anomagram here, and find the project source code here.
For the past decade, humans have unknowingly come to depend on Knowledge Graphs on a daily basis. From personalized shopping recommendations to intelligent assistants and user-friendly search results, many of these accepted (and expected) features have come to fruition through the use of knowledge graphs. Despite their longstanding conceptual and practical existence, knowledge graphs were only added to the Gartner Hype Cycle for Emerging Technologies in 2018 and have continued to garner attention as an area of active research and development for their distinct ability to represent real-world relationships.
In this article, we’ll take a high-level look at what knowledge graphs are and explore a few ways they interact with the field of machine learning.
With all the hype comes confusion. In its simplest form, a knowledge graph is a set of data points linked by relations that describe a real-world domain. A cursory Google search will result in a myriad of explanations, but I believe there are a few core concepts that characterize a knowledge graph implementation.
A concrete and relatable application of knowledge graph technology is Google’s Knowledge Graph, which launched in 2012 and has become a relied-upon feature for all Google search users.
When searching for a specific person, Google provides users with a side panel that contains relevant information about the entity/subject in the query. This quick insight is made possible by Google’s Knowledge Graph - a pre-populated knowledge base of connected facts relating people, places, and things. Because the graph structure effectively represents this type of data by design, the facts seen above can be easily called upon to provide contextual insight.
Now that we have established a baseline intuition of what knowledge graphs are, let’s take a look at ways machine learning and knowledge graphs support each other.
Because knowledge graphs preserve relational information (and are therefore more complex than traditional data representations), the data they take in must be in a more refined state. Specifically, the entities (nodes) and the relationships (edges) between them must be identified and put into a consistent structure before they can populate a graph.
Let’s imagine a hand-crafted graph describing characteristics of Sir Alex Ferguson as seen above. Defining these entities and relationships is a simple endeavor for anyone knowledgeable of the English Premier League (EPL) and organizing the connections upfront allows the graph to be efficiently queried later on. But what happens if we want to create subgraphs for every manager in the EPL? Or every soccer manager in the world? Or every professional sports manager that ever existed?
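As a rough illustration of what "hand-crafted" means here, the sketch below stores a few facts as (subject, relation, object) triples and indexes them for lookup. It does not reproduce the original figure: the `TripleStore` class, the relation names, and the chosen facts are all assumptions made for the example.

```python
from collections import defaultdict


class TripleStore:
    """A tiny in-memory graph of (subject, relation, object) facts."""

    def __init__(self):
        self._by_subject = defaultdict(set)
        self._by_object = defaultdict(set)

    def add(self, subject, relation, obj):
        # Index every fact in both directions so either end can be queried.
        self._by_subject[subject].add((relation, obj))
        self._by_object[obj].add((relation, subject))

    def objects(self, subject, relation):
        """All objects o such that (subject, relation, o) is in the graph."""
        return sorted(o for r, o in self._by_subject.get(subject, ()) if r == relation)

    def subjects(self, relation, obj):
        """All subjects s such that (s, relation, obj) is in the graph."""
        return sorted(s for r, s in self._by_object.get(obj, ()) if r == relation)


# Hand-entered facts (illustrative relation names).
graph = TripleStore()
graph.add("Alex Ferguson", "managed", "Manchester United")
graph.add("Alex Ferguson", "born_in", "Glasgow")
graph.add("Manchester United", "plays_in", "Premier League")
graph.add("Pep Guardiola", "managed", "Manchester City")
graph.add("Manchester City", "plays_in", "Premier League")


def managers_in_league(store, league):
    """Two-hop query: league -> clubs -> managers."""
    clubs = store.subjects("plays_in", league)
    return sorted(m for club in clubs for m in store.subjects("managed", club))
```

Calling `managers_in_league(graph, "Premier League")` walks two edges to find every manager linked to the league. The query is cheap once the triples exist; entering them by hand is the part that does not scale.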
Identifying all of these relationships by hand is not scalable. This is where machine learning and Natural Language Processing (NLP) offer intelligent solutions to automatically curate raw data into usable facts. The general techniques involved include sentence segmentation, part-of-speech tagging, dependency parsing, word sense disambiguation, entity extraction, entity resolution, and entity linking applied to corpora of both structured and unstructured data.
The simplified example above is intended to highlight the general NLP process on a single sentence. In practice, organizations use more advanced, patented systems built on these underlying techniques to automatically extract information, resolve conflicting entities, and populate millions of entities into production knowledge graphs.
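To give a feel for the shape of such a pipeline, here is a deliberately naive Python sketch. It swaps the statistical techniques listed above for simple string rules: punctuation-based sentence splitting, phrase matching in place of dependency parsing, and a hand-written alias table in place of learned entity resolution. The patterns, aliases, and function names are hypothetical; a real system would rely on a trained NLP library.

```python
import re

# Hypothetical surface phrases mapped to relation labels.
RELATION_PATTERNS = {
    "managed": "managed",
    "was born in": "born_in",
}

# Hypothetical alias table standing in for entity resolution.
ALIASES = {
    "Sir Alex Ferguson": "Alex Ferguson",
    "Man United": "Manchester United",
}


def split_sentences(text):
    """Naive sentence segmentation: split after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip().rstrip(".!?") for p in parts if p.strip()]


def resolve(entity):
    """Map different surface forms of an entity to one canonical name."""
    entity = entity.strip()
    return ALIASES.get(entity, entity)


def extract_triples(text):
    """Turn raw text into (subject, relation, object) triples."""
    triples = []
    for sentence in split_sentences(text):
        for phrase, relation in RELATION_PATTERNS.items():
            subject, found, obj = sentence.partition(f" {phrase} ")
            if found and subject and obj:
                triples.append((resolve(subject), relation, resolve(obj)))
    return triples


text = "Sir Alex Ferguson managed Man United. Alex Ferguson was born in Glasgow."
triples = extract_triples(text)
```

Both sentences end up attached to the same "Alex Ferguson" node because the alias table merges the two spellings. Doing that merge reliably, at scale and without a hand-written table, is what entity resolution has to solve.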
“Increasingly we’re learning that you can make better predictions about people by getting all the information from their friends and their friends’ friends than you can from the information you have about the person themselves.”
- James Fowler, Connected
The quote above poses a justified, but unconventional approach to predictive modeling. Traditional machine learning focuses on modeling tabular data that inherently cannot represent all of the cascading relationships found within networks and knowledge graphs. This often means data scientists are left trying to abstract, simplify, and even leave out predictive relationships baked into a knowledge graph’s structure. But what if features of every node in a knowledge graph could be derived from the context of all the nodes and edges around them?
There are a few different methods for making use of connected features in machine learning, but a main area of attention is Knowledge Graph Embeddings (KGEs). The goal of KGEs is to learn a fixed-length vector representation of any given node in a graph based on its nearby connections. Drawing a quick parallel to the Word2Vec algorithm (and the concept of word embeddings) - where we learn a fixed vector representation for every word in a corpus based on nearby words - helps to frame the concept of KGEs. Specifically, the Node2Vec model expands upon ideas from Word2Vec by first running random walks from each node in a network to build a large number of node sequences [sentences]. Once we have a body of graph sequences [corpus], we can apply the Word2Vec methodology, just as we would to text sequences, to produce graph node embeddings.
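The walk-generation step can be sketched in a few lines of Python. Note the simplification: this version picks the next node uniformly at random, whereas Node2Vec biases its walks with two extra parameters. The toy graph, the function name, and the default values are assumptions for illustration.

```python
import random


def random_walks(adjacency, walks_per_node=2, walk_length=5, seed=0):
    """Generate uniform random walks; each walk plays the role of a 'sentence'."""
    rng = random.Random(seed)  # seeded so the walks are reproducible
    walks = []
    for start in sorted(adjacency):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:  # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# A small undirected toy graph stored as adjacency lists.
adjacency = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}

# The resulting list of node sequences is the 'corpus'.
walks = random_walks(adjacency)
```

The `walks` list can then be passed to any Word2Vec implementation (the one in gensim, for example) exactly as if each walk were a tokenized sentence, yielding one embedding vector per node.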
Ultimately by learning embedding representations from the full context of a knowledge graph, we can extract deeply rich features to be used in downstream tasks. A few uses are:
Knowledge graphs are an effective tool for modeling interconnected, real-world scenarios while retaining contextual details that are not easily captured with traditional data structures. In this article, we explored two examples that demonstrate the symbiotic relationship between knowledge graphs and machine learning, which only scratches the surface of the intersection between the two technologies. Additional concepts - like Graph Neural Networks and ML driven Entity Resolution - stand as exciting areas of research and application.
In keeping with our reputation as your data nerd friends, here’s a quick peek into what we’ve been reading lately:
This article from 2016 by our friend Ines at explosion.ai is still very much valid today. The role of front-end in data science is often restricted to visualization and dashboards. This is an enormous lost opportunity. It’s a personal resolution of mine to work more on interfaces for using and understanding machine learning systems in 2020. - Chris
This catalog of overfitting problems in ML models is both a pre-flight checklist for models, and a set of recipes to mitigate overfitting. It’s directed specifically to deep learning, but it’s applicable to other types of models as well. - Ryan
Deep reinforcement learning (DRL) shows promise for real-world problems! Amazon applied DRL (via OpenAI Gym) to canonical operations research/supply chain problems such as bin packing, newsvendor, and vehicle routing. They find that DRL beats or matches the baselines. Next step - can this work for real-world instances of these problems? (They think not yet). - Shioulin
This article questions the value of benchmark datasets for evaluating the true performance of NLP models. Some models may be exploiting shortcuts to obtain excellent scores while failing at the core of the task - in this case, reasoning and comprehension. - Victor
What may have started out as a bit of a joke actually highlights the importance of collaboration and respect within the open-source community towards the development of ML/AI products - characteristics that, as this article points out, are something Sesame Street fosters. One of my favorite excerpts from this article: “AI isn’t a discipline where lone scientists toil away in the lab at night, pumping electricity through processors, and cackling ‘It’s aliiiive’ over a glowing command line. (Disclaimer: this certainly does happen, but it’s not always the most productive approach.)” - Danielle
Everest Pipkin provides a behind-the-scenes look at a creative coding class they recently taught. The assignments are all interesting and the student work looks great. I especially liked the “folder structure as memory palace” prompt. - Grant
Uber recently released a visual debugging tool for machine learning - Manifold. It is a model monitoring and debugging tool which compares feature distributions across tabular data subsets. It is model agnostic and helps users determine what data slices a model fails on and the potential causes for certain performance issues. It also integrates with Jupyter Notebook. It will be interesting to watch this space and how the features for the subsequent versions of Manifold unfold! - Nisha
A really interesting project from 2019 called code2seq introduced a method for generating natural language sequences from the structured representation of source code. This research opens up opportunities for automated code documentation and summarization. - Andrew
Brand new on arXiv this month, “Reformer: The Efficient Transformer” shows how old dogs can still learn new tricks. The authors reimplement the now-standard Transformer architecture (introduced in the paper “Attention Is All You Need” and popularized by NLP models such as BERT) using Locality Sensitive Hashing, a long-standing tried-and-true technique for efficient look-up of similar items. This reduces the computational cost of attention and allows for longer sequences (e.g., sentences) to be used successfully. I love seeing classic techniques reinvented in modern algorithms! - Melanie
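For readers who have not met Locality Sensitive Hashing before, here is a minimal Python sketch of one classic variant, random-hyperplane hashing for angular similarity. It is a generic illustration and not necessarily the hashing scheme used in the Reformer paper; the function names, dimensions, and example vectors are assumptions.

```python
import random


def make_hyperplanes(dim, n_planes, seed=0):
    """Draw random hyperplane normals; each one contributes a bit to the hash."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def signature(vector, hyperplanes):
    """Hash a vector to a tuple of bits: which side of each hyperplane it lies on."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def bucket(vectors, hyperplanes):
    """Group named vectors by signature; similar directions tend to collide."""
    buckets = {}
    for name, vec in vectors.items():
        buckets.setdefault(signature(vec, hyperplanes), []).append(name)
    return buckets


planes = make_hyperplanes(dim=3, n_planes=8)

# Hypothetical vectors: 'b' points the same way as 'a', 'c' the opposite way.
vectors = {
    "a": [1.0, 2.0, 3.0],
    "b": [2.0, 4.0, 6.0],
    "c": [-1.0, -2.0, -3.0],
}
buckets = bucket(vectors, planes)
```

Vectors pointing in the same direction always share a bucket, so a similarity search only needs to compare items within a bucket rather than every pair. That is the source of the efficiency gain.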