Didn’t make it to Big Data Day LA 2015? No problem! Check out complete presentations from our presenters below!
SCENES FROM BIG DATA DAY LA 2015 / Check Out Our Presentations!
Keynote Speakers / 2015 Keynotes
Reynold Xin is an Apache Spark PMC member, Chief Architect for Spark, and co-founder of Databricks.
Speakers / 2015 Speakers
Organizers / 2015 Organizers
Volunteers / 2015 Volunteers
Abstract:- The City of Los Angeles, with 4 million residents and nearly 50 million visitors annually moving across 469 square miles, is not only one of the most densely populated cities, it also hosts one of the largest, most complex city infrastructures in the world. 6,000 miles of sewers underlie 22,000 miles of paved streets that connect over 4,500 intersections, 50,000 city-connected street lights, and 2,000,000 Google/Waze-connected sensors. This network of people and infrastructure is connected through the data and the systems that support it. As data transforms from an unstructured asset into the organizational wisdom that can drive this Smart City, the City of Los Angeles and the Office of Mayor Eric Garcetti work to identify new technologies and strategies for managing and harnessing the growing amount of data available to inform decision-making.
Abstract:- Netflix has a growing presence in Hollywood, with technical teams working on everything from high-speed video editing pipelines to machine learning methods for categorizing films. Data is foundational across these efforts, and in this talk Josh will take a tour through why we invest so much in data about content, what data engineering challenges we tackle, and the style in which we do it.
Abstract:- Sophisticated machine learning models (like GBMs and neural networks) produce better predictions than simpler models (like linear or logistic regression), but sophisticated models do not produce interpretable 'effects' that specify the relationship between predictors and outcome. This is because sophisticated models can learn non-linear, interactive, or even higher-level relationships between the predictors and outcome without those being explicitly specified. In many settings it is important to understand, as best as possible, how 'black box' models are producing their predictions, because: 1. If users do not understand how a prediction is being made, they may not trust the model/prediction enough to act upon the model's suggestions. 2. Significant business value can be derived from understanding what drives an outcome of interest (e.g. purchase or churn) in order to make product changes to accentuate or minimize desired effects. 3. Understanding how predictors relate to an outcome can inform subsequent feature generation that can improve a model's predictive power. This talk will discuss two methods that have been proposed to better understand machine learning models: simulating changes in input variables (the R ICEbox package) and building a simpler model locally around specific predictions (the Python LIME package).
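The first method in the abstract, simulating changes in input variables, can be sketched in a few lines. This is an illustrative toy, not the ICEbox package itself: `black_box_model` is a made-up function standing in for a trained GBM or neural network.

```python
# Illustrative sketch of the idea behind Individual Conditional Expectation
# (ICE) curves: hold all features of one observation fixed, sweep a single
# feature across a grid, and record the model's predictions. The model here
# is a made-up black box standing in for a trained GBM or neural network.

def black_box_model(x1, x2):
    # Hypothetical nonlinear model with an interaction between x1 and x2.
    return x1 * x1 + 3.0 * x1 * x2

def ice_curve(model, observation, feature_index, grid):
    """Predictions for one observation as a single feature is varied."""
    curve = []
    for value in grid:
        features = list(observation)
        features[feature_index] = value
        curve.append(model(*features))
    return curve

obs = (2.0, 1.0)                      # one observation: x1=2, x2=1
grid = [0.0, 1.0, 2.0, 3.0]           # values to sweep for x1
curve = ice_curve(black_box_model, obs, 0, grid)
print(curve)                          # [0.0, 4.0, 10.0, 18.0]
```

Plotting one such curve per observation reveals how the prediction responds to that feature, even when the model itself offers no interpretable coefficients.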
Abstract:- Artificial Intelligence is an important topic in the fight against cancer. Clinical trials are at the frontier of innovation. I will discuss techniques, data sets and platforms we use at Deep 6 to bring patients to clinical trials. The focus will be on practical, repeatable methods I've developed at MySpace, Greenplum, UCLA and the US Intelligence Community.
Abstract:- One of the challenges faced when deploying a machine learning project into production is how to build the real-time decision part of the system. Many open source projects exist to help construct machine learning pipelines, but you are often left to build your own custom server to enable decision making. Redis can be used in place of custom code to build your serving system.
Abstract:- In this talk, SAS Vice President and non-profit founder Jill Dyche revisits the customer journey. After all, in the age of omnichannel and digital everything, your customers are taking a different path than the one your Marketing department mapped out all those years ago. But big data's reach transcends the unstructured data of big corporations. Jill will explain how a personal mission led her to the realization that big data can be applied to many different journeys. She'll tell a story of how her work in the social sector with big data and analytics not only helps animal shelters refine outreach programs, but can save lives!
Abstract:- The Netflix data platform is constantly evolving, but at its core, it's an all-cloud platform at massive scale (60+ PB and over 700 billion new events per day), focused on enabling developers. In this talk, we'll dive into the current (data) technology landscape at Netflix, as well as what's in the works. We'll cover key technologies, such as Spark, Presto, Docker, and Jupyter, along with many broader data ecosystem facets (metadata, insights into jobs run, visualizing big data, etc.). Beyond just tech, we'll also dive a bit into our data platform philosophy. You'll leave with insights into how things work at Netflix, along with some ideas for re-envisioning your data platform.
Abstract:- What could a strategy consulting firm have to do with, or say about, big data? We see Big Data leading the way on new products, but also disrupting our clients' business processes and business models. For many clients and big data fans, the temptation is to think big data and machine learning disrupt the need for strategy: just throw the data in the lake with a bunch of programmers holding machine learning fishing poles, and we will be done. This talk is a rapid-fire review of what really happens across three client use cases. What did we learn working with these clients? Strategy still matters. Data is cheap; attention is not: while data and computational power are increasingly plentiful, people have limited attention and energy. Complexity can kill, not so much in the model itself but in how it affects processes and decisions. And data is not so cheap after all: we continue to underappreciate data architecture, governance, and engineering, which frequently take up most of the effort required for analytics success. Winning with Big Data is often less about the latest technology platform than about our strategy, culture, organizational capabilities, the way we implement algorithms, how we make decisions with data, and the impacts these have on employees and customers.
Abstract:- Everyone talks about how machine learning will transform business forever and generate massive outcomes. However, it's surprisingly simple to draw completely wrong conclusions from statistical models, and "correlation does not imply causation" is just the tip of the iceberg. The trend toward the democratization of data science further increases the risk of applying models in the wrong way. This session will discuss: how highly correlated features can overshadow the patterns your machine learning model is supposed to find, leading to models which will perform worse in production than during model building; how incorrect cross-validation leads to over-optimistic estimations of your model accuracy, in particular the impact of data preprocessing on the accuracy of machine learning models; and how feature engineering can lift simple models like linear regression to the accuracy of deep learning, while keeping the advantages of understandability and robustness.
Abstract:- Weibull reliability analysis predicts the life of products by fitting a distribution to a plot based on a population of units; multiple proprietary software applications are available to perform the analysis. The advent of Tableau + R integration empowers data scientists and reliability experts to make inferences about populations' failure characteristics by considering the beta value of the distribution. With beta, we plot F(t), or unreliability over time, when leveraging Tableau + R integration (R scripts in Tableau calculated fields, pointing to an R Server library for row-level execution). The Weibull analysis performed is superior to the Kaplan-Meier method as it enables the more accurate Maximum Likelihood Estimate (MLE) curve fitting of the plotted regression, as opposed to the Least Squares Estimate (LSE), which excludes R integration and fails to precisely match the parameters (shape, slope) that sophisticated existing reliability software packages produce. Application of Weibull for reliability analysis considers failure at a given time in the lifespan (t), where t = miles, cycles, hours, etc. The two-parameter distribution performed in this analysis includes beta and eta, or shape and scale parameters, respectively. Mean Time To Failure (MTTF) calculations are derived from these parameters as well. Variable Confidence Interval (CI) bands are used and can be adjusted using the interactive Tableau visualization. Industries utilizing Weibull analysis to plot the Bathtub Curve assess the infant mortality, normal useful life, and end-of-life failures anticipated for a product (e.g. semiconductor chips, automotive parts, medical devices).
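The two-parameter quantities named above have simple closed forms. A minimal sketch, using hypothetical shape and scale values (a real analysis would estimate beta and eta from failure data via MLE, as the abstract describes):

```python
import math

# Two-parameter Weibull quantities: F(t) is unreliability (fraction failed
# by time t); MTTF follows from the shape (beta) and scale (eta) parameters
# via the gamma function. The beta/eta values below are illustrative only.

def weibull_unreliability(t, beta, eta):
    """F(t) = 1 - exp(-(t/eta)^beta), the fraction failed by time t."""
    return 1.0 - math.exp(-((t / eta) ** beta))

def weibull_mttf(beta, eta):
    """Mean Time To Failure: eta * Gamma(1 + 1/beta)."""
    return eta * math.gamma(1.0 + 1.0 / beta)

beta, eta = 2.0, 1000.0               # hypothetical shape and scale
print(weibull_unreliability(eta, beta, eta))  # ~0.632: 63.2% failed at t = eta
print(weibull_mttf(beta, eta))                # ~886.2 (e.g. hours or cycles)
```

Note that F(eta) is always about 63.2% regardless of beta, which is why eta is often called the characteristic life.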
Abstract:- It's easy to collect millions of tweets, but not so easy to get the right ones! During my senior year of college, we built a tool that lets you "surf" through Twitter. We were able to catch on-the-fly hashtags and add them into our search query, automagically. We tested it during elections and sporting events for tweets in both Turkish and English.
Abstract:- The data marts and warehouses we work with often require us to think about how to scope our analytic questions based on the finite amount of storage allocated to these enterprise components. With new innovations in the cloud space, we can leverage the near-infinite storage capacities of Data Lake object storage and use this as a foundational source that can be combined with online data in the warehouse. In this talk we present reference architecture patterns based on Amazon Redshift Spectrum, a new technology enabling you to run MPP warehouse SQL queries against exabytes of data in a backing object store. With Redshift Spectrum, customers can extend the analytic reach of their SQL interactions beyond data stored on local disks in the data warehouse to query vast amounts of unstructured data in the Amazon S3 Data Lake, without having to load or transform any data.
Abstract:- Many firms are adopting a cloud-first strategy and are migrating their on-premises technologies to the cloud. Logitech is one of them. We have adopted the AWS platform and big data in the cloud for all of our analytical needs, including Amazon Redshift and S3. In this presentation, I will cover: the business rationale for migrating to the cloud; how data virtualization enables the migration; and running data virtualization itself in the cloud.
Abstract:- We will discuss how leveraging Control System Theory, which informed and led to the advent of flight control systems, is ushering in a new data-science framework for bringing automated insights to the enterprise. As organizations grow, managers start relying on data from several sources to make decisions concerning the organization and its customers. Organizations with a growing gap between desired results and actual results need a control system to better manage and predict results using their existing data. Control Theory is typically used in two forms: (1) the Goal Seeking model, for example an autopilot guiding an aircraft from A to B; we will show how an organization looking to increase profits can be modeled as a goal-seeking organization. (2) The Disturbance Rejection model, sometimes seen in temperature control systems in buildings; in the enterprise, an organization seeking to minimize costs can be modeled using disturbance rejection. Both of these concepts are based on sound scientific principles that can be used to model and control businesses. The talk will include two real-world cases of how Emad Hasan translated these ideas into data science applications at Facebook and PayPal which powered executive decision making, as well as how they can now be used by data scientists and managers around the world to improve insights and cues for business executives.
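The goal-seeking form can be sketched in a few lines. This is a generic proportional-feedback toy, not the speaker's actual models; the gain and the metric being steered are made-up values for illustration.

```python
# Minimal sketch of the goal-seeking idea: a proportional controller nudges
# a business metric (say, monthly profit) toward a target, analogous to an
# autopilot steering from A to B. Gain and dynamics are made up.

def goal_seeking(current, target, gain=0.5, steps=20):
    """Repeatedly apply a correction proportional to the remaining error."""
    for _ in range(steps):
        error = target - current
        current += gain * error       # control action proportional to error
    return current

result = goal_seeking(current=100.0, target=150.0)
print(result)                         # converges very close to 150.0
```

Each step shrinks the remaining error by the gain factor, so the metric converges geometrically toward the target; disturbance rejection adds an external perturbation term that the same feedback loop counteracts.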
Abstract:- BI and analytics are at the top of corporate agendas. Competition is intense, and, more than ever, organizations require fast access to insights about their customers, markets, and internal operations to make better decisions, often in real time. Enterprises face challenges powering real-time business analytics and systems of engagement (SOEs). Analytic applications and SOEs need to be fast and consistent, but traditional database approaches, including RDBMS and first-generation NoSQL solutions, can be complex, a challenge to maintain, and costly. Companies should aim to simplify traditional systems and architectures while also reducing vendors. One way to do this is by embracing an emerging hybrid memory architecture, which removes an entire caching layer from your front-end application. This talk discusses real-world examples of implementing this pattern to improve application agility and reduce operational database spend.
Abstract:- Traditional machine learning and feature engineering algorithms are not efficient enough to extract the complex and nonlinear patterns that are hallmarks of big data. Deep learning, on the other hand, helps translate the scale and complexity of the data into solutions like molecular interaction in drug design, the search for subatomic particles and automatic parsing of microscopic images. Co-locating a data processing pipeline with a deep learning framework makes data exploration and algorithm and model evolution much simpler, while streamlining data governance and lineage tracking into a more focused effort. In this talk, we will discuss and compare the different deep learning frameworks on Spark in a distributed mode, ease of integration with the Hadoop ecosystem, and relative comparisons in terms of feature parity.
Abstract:- How Warner Bros. is Using Elastic to Solve Entertainment and Media Problems at Scale: Warner Bros. processes billions of records each day globally between its web assets, digital content distribution, OTT streaming services, online and mobile games, technical operations, anti-piracy programs, social media, and retail point-of-sale transactions. Despite having large MPP clusters, a significant amount of dark data remained trapped in web logs. In this presentation, we will discuss how Warner Bros. leveraged the new Elastic 5 stack coupled with Apache Spark to deliver scalable insights and new capabilities to support business needs.
Abstract:- The Tone Knows is an Internet of Things marketing and advertising vendor and platform. We implement a next-generation, patent-pending Internet of Things advertising technology using Audio Beacons. These Audio Beacons, unlike proximity beacons, do not use Bluetooth or Wi-Fi to transmit to the mobile phone, and no hardware is required. Here is a short demonstration of an Audio Beacon program we did with music artist Ariana Grande: https://www.youtube.com/watch?v=rP36bCuA4kM In the demonstration, a high-frequency tone is embedded inside the music video on the laptop. When the audio tone goes off, the mobile phone (which has opted in and has its microphone turned on) receives the audio tone and is sent a hyperlink to ecommerce, a promotion, a contest, or anywhere else we send it. This is a way to connect TV and TV advertising, radio and radio advertising, YouTube video and banner advertising, from content to mobile smartphone consumers.
Abstract:- As a data scientist, I get to see a broad spectrum of the 'good', 'bad', and 'ugly' implementations of engineering and data practices. I'd be happy to share my tips and experiences with the broader community: the do's and don'ts of working with data in production, for collaboration, and for getting actionable insights.
Abstract:- Apache Cassandra is known as the go-to database for cloud applications requiring large amounts of data storage with elastic scalability across multiple data centers. Spark is an in-memory analytics framework that supports both realtime and batch processing, with extensions for streaming, machine learning, and SQL. Jeff Carpenter, Technical Evangelist at DataStax, will share how DataStax Enterprise puts these powerful technologies together to solve common use cases in domains including entertainment and IoT. We’ll explore architectures for intelligent applications that leverage DSE to provide real-time operational analytics.
Abstract:- As the entertainment industry faces a landscape of exponential opportunities and threats, it is quietly turning to artificial intelligence to manage risk, develop operational efficiencies, and make more data-driven decisions. From developing cognitive solutions to assess why we think certain films and characters are more interesting than others, to isolating granular, scene-level story and character mechanics that drive better box office returns, Hollywood has fully caught up with other industries in leveraging high-end analytics methods and tools. As the director of the Data & Analytics Project at USC's prestigious Entertainment Technology Center (created by George Lucas in 1993), Yves Bergquist sits at the center of this revolution. He and his team are developing next-generation AI tools and methods that are being deployed throughout the entertainment industry. Because his research is funded by all 6 Hollywood studios, and he personally answers to the CTOs of those studios, Yves has unique and powerful insight into how Hollywood is quietly using machine intelligence to take its hit-making game to the next level. What the audience will learn: the audience will go behind the scenes to discover precisely how Hollywood studios are using data, analytics and AI to make better development, production and distribution decisions. Yves will draw from his and his team's research and use case studies to lift the veil on how AI, game theory, and neuroscience are transforming audience intelligence, film development, and distribution strategies.
Abstract:- As data science continues to mature and evolve, the demand for more computationally extensive machines is rising. GPU computing provides the core capabilities that data scientists today are looking for, and when implemented effectively, it accelerates deep learning, analytics and other sophisticated engineering applications. During this talk, Armen Donigian, Data Science Engineer at ZestFinance, will introduce the GPU programming model and parallel computing patterns, as well as practical implications of GPU computing, such as how to accelerate applications on a GPU with CUDA (C++/Python), GPU memory optimizations and multi-GPU programming with MPI and OpenACC. As an example of how GPU programming can be implemented in real-life business models, Armen will present how ZestFinance has successfully tapped into the power of GPU computing for the deep learning algorithm behind its new platform, the Zest Automated Machine Learning platform (ZAML). Currently, ZAML is used by major tech, credit and auto companies to successfully apply cutting-edge machine learning models to their toughest credit decisioning problems. ZAML leverages GPU computing for data parallelism, model parallelism and training parallelism.
Abstract:- Hortonworks DataFlow (HDF) is built with the vision of creating a platform that enables enterprises to build dataflow management and streaming analytics solutions that collect, curate, analyze and act on data in motion across the datacenter and cloud. Do you want to be able to provide a complete end-to-end streaming solution, from an IoT device all the way to a dashboard for your business users with no code? Come to this session to learn how this is now possible with HDF 3.0.
Abstract:- The value of data is exponentially related to the number of people and applications that have access to it. The City of Los Angeles embraces this philosophy and is committed to opening as much of its data as it can in order to stimulate innovation, collaboration, and informed discourse. This presentation will be a review of what you can find and do on our open data portals as well as our strategy for delivering the best open data program in the nation.
Abstract:- Waze Carpool is the evolution of the Waze mission. If at first we wanted to help you save time by finding the fastest route to your destination, now we want to help you avoid traffic, by eliminating it altogether. Traffic is a simple problem: there are too many cars on the road with too many empty seats.
Abstract:- It is a rare occurrence to observe the rise of a new language amongst a population. It is an even more rare occurrence to observe the adoption of such a language on a global scale. Since the introduction of the emoji keyboard on iOS in 2011, the use of emojis in textual communication has steadily grown into a common vernacular on social media. As of April 2015, Instagram reported that nearly half of all text contained emojis and, in some countries, over 60% of texts contained emoji characters. For power users of social media as well as for marketers looking for audiences on these platforms, it is becoming increasingly imperative to capture emoji data and derive insight from its use; to better understand what intent or meaning the usage carries in the conversation. Jeff Weintraub, VP of Technology at theAmplify, a creative Brandtech Influencer Service and a subsidiary of You & Mr Jones, the World's First Brandtech Group, will briefly summarize the data science behind learning emoji representations and also present recent trends in emoji usage within the context of advertising and branded marketing campaigns on social media.
Abstract:- We have created a variety of analytics solutions combining data from our Data Lake with a traditional DW: data APIs fed into the product to improve conversions, a churn prediction algorithm to help account managers focus on high-risk customers, and analytics as an edge to empower the sales team to win prospective customers.
Abstract:- The talk will focus on how Neural Networks are applied in the field of NLP for tasks like classification. Building blocks like Word Embeddings, Recurrent NN, LSTM, GRU, Convolutional NN, Sentence Representation and how they are applied to a piece of text in Tensorflow will be covered. These building blocks can be stacked together in various ways to form deeper network architectures. We will discuss one such architecture which is used within GumGum Inc to do Sentiment Analysis on web pages using NN in Tensorflow.
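A toy illustration of the simplest building block chain named above: word embeddings pooled into a sentence representation, scored by a single linear unit. The embeddings and weights below are tiny made-up values; a real system would learn them (and use deeper stacks like LSTMs or CNNs) in TensorFlow.

```python
# Toy sketch: average word embeddings into a sentence vector, then score it
# with one linear unit. All vectors and weights are invented for the demo;
# a production sentiment model would learn these parameters.

EMBEDDINGS = {                        # 2-dimensional toy word vectors
    "great": [1.0, 0.5],
    "movie": [0.1, 0.0],
    "awful": [-1.0, -0.5],
}

def sentence_vector(tokens):
    """Average the embeddings of the known tokens (bag of embeddings)."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def sentiment_score(tokens, weights=(1.0, 1.0)):
    """Dot product of the sentence vector with a linear unit's weights."""
    v = sentence_vector(tokens)
    return sum(w * x for w, x in zip(weights, v))

print(sentiment_score(["great", "movie"]))   # positive score (~0.8)
print(sentiment_score(["awful", "movie"]))   # negative score (~-0.7)
```

Swapping the averaging step for a recurrent or convolutional encoder, while keeping the same input and output interfaces, is exactly the kind of architecture stacking the talk covers.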
Abstract:- The Honest Company is one of the leading national e-commerce brands, focused on promoting healthy and happy lives. One core challenge we face is measuring ROI across the numerous marketing acquisition channels and in turn, optimizing our budget allocation. Luckily, we have large amounts of data, both structured and unstructured, which is used to learn patterns and insights about how we acquire new customers. We leverage this data to build machine learning models for smart customer segmentation, which helps the acquisition team derive maximum ROI out of every dollar spent. In this talk, we will touch on some of our machine learning approaches and how we leverage data to predict attributes like customer lifetime value and customer churn rates, which are then used to optimize spend allocation.
Abstract:- Most real-world data science workflows require more than multiple cores on a single server to meet scale and speed demands, but there is a general lack of understanding when it comes to what machine learning on distributed systems looks like in practice. Gartner and Forrester do not consider distributed execution when they score advanced analytics software solutions. Much formal machine learning training occurs on single-node machines with non-distributed algorithms. In this talk we discuss why an understanding of distributed architectures is important for anyone in the analytical sciences. We will cover the current distributed machine learning ecosystem. We will review common pitfalls when performing machine learning at scale. We will discuss architectural considerations for a machine learning program, such as the roles of storage and compute and under what circumstances they should be combined or separated.
Abstract:- Optimizing online display advertising is a complicated task that often requires a combination of domain knowledge and intuition on the part of a marketer. But with so much advertising occurring within an algorithmic marketplace, at large scale, and under deadline, getting the most out of your advertising dollars is a job for data science. In this talk, we delve into a custom data science solution for optimizing Facebook advertising campaigns, and how that solution ultimately saves time and boosts profits far beyond what intuition alone can achieve.
Abstract:- Using Machine Learning to Identify Major Shifts in Human Gut Microbiome Protein Family Abundance in Disease: Inflammatory Bowel Disease (IBD) is an autoimmune condition that is observed to be associated with major alterations in the gut microbiome taxonomic composition. Here we classify major changes in microbiome protein family abundances between healthy subjects and IBD patients. We use machine learning to analyze results obtained previously from computing the relative abundance of ~10,000 KEGG orthologous protein families in the gut microbiome of a set of healthy individuals and IBD patients. We develop a machine learning pipeline, involving the Kolmogorov-Smirnov test, to identify the 100 most statistically significant entries in the KEGG database. Then we use these 100 as a training set for a Random Forest classifier to determine the ~5% of KEGGs which are best at separating disease and healthy states. Lastly, we developed a Natural Language Processing classifier of the KEGG description files to predict KEGG relative over- or under-abundance. As we expand our analysis from 10,000 KEGG protein families to one million proteins identified in the gut microbiome, scalable methods for quickly identifying such anomalies between health and disease states will be increasingly valuable for biological interpretation of sequence data.
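The Kolmogorov-Smirnov screening step in the pipeline reduces to one statistic: the maximum gap between two empirical CDFs. A minimal sketch with made-up abundance values (the real study computes this per KEGG family across cohorts):

```python
# Two-sample Kolmogorov-Smirnov statistic: the maximum vertical gap between
# the empirical CDFs of two samples (e.g. one protein family's relative
# abundance in healthy vs. IBD subjects). Toy data for illustration only.

def ks_statistic(sample_a, sample_b):
    """Max |ECDF_a(x) - ECDF_b(x)| over all observed values."""
    a, b = sorted(sample_a), sorted(sample_b)
    values = sorted(set(a) | set(b))
    d = 0.0
    for x in values:
        cdf_a = sum(v <= x for v in a) / len(a)
        cdf_b = sum(v <= x for v in b) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

healthy = [0.1, 0.2, 0.3, 0.4]        # hypothetical relative abundances
disease = [0.3, 0.4, 0.5, 0.6]
print(ks_statistic(healthy, disease)) # 0.5
```

Ranking the ~10,000 families by this statistic (with appropriate p-values) is how the 100 most significant entries would be short-listed before the Random Forest step.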
Abstract:- These days, deriving insights from data is imperative for successfully running any business. This is now a matter of survival in a cut-throat competitive market where cloud-based innovations have significantly lowered the barriers to entry. While larger enterprises with deep pockets can afford to build sophisticated analytical solutions, SMBs find it very difficult to build any. Even if SMBs build an analytical system on a low budget, they still get burnt as they produce low-quality metrics and eventually end up with a cost center. This presentation will showcase an end-to-end use case for SMBs in the retail industry to derive deep insights with just a few clicks, without the need to buy any expensive hardware or software or hire expensive technical personnel.
Abstract:- A walk-through of the architecture designed at Tronc for A/B testing multiple algorithms for delivering personalized content recommendations to 60 million readers a month.
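One common ingredient of such A/B-testing architectures is deterministic bucketing, sketched below. This is a generic illustration, not Tronc's implementation; the reader IDs and algorithm names are invented.

```python
import hashlib

# Illustrative sketch of deterministic A/B bucketing for a multi-algorithm
# recommender test: hash a reader ID into one of N variants so each reader
# consistently sees the same algorithm across visits, with no lookup table.

def assign_variant(reader_id, variants):
    """Stable hash-based assignment of a reader to a test variant."""
    digest = hashlib.md5(reader_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

algos = ["collab_filter", "content_based", "trending"]  # hypothetical names
print(assign_variant("reader-12345", algos))  # same reader, same variant
```

Because the assignment is a pure function of the ID, it scales to tens of millions of readers without shared state, and variant shares stay roughly uniform.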
Abstract:- Data Scientists come in all shapes, sizes, and personalities, from perhaps a more diverse set of academic and industrial backgrounds than other jobs in tech. This talk explores ways to hire a team with complementary skill sets and backgrounds, the obvious and not-so-obvious benefits of diversity, and challenges teams face when learning to work together.
Moderator: Stuart McCormick. Panelists: John Sullivan and Suresh Paulraj.
Abstract:- Artificial intelligence blurs the lines between sales and marketing by enabling humans to leverage the power of Big Data to command and control every customer's journey. A new category of technology called "intelligent demand generation" enables marketers to explain and predict buyer behavior with unprecedented precision and speed. These capabilities reframe how companies go to market by enabling microtargeting of customers with context-specific content marketing. Early adopters of intelligent demand generation technologies are realizing more than 500% return on investment within 2 months. Three-time AI startup founder and CEO of LeadCrunch describes how his company developed military targeting technology then modified it to make commercial sales teams more efficient.
Abstract:- We have lots of data at Factual. But to solve some of our harder problems, we need to get down and dirty with our data -- to examine, evaluate, and experience it (sometimes even smell it). This talk attempts to re-brand this kind of work with the new, alternative buzzword, "Artisanal Data". I review the "Artisanal Data" technologies and techniques we use at Factual, including how we document experiments so that they get read, evaluate failure modes and judge successes, and keep our annotation data as accurate as possible. With the right statistical precautions, Artisanal Data can be used to communicate the impact of our data more effectively and emotionally.
This year we are excited to add something new to Big Data Day LA. In partnership with TenOneTen Ventures, we are bringing together some of the best data-driven startups in Southern California! Five startups will have the opportunity to pitch to a panel of judges ranging from VCs to data experts. The winner will receive a $1,500 cash prize, $1,000 in MongoDB Atlas DBaaS credit, and 3 strategy sessions with VCs.
Abstract:- There are some key trends emerging in 2017 within the Hadoop and Big Data ecosystem, which center around the increasing use of cloud. These trends are the underpinning for a larger shift toward purpose-driven products positioned at the core of an organization's data strategy. Certainly there are examples of this in early adopters that are mature in their deployments, but what about those more traditional end-user organizations in the midst of a digital transformation? How does the relationship between IT and the growing needs of data science align with business development? What are their views on these trends? How are their organizations reconciling the needs and desires to pave a path forward? In this session, Roman will present ODPi's findings and end-user views of Big Data trends based on data from the ODPi End User Advisory Board (TAB). Audiences will get real end-user perspectives from companies such as GE about how they are using Big Data tools, the challenges they face and where they are looking to focus investments - all from a vendor-neutral viewpoint.
Abstract:- The truth about enabling self-service (and why you need it) Data is growing astronomically, historically and in real-time. So is the need for exploration and discovery. One size doesn’t fit all. We’ll be covering how to efficiently deliver information on-demand and promote self-service adoption with the right data platform.
Abstract:- Discussion of various applied cases of artificial intelligence solutions stemming from our venture work across verticals including consumer, financial and media.
Abstract:- As IT evolves from a cost center to a true nexus of business innovation, data engineers, platform engineers and database admins need to build the enterprise of tomorrow. One that is scalable, and built on a totally self-service infrastructure. Having an agile, open and intelligent big data platform is key to this transformation. Join Qubole for an overview of the 5 stages of transformation that companies need to follow. Discover how to create a data-driven culture. Hear from the co-author of Apache Hive as he shares how Facebook and others became data-insights driven.
Abstract:- Algorithmic innovations like NUTS and ADVI, and their inclusion in end-user probabilistic programming systems such as PyMC3 and Stan, have made Bayesian inference a more robust, practical and computationally affordable approach. I will review inference and the algorithmic options before describing two prototypes that depend on these innovations: one that supports decisions about consumer loans and one that models the future of the NYC real estate market. These prototypes highlight the advantages and use cases of the Bayesian approach, which include domains where data is scarce, where prior institutional knowledge is important, and where quantifying risk is crucial. Finally I'll touch on some of the engineering and UX challenges of using PyMC3 and Stan models not only for offline tasks like natural science and business intelligence, but in live end-user products.
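The scarce-data advantage can be shown with the simplest possible Bayesian model. This is a conjugate toy, not one of the speaker's prototypes: PyMC3 and Stan exist precisely for models without closed-form posteriors, but the Beta-Binomial case below has an exact answer, using a made-up loan-default scenario.

```python
# Minimal conjugate Bayesian update: estimating a loan default rate from a
# small sample plus an institutional prior. Beta(alpha, beta) prior on the
# rate, Binomial likelihood; the posterior is Beta again, in closed form.

def beta_binomial_posterior(alpha, beta, defaults, loans):
    """Posterior Beta parameters after observing `defaults` in `loans`."""
    return alpha + defaults, beta + (loans - defaults)

# Hypothetical prior belief: default rate around 2%, i.e. Beta(2, 98).
a, b = beta_binomial_posterior(alpha=2.0, beta=98.0, defaults=3, loans=50)
posterior_mean = a / (a + b)
print(a, b, posterior_mean)           # 5.0 145.0, mean 1/30 (~3.3%)
```

With only 50 observations, the prior keeps the estimate well below the raw 6% sample rate, and the full posterior distribution quantifies the remaining risk, which is the point the abstract makes about scarce data and institutional knowledge.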
Abstract:- Big data has become a hot topic in recent years. It promotes the understanding and exploitation of data and guides decision-making in many sectors. The health sciences are also being shaped by innovative applications of big data. Our study group in the Department of Preventive Medicine at the Keck School of Medicine of the University of Southern California aims to build a big data architecture that combines and analyzes data about people from different sources and provides health-related assessments back to them. Specifically, ecological momentary assessments (EMAs), electronic medical records (EMRs), and real-time air quality monitor data for children with a pre-existing asthma diagnosis are collected and fed into machine learning models. An asthma exacerbation alert is generated and delivered to the children before the exacerbation happens. The machine learning model was built and tested in a similar study. The study population consists of children from a cohort of the prospective, population-based Children's Health Study followed from 2003-2012 in 13 Southern California communities. Potential risk factors were grouped into five broad categories: sociodemographic factors, indoor/home exposures, traffic/air pollution exposures, symptoms/medication use, and asthma/allergy status. The outcome of interest, assessed via annual questionnaire, was the presence of bronchitic symptoms over the prior 12 months. A gradient boosting model (GBM) was trained on data consisting of one observation per participant in a random study year, for a randomly selected half of the study participants. The model was validated using hold-out test data obtained in two complementary ways: (within-participant) a random later year for the same participants and (across-participant) a random year for participants not included in the training data.
The predictive ability of risk factor groupings was evaluated using the area under the receiver operating characteristic curve (AUC) and accuracy. The predictive ability of individual risk factors was evaluated using relative variable importance. The predictor-outcome relationship was visualized using partial dependence plots, and interaction effects were identified using the H-statistic. The gradient boosting model offers a novel approach to better understanding predictive factors for chronic upper respiratory illnesses such as bronchitic symptoms.
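The AUC used for validation above has a simple rank-based interpretation: the probability that a randomly chosen positive case is scored above a randomly chosen negative case. A minimal, illustrative implementation (not the study's code) makes this concrete:

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank interpretation: the chance a
    random positive is scored above a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy check: a perfectly separating score gives AUC = 1.0,
# while an uninformative score gives 0.5.
labels = [0, 0, 1, 1]
scores = [0.1, 0.2, 0.8, 0.9]
print(auc(labels, scores))  # 1.0
```

In practice a library routine (e.g. scikit-learn's `roc_auc_score`) would be used; the O(pos x neg) loop here is only for clarity.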
Abstract:- A colleague of mine once asked, "Why would you ever need over 100k observations?" It's surprising how many models we develop don't need much data to reach a good level of accuracy. In this talk we'll discuss how Activision leverages large datasets and feature spaces from the Call of Duty series to build complex models. We'll also talk about how to transfer these learnings to more digestible simple models and how accuracy in these models translates to usable in-game action. Finally, we'll showcase some of our model development pipeline and share thoughts on when you really do need millions or billions of data points to make substantial improvements to the game.
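The idea of transferring learnings from a complex model into a digestible simple one can be sketched as distillation: fit the simplest possible "student" to mimic a complex "teacher". The example below is entirely invented (the teacher, feature names and numbers are hypothetical, not Activision's models):

```python
# Toy distillation sketch: a "complex" teacher labels players, and we fit
# the simplest possible student (one threshold on one feature) to mimic it.

def teacher(kd_ratio, accuracy):
    # Stand-in for a large model's risk label (hypothetical weights).
    return 1 if (0.7 * kd_ratio + 0.3 * accuracy) < 0.5 else 0

players = [(0.2, 0.3), (0.4, 0.2), (0.9, 0.8), (1.2, 0.6), (0.3, 0.9)]
labels = [teacher(kd, acc) for kd, acc in players]

def agreement_at(thresh):
    """Fraction of players where the one-rule student matches the teacher."""
    preds = [1 if kd < thresh else 0 for kd, _ in players]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Pick the kd_ratio threshold that best matches the teacher's labels.
best = max((kd for kd, _ in players), key=agreement_at)
print(best, agreement_at(best))  # 0.9 1.0
```

The student ("flag players with kd_ratio below 0.9") is something a game designer can act on directly, which is the point of moving from complex to simple models.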
Abstract:- Today's enterprises produce data not only in high volume but also at high velocity, and with velocity comes the need to process the data in real time. To meet these real-time needs, we developed and deployed Heron, the next-generation streaming engine at Twitter. Heron processes billions and billions of events per day and has been in production at Twitter for nearly 3 years, providing unparalleled performance at large scale and meeting Twitter's strict performance requirements for various streaming and IoT applications. Heron is an open source project with several major contributors from various institutions. As the project matured, we identified and implemented several optimizations that improved throughput by an additional 5x and reduced latency by 50-60%. In this talk, we will describe Heron in detail and show how detailed profiling pointed to performance bottlenecks such as repeated serialization/deserialization and immutable data structures. After mitigating these costs, we were able to achieve much higher throughput and latencies as low as 12 ms.
Abstract:- Apache Kafka evolved from an enterprise messaging system into a fully distributed streaming data platform for building real-time streaming data pipelines and streaming applications, without the need for separate tools/clusters for data ingestion, storage and stream processing. In this talk you will learn about: a quick introduction to Kafka Core, Kafka Connect and Kafka Streams through code examples, key concepts and key features; a reference architecture for building such Kafka-based streaming data applications; and a demo of an end-to-end Kafka-based streaming data application.
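The core pattern Kafka Streams formalizes is consume-transform-produce with local aggregation state. The toy, in-memory Python sketch below is not the Kafka API (topics are plain lists and all names are invented), but it shows the shape of a stateful streaming word count, where every update to the count is itself emitted downstream:

```python
# Toy consume-transform-produce sketch (NOT the Kafka API): a "topic" is a
# list of records; the processor consumes one topic, updates a local state
# store, and produces a changelog of updated counts to another topic.
from collections import defaultdict

input_topic = ["spark", "kafka", "kafka", "streams"]
output_topic = []                # changelog of (key, updated count)
state_store = defaultdict(int)   # local aggregation state

for record in input_topic:                              # consume
    state_store[record] += 1                            # stateful transform
    output_topic.append((record, state_store[record]))  # produce

print(output_topic)
# [('spark', 1), ('kafka', 1), ('kafka', 2), ('streams', 1)]
```

In real Kafka Streams the state store is fault-tolerant and partitioned, and the changelog topic is what lets another application (or a recovering instance) rebuild the counts.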
Abstract:- Secure concurrent access to data from on-demand compute clusters is a huge organizational and technical challenge. Few people possess the skills, experience and security expertise to create and control a clean cloud environment equipped with state-of-the-art data science, analytics and visualization technologies. Commercial products are often late with features, deliver unknown quality and security guarantees, and carry hefty license fees. Developed by EPAM engineers and publicly available under ALv2, the DLab framework addresses all these concerns by providing an off-the-shelf, simple-to-use platform. DLab allows anyone to set up a completely secure cloud environment with data science notebook software, a scalable cluster solution and a powerful compute engine based on Apache Spark.
Abstract:- Today, in many data science projects, the sole focus is the complexity of the algorithms being used to address the data problem. While this is a critical consideration, without attention to how the resulting insights can be disseminated through the broader enterprise, many projects end up dying on the vine. This presentation will highlight not only that a turnkey model operationalization strategy is critical to the success of enterprise data science projects, but also how it can be achieved using Spark. Today Spark enables data scientists to perform sophisticated analyses using complex machine learning algorithms; even when dataset sizes are measured in terabytes, Spark provides a broad selection of machine learning algorithms that scale effortlessly. However, the current process for the business to leverage the results of these analyses is far less sophisticated. Indeed, results are frequently communicated by PowerPoint presentation rather than through a turnkey solution for deploying improved models into production. In this session, we discuss the current challenges associated with operationalizing these results, including the shortcomings of model serialization standards such as PMML for expressing the complex pre- and post-processing of data that is critical to effortless operationalization. Finally, we discuss in detail the potential for turnkey model operationalization with the emerging PFA standard, and highlight how PFA can be used with Spark, including how PFA model scoring can be supported using Spark Streaming and our efforts to drive support for PFA model export into MLlib.
Abstract:- Guitar Center is the largest retailer of musical instruments in the US, with over 275 stores and annual revenue of over $2.1B. My talk will focus on how data is leveraged here to drive optimal decision making. We have a data warehouse that collects information on transactions, traffic, products, inventory, customers and much more from stores and online. During the talk, I will walk through various insights that we have derived by analyzing the data. In one project, we blended Experian-provided household-level data with our internal data to build customer profiles of purchasers of high-end guitars. We used this information, together with drive-time analysis, to pick stores in which to build Platinum rooms exclusively dedicated to high-end guitars. I will also run through how we use extensive experimentation to test strategies before a chain-wide rollout: we pick a few stores to pilot, use KNN to select like stores, and perform a pre/post analysis to evaluate lift and its statistical significance.
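The "like store" selection step can be sketched with plain k-nearest-neighbors over store feature vectors. The stores, features and numbers below are invented for illustration (not Guitar Center's data):

```python
import math

# Hypothetical sketch: given a feature vector per store (e.g. revenue,
# traffic, high-end guitar sales), find the k most similar non-pilot stores
# to a pilot store, to serve as its pre/post comparison group.
stores = {
    "pilot_A": (100, 50, 9),
    "s1": (98, 52, 8),
    "s2": (40, 10, 1),
    "s3": (105, 47, 10),
    "s4": (60, 30, 4),
}

def k_nearest(target, candidates, k):
    # Rank candidates by Euclidean distance to the target store.
    ranked = sorted(candidates,
                    key=lambda name: math.dist(stores[target], stores[name]))
    return ranked[:k]

control = k_nearest("pilot_A", [s for s in stores if s != "pilot_A"], k=2)
print(control)  # ['s1', 's3'] - the two most similar stores
```

Lift is then measured as the pilot store's pre/post change relative to the same change in its matched control group; in practice features would be normalized before computing distances.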
Abstract:- Of all the things that delight developers, none is more attractive than a set of APIs that make them productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers such APIs across components including Spark SQL, Streaming, Machine Learning, and Graph Processing, for operating on large datasets in Scala, Java, Python, and R and doing distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs - RDDs, DataFrames, and Datasets - available in Apache Spark 2.x. In particular, I will emphasize why and when you should use each set, outline best practices and their performance and optimization benefits, and underscore the scenarios in which DataFrames and Datasets should be used instead of RDDs for your distributed big data processing. Through simple notebook demonstrations with API code examples, you'll learn how to process big data using RDDs, DataFrames, and Datasets, and how to interoperate among them.
Abstract:- See the VariantSpark library in action on a Databricks Jupyter notebook.
Abstract:- In this talk, we propose a generalized machine learning framework for e-commerce businesses. The framework is responsible for over 30 different user-level predictions, including lifetime value, recommendations, churn, engagement and lead scoring. These predictions provide a vital layer of intelligence for a digital marketer. Kinesis is used to capture browsing information from over 120M users across 100 companies (both in-app and web). A data processing and feature engineering layer is built on Apache Spark; these features provide inputs to predictive models for business applications. Separate models for churn, lifetime value, product recommendation and search are written on Spark, and they can be plugged into any marketing campaign for any integrated e-commerce company, making the system generalizable. We finally present a monitoring system for machine learning called RS Sauron, which provides more than 200 objective metrics measuring the health of predictive models and depicts KPIs for model accuracy on a continual basis.
Abstract:- Learn how to leverage your data science skills to earn a fortune as a freelance quant in your spare time. At Quantiacs, we run the world's largest quant finance algorithm competition and host the world's only marketplace for quantitative trading algorithms. We provide 26 years of free financial data, an open-source toolkit in multiple languages, and access to investment capital. The audience will learn about the future of quantitative finance and how to cash in.
Abstract:- Quark is a new Scala DSL for performing high-performance data processing and analytics on sources of structured and semi-structured data. Built with an advanced optimizing analytics compiler, Quark can push computation down to any supported data source, offering performance typically seen only with hand-written code. Quark's type-safe API allows developers to create correct pipelines and analytics, while its flexibility allows direct manipulation of semi-structured data formats like JSON and XML. Quark has native support for MongoDB, MarkLogic, Couchbase, and any data source with a Spark connector.
Abstract:- Blizzard, the creator of leading games such as Hearthstone, World of Warcraft, and Overwatch, has BIG data. In this session we will give a glimpse into some of the awesome things we are doing at Blizzard to make our games better through big data.
Abstract:- IoT technology will allow big data structures to evolve from static, offline repositories of digital knowledge into online representations of our current world. IoT will turn the techniques big data uses to identify trends and forecast the future into operationally enabled data structures that allow us to manage our digital environment to maximal advantage. The road to this reality has several hurdles that must first be overcome, among them trust, privacy, discovery, and behavioral economics. These issues will be discussed in the context of a large city operations network, and potential options for overcoming these hurdles will be offered.
Abstract:- This session presents a Hadoop DataSource implementation for integrating and joining Big Data with Master Data in RDBMS.
Abstract:- Companies are adopting big data to perform high-velocity, real-time analytics on very large volumes of data, enabling rapid self-service analysis for business users and never-before-realized use cases. However, such projects have often yielded limited value because these big data systems have become siloed from the rest of the enterprise systems holding critical business operational data. Big Data Fabric is a modern data architecture combining data virtualization, data prep, and lineage capabilities to seamlessly integrate, at scale, these huge siloed volumes of structured and unstructured data with other enterprise data assets. This presentation will use proven customer case studies in big data and IoT to demonstrate the value of a big data fabric as a logical data lake for big data analytics.
Abstract:- An IoT market overview and Verizon's focus on specific IoT verticals (AgTech, Energy, Share, etc.); criteria for evaluating IoT data analytics opportunities; platform considerations for big data solutions (security, network and platform connectivity, data analytics processing/storage, applications, etc.); and examples of a few big data solutions at Verizon.
Abstract:- There is an urgent need in pediatric ICUs to collect, store and transform healthcare data to make accurate and timely predictions about patient outcomes and treatment recommendations. We are currently heavily invested in open source big data stacks to achieve this goal and help our young ones. In this talk I will highlight how we manage structured and unstructured high-frequency data generated by a disparate set of devices and systems, and how we have created data pipelines to process the data and make it available to data scientists and app developers.
Abstract:- Elastic has released a commercial machine learning plugin that lets you model your time series data using an unsupervised machine learning approach. We'll walk through a few common use cases to see how this plugin can help with finding anomalies in your data.
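To convey the intuition behind time-series anomaly detection, here is a deliberately simple sketch: flag points that deviate sharply from a trailing window's mean. Elastic's plugin uses far more sophisticated unsupervised models; this stdlib-only example (with invented data) only illustrates the general idea of "model the recent past, flag what doesn't fit":

```python
import statistics

def anomalies(series, window=5, threshold=3.0):
    """Flag indices whose value lies more than `threshold` standard
    deviations from the mean of the preceding `window` points."""
    flagged = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu = statistics.mean(hist)
        sigma = statistics.stdev(hist) or 1e-9  # guard a constant window
        if abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

data = [10, 11, 10, 12, 11, 10, 11, 95, 10, 11]  # spike at index 7
print(anomalies(data))  # [7]
```

Note the weakness of this naive approach: once the spike enters the trailing window it inflates the estimated spread, masking nearby anomalies, which is one reason production systems use robust or model-based baselines instead.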
Abstract:- Telecommunications service providers (or telcos) have access to massive amounts of historical and streaming data about subscribers. However, it often takes them a long time to build, operationalize and gain value from various machine learning and analytic models. This is true even for relatively common use-cases like churn prediction, purchase propensity, next topup or purchase prediction, subscriber profiling, customer experience modeling, recommendation engines and fraud detection. In this talk, I shall describe our approach to tackling this problem, which involved having a pre-packaged set of analytic pipelines on a scalable Big Data architecture that work on several standard and well known telco data formats and sources, and that we were able to reuse across several different telcos. This allows the telcos to deploy the analytic pipelines on their data, out of the box, and go live in a matter of weeks, as opposed to the several months it used to take if they started from scratch. In the talk, I shall describe our experiences in deploying the pre-packaged analytic pipelines with several telcos in North America, South East Asia and the Middle East. The pipelines work on a variety of historical and streaming data, including call data records having voice, SMS and data usage information, purchase and recharge behavior, location information, browsing/clickstream data, billing and payment information, smartphone device logs, etc. The pipelines run on a combination of Spark and Unscrambl BRAIN™, which includes a real-time machine learning framework, a scalable profile store based on Redis and an aggregation engine that stores efficient summaries of time-series data. I shall describe some of the machine learning models that get trained and scored as part of these pipelines.
I shall also remark on how reusable certain models are across different telcos, and how a similar set of features can be used for models like next topup or purchase prediction, churn prediction and purchase propensity across similar telcos in different geographies.
Abstract:- Capital One receives thousands of legal requests every year, often as physical mail. During this talk, we'll dive into how the Center for Machine Learning at Capital One has built a self-contained platform for summarizing, filtering and triaging these legal documents using Apache projects.
Abstract:- Building distributed systems is challenging. Luckily, Apache Kafka provides a powerful toolkit for putting together big services as a set of scalable, decoupled components. In this talk, I'll describe some of the design tradeoffs when building microservices, and how Kafka's powerful abstractions can help. I'll also talk a little bit about what the community has been up to with Kafka Streams, Kafka Connect, and exactly-once semantics.
Abstract:- It's easier than ever to power serverless architectures with managed database services like MongoDB Atlas. In this session, we will explore the rise of serverless architectures and how they've rapidly integrated into public and private cloud offerings. We will demonstrate how to build a simple REST API using AWS Lambda functions, create a highly available cluster in MongoDB Atlas, and connect both via VPC Peering. We will then simulate load, use the monitoring and scaling features of MongoDB Atlas, and use MongoDB Compass to browse our database.
Abstract:- This session will explore how DC/OS and Mesos are being used at Esri to establish a foundational operating environment that enables the consumption of high-velocity IoT data using Apache Kafka, streaming analytics using Apache Spark, high-volume storage and querying of spatiotemporal data using Elasticsearch, and recurring batch analytics using Apache Spark and Metronome. Additionally, Esri will share their experience making their DC/OS application portable so that it can easily be deployed across public cloud providers (Microsoft Azure, Amazon EC2), private clouds and on-premise environments. Demonstrations will be performed throughout the presentation to cement these concepts for attendees. All demos will be available in a public GitHub repo.
Abstract:- The talk covers: 1. properties of IoT data and the database requirements for handling it; 2. an introduction to GridDB, a purpose-built NoSQL database for IoT; 3. IoT and time-series data modeling; 4. real-world deployed IoT use cases.
Abstract:- Machine learning models may be very powerful, but many data sets are only released in aggregated form, precluding their direct use. Various heuristics can be used to bridge the gap, but they are typically domain-specific. The data augmentation (DA) algorithm, a classic tool from Bayesian computation, can be applied more generally. We will present a brief review of DA and how to apply it to disaggregation problems. We will also discuss a case study on disaggregating daily pricing data, along with a reference implementation as an R package.
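To make the DA idea concrete, here is a toy disaggregation sketch in Python with invented numbers (not the talk's pricing case study or its R package): counts for two products, A and B, are released separately in some weeks but only as a combined total in others. DA alternates (1) imputing the latent A/B split of each aggregated week and (2) a conjugate posterior draw for the weekly rates given the completed data:

```python
import random

random.seed(1)
observed = [(30, 10), (28, 12), (33, 9)]   # weeks with the split observed
aggregated = [42, 38, 40]                  # weeks released only as A + B

lam_a, lam_b = 1.0, 1.0                    # weekly Poisson rates for A, B
draws = []
for step in range(2000):
    # (1) Augmentation: impute A's share of each aggregated total,
    #     Binomial(total, lam_a / (lam_a + lam_b)).
    share = lam_a / (lam_a + lam_b)
    imputed = [sum(random.random() < share for _ in range(t))
               for t in aggregated]
    # (2) Conjugate Gamma(0.01, rate 0.01) posterior draws for the rates,
    #     treating the imputed splits as observed data.
    n = len(observed) + len(aggregated)
    sum_a = sum(a for a, _ in observed) + sum(imputed)
    sum_b = sum(b for _, b in observed) + sum(aggregated) - sum(imputed)
    lam_a = random.gammavariate(0.01 + sum_a, 1 / (0.01 + n))
    lam_b = random.gammavariate(0.01 + sum_b, 1 / (0.01 + n))
    draws.append(lam_a)

mean_a = sum(draws[500:]) / len(draws[500:])
print(round(mean_a, 1))  # roughly 30: A's rate, recovered from partial data
```

The weeks with an observed split identify the A/B proportion, and DA propagates that information into the aggregated weeks while honestly reflecting the imputation uncertainty in the posterior draws.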
Abstract:- Organizations commonly use Big Data computation frameworks like Apache Hadoop MapReduce or Apache Spark to gain actionable insight from their large amounts of data. Often, these analytics take the form of data processing pipelines: a series of stages where each stage performs a particular function and the output of one stage is the input of the next. Examples include log processing, IoT pipelines, and machine learning; the common attribute among them is the sharing of data between stages. It is also common for data pipelines to process data stored in the public cloud, such as Amazon S3, Microsoft Azure Blob Storage, or Google Cloud Storage. The global availability and cost effectiveness of these public cloud storage services make them the preferred storage for data. However, running pipeline jobs while sharing data via cloud storage can be expensive in terms of increased network traffic, and it slows data sharing and job completion. Using Alluxio, a memory-speed virtual distributed storage system, enables sharing data between stages or jobs at memory speed: by reading and writing data in Alluxio, the data can stay in memory for the next stage of the pipeline, resulting in significant performance gains. In this talk, we discuss how Alluxio can be deployed and used with a data processing pipeline in the cloud. We show how pipeline stages can share data through Alluxio memory for improved performance, and how Alluxio improves completion times and reduces performance variability for pipelines in the cloud.
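The benefit described above boils down to a memory tier sitting in front of slow object storage. The following minimal, hypothetical sketch (plain Python dicts standing in for Alluxio and a cloud store; all names invented) shows how a second pipeline stage can read the first stage's output without another round trip to the cloud:

```python
cloud_store = {"raw/logs.txt": "line1\nline2"}  # simulated S3/Azure/GCS
memory_tier = {}                                # simulated memory tier
cloud_reads = 0                                 # count slow fetches

def read(path):
    """Read through the memory tier, falling back to (and caching from)
    the simulated cloud store on a miss."""
    global cloud_reads
    if path in memory_tier:          # memory hit: no network traffic
        return memory_tier[path]
    cloud_reads += 1                 # memory miss: fetch once and cache
    data = cloud_store[path]
    memory_tier[path] = data
    return data

def write(path, data):
    memory_tier[path] = data         # stage output stays in memory

# Stage 1: parse raw logs; Stage 2: count lines from stage 1's output.
write("tmp/parsed", read("raw/logs.txt").split("\n"))
line_count = len(read("tmp/parsed"))
print(line_count, cloud_reads)  # 2 lines processed, only 1 cloud read
```

The real system adds distribution, tiered storage and fault tolerance, but the performance argument is the same: intermediate data served from memory instead of repeated cloud round trips.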
Abstract:- ClickHouse is an open-source real-time analytical database that handles petabyte-scale data with massive linear scaling and an SQL-like language.
Abstract:- Imagine an industry that does not have an HR record for its employees. How do you comply with the Affordable Care Act's (ACA) health insurance eligibility determination when you don't know when someone started or stopped working for a particular company? That is the situation the entertainment industry faced in 2013 as the ACA loomed on the horizon. Entertainment Partners, the largest provider of payroll and related services to the entertainment industry's production workforce, set out to solve the problem. We coordinated across all of the industry's payroll providers and created a data analytics engine that ingests, aggregates and analyzes millions of transactions and determines which production workers meet the ACA eligibility criteria. We help the industry stay in compliance and avoid costly government penalties - and we used Big Data to solve the problem.