We’re excited to announce

Big Data Day LA 2017!

Join us on Saturday, August 5, 2017

at the University of Southern California.

Accepting registrations, sponsors, speakers and volunteers!

Startup Showcase / 2017 Big Data Day LA Startup Showcase in collaboration with TenOneTen Ventures

This year, we are excited to add something new to Big Data Day LA. In partnership with TenOneTen Ventures, we are bringing together some of the best data-driven startups in Southern California! Five startups will have the opportunity to pitch to a panel of judges ranging from VCs to data experts. The winner will receive a $1,500 cash prize, $1,000 in MongoDB Atlas DBaaS credit, and three strategy sessions with VCs. If you are interested in applying to pitch, learn more about the requirements and apply here. The deadline to apply is July 8th, 2017.

Register / 2017 Registration Now Open


About the Conference / What You Need To Know

Big Data Day LA is the largest conference of its kind in Southern California, and it is completely free. Spearheaded by Subash D’Souza and organized and supported by a community of volunteers, sponsors and speakers, Big Data Day LA features the most vibrant gathering of data and technology enthusiasts in Los Angeles.

The first Big Data Day LA conference was in 2013, with just over 250 attendees. We have since grown to over 550 attendees in 2014, 950+ attendees in 2015 and 1200+ attendees in 2016!

Our 2017 session tracks are:

  • Big Data
  • Data Science
  • Hadoop / Spark / Kafka
  • NoSQL
  • Use Case Driven
  • IoT (New)
  • Entertainment (New)
  • AI/ Machine Learning (New)

Subscribe to our mailing list


Attendees / See Who Will Be There

  • Data Scientists
  • Software Developers
  • System Architects
  • Head Researchers
  • Business Analysts
  • Data Engineers
  • Technical Leads
  • CEOs, CTOs, CIOs, etc.
  • IT Managers
  • Business Strategists
  • Data Analysts
  • Researchers
  • Head Data Scientists
  • Entrepreneurs
  • Consultants

Sponsors / 2017 Sponsors

We are lining up several awesome sponsors for BDDLA 2017! Email Subash D’Souza at sawjd@yahoo.com if you are interested in being a sponsor for our 2017 event.

  • University of Southern California (USC)
  • Aviana Global
  • Hortonworks
  • Netflix
  • RedisLabs
  • Vertica
  • Bigg Data
  • Data Application Lab
  • DataScience
  • GumGum
  • Mesosphere
  • Qubole
  • Walt Disney
  • DataXceed
  • Snowflake
  • Vupico
  • Archangel Technology Consultants, LLC
  • Javascript.LA
  • Los Angeles Big Data Users Group
  • RMDS
  • TenOneTen Ventures

Keynote Speakers / 2017

Abbass Sharif, PhD.

Academic Director, Master of Science in Business Analytics Program at the University of Southern California
Abbass is a professor of data science at the USC Marshall School of Business and the director of the MS in Business Analytics program. Professor Sharif specializes in statistical computing and data visualization; he has developed and published new multivariate visualization techniques for functional data and is currently developing visualization techniques to study brain activity data collected via near-infrared spectroscopy (NIRS) technology. Professor Sharif teaches statistics courses that range from introductory statistics and data analysis for decision-making through to advanced modern statistical learning techniques, statistical computing and data visualization.

Ben Welsh

Editor at Los Angeles Times Data Desk
Ben Welsh is the editor of the Los Angeles Times Data Desk, a team of reporters and computer programmers in the newsroom that works to collect, organize, analyze and present large amounts of information. He is also a cofounder of the California Civic Data Coalition, an open-source network of developers working to open up public data, and the creator of PastPages, an archive dedicated to the preservation of online news. Ben has worked at the Los Angeles Times since 2007.  Before working at The Times, Ben conducted data analysis for investigative projects at The Center for Public Integrity in Washington DC. Projects he has contributed to have been awarded the Pulitzer Prize, the Library of Congress' Innovation Award and numerous other prizes for investigative reporting, digital design and online journalism. Ben graduated from DePaul University in 2004. During his time there, he worked with Carol Marin and Don Moseley at the DePaul Documentary Project. He later earned a master’s degree from the Missouri School of Journalism — where he served as a graduate assistant at the National Institute for Computer-Assisted Reporting. He is originally from Swisher, Iowa.

David Waxman

Managing Director at TenOneTen Ventures
David Waxman is Managing Partner at TenOneTen Ventures, an LA-based venture capital firm that invests in data-driven businesses. Prior to TenOneTen, David accumulated nearly two decades of experience as a technology entrepreneur. After graduating with a master’s degree from MIT’s Media Lab in 1995, David co-founded Firefly, an early pioneer in personalization and privacy technology. Firefly was acquired by Microsoft in 1998, where the company’s flagship product became Microsoft Passport, the web’s first unified authentication and identity platform. After Firefly, David co-founded PeoplePC, a company dedicated to simplifying the process of joining the online world. PeoplePC served over 600,000 individual subscribers as well as Fortune 100 corporations such as Ford Motor Company, Vivendi Universal and Delta Air Lines. PeoplePC went public in 2001 and was acquired by EarthLink in 2002. In 2005, David co-founded Spot Runner, a Los Angeles-based technology company that worked to revolutionize the way advertising was created, planned, bought and sold. Since leaving Spot Runner, David has dedicated his time as an active mentor, speaker and investor to helping entrepreneurs realize their goals.

Eric Anderson

Product Manager at Google
Eric is a product manager at Google working on data processing and analytics, primarily Google Cloud Dataflow and Dataprep. Previously he worked at AWS on EC2 and at General Electric. He's on the Project Management Committee of the Alluxio open source project and speaks often about Apache Beam. He studied engineering at the University of Utah and received an MBA from Harvard.

Espree Devora

Founder and Creator at WeAreLATech
Espree Devora (espreedevora.com), "the Girl who Gets it Done," is the producer and host of the WeAreLATech, "Hello Customer" and #womenintech podcasts. All have hit #1 on iTunes New & Noteworthy across all categories. She has run the monthly LA Podcasters Meetup (https://www.meetup.com/Los-Angeles-Podcasters-Meetup/) since 2014. WeAreLATech (wearelatech.com) unites the LA tech community via the podcast, a calendar of all tech events happening in the city and an offline 'Experience' Club. The Club provides people working in tech with curated activities to step away from the computer, like horseback riding, escape rooms, food tours, archery and more. She has given talks on entrepreneurship to many organizations including USC Business School, CBS, South by Southwest and Georgetown MBA. Most recently she was listed by Inc. Magazine as one of the top 30 women in tech to follow.
  • WeAreLATech Podcast: wearelatech.fm
  • #womenintech Podcast: womenintech.fm

Ian Swanson

Founder and CEO at DataScience
An expert in analytics and data science, an accomplished entrepreneur, and a successful executive for such Fortune 500 companies as American Express and Sprint, Ian Swanson is at home in both startups and enterprise-level organizations. Swanson is currently the CEO and Founder of DataScience, a company that empowers data scientists with best-in-class tools, infrastructure, and expertise. Previously, he founded Sometrics, which launched the industry’s first global virtual currency platform in 2008 and was acquired by American Express in 2011. Prior to Sometrics, Swanson worked for the secure chat and messaging startup, Userplane, which was subsequently acquired by AOL.

Keith Camoosa

SVP, Data Intelligence at Warner Bros.
Keith oversees Data Intelligence for Warner Brothers. The Data Intelligence team aggregates and analyzes data generated from thousands of retailers and hundreds of millions of consumers interacting with Warner Brothers’ content on desktop, TV and mobile platforms. These data and insights are used to design, inform and optimize content, user experiences, advertising and general business decisions. Prior to joining Warner Brothers, Keith led the data & advanced analytics practice at IPG Mediabrands. Keith also worked at Yahoo!, TNS and Deutsch. His expertise includes consumer intelligence, business intelligence, attribution & optimization, database & programmatic marketing enablement.

Lauren Moores

Director, Data Analytics at Tala
Lauren Moores is a data geek and strategist who uses dirty, disparate data to optimize business practices with relevant metrics and insightful stories. Lauren has over 20 years of experience in data and technology strategy, science and data creation across various information and tech industries, and holds a PhD in Economics from Brown University. Lauren currently runs data analytics at Tala, a global fintech startup that transforms mobile behavioral and transactional data to provide financial services in emerging economies. She is also a current member of the data advisory board for USA for UNHCR, a non-profit which supports the UN Refugee Agency. Most recently, Lauren was chief evangelist for Dstillery, a cross-device digital intelligence platform, and chief science officer for EveryScreen Media, a mobile RTB and audience technology company acquired by Dstillery. Prior to EveryScreen, Lauren managed the science, data assets and data engineering teams at WPP’s Kantar Compete, building large, global, multi-sourced digital consumer research panels and consumer analytics. Lauren is an active speaker and writer on innovation through mobile data, technology and analytics, and has lectured at Brown, NYU and UCLA.

Maggie Jan

Director, Developer Advocacy at Keen.IO
Maggie Jan is the Director of Developer Advocacy at the analytics company Keen IO. In addition to running Keen IO's internal analytics, she teaches classes on analytics, helps customers, and provides analytics mentorship. She also works part-time as a Doctor of Optometry. Prior to Keen IO, Maggie was a Technical Architect at Accenture performing large-scale SAP implementations for Fortune 500 companies.

Mary Bui-Pham

VP of Operations for Publisher Products at Yahoo
Mary is responsible for all day-to-day operations of the Publisher Products team, which owns Yahoo's flagship properties such as the Yahoo Homepage, Yahoo Sports, Yahoo Daily Fantasy, Yahoo Finance, Yahoo Weather, Yahoo News, Tumblr, Polyvore, and Flurry. Mary is an 8-year Yahoo veteran and has held various leadership positions in Program Management, Design, and Business Operations. Prior to joining Yahoo, Mary led Engineering, Quality, Program Management, and Release Management teams at eBay and DoubleClick. Mary holds a Ph.D. in Chemical Engineering from the University of California, San Diego, where she conducted computational modeling of laminar flames.

Mike Warren

Co-Founder & CTO at Descartes Labs
Mike’s past work spans a wide range of disciplines, with the recurring theme of developing and applying advanced software and computing technology to understand the physical and virtual world. He was a scientist at Los Alamos National Laboratory for 25 years, and also worked as a Senior Software Engineer at Sandpiper Networks/Digital Island. His work has been recognized on multiple occasions, including the Gordon Bell prize for outstanding achievement in high-performance computing. He has degrees in Physics and Engineering & Applied Science from Caltech, and he received a PhD in Physics from University of California, Santa Barbara.

Shala Arshi

Sr. Director, Technology Enabling at Intel & Co-Founder at Women in Big Data
Shala has an extensive technical, management, marketing and business development background. Her roles at Intel have spanned from hands-on engineering work in supercomputing and parallel processing, focused on the message passing and networking aspects of the system, to managing an organization responsible for operating system and driver development for Intel server platforms, before she moved to Intel Capital. That shift gave Shala great business insight and equity investment experience. Shala is now focused on marketing, promotion and communication for several products and technologies within Intel's Software System Technologies and Optimization Division. Shala is also a big advocate for gender equity and one of the co-founders of Women in Big Data. Shala holds BS and MS degrees in Computer Science from Oregon State University.

Session Speakers / 2017

We are accepting Tech Talk Proposal submissions for 30 min (incl. Q&A) sessions in the following tracks:

  • Big Data
  • Data Science
  • Hadoop / Spark / Kafka
  • IoT (New!)
  • Entertainment (New!)
  • AI/ Machine Learning (New!)
  • NoSQL
  • Use Case Driven

Talk submissions are now closed.

Adam Mollenkopf

Real-Time & Big Data GIS Capability Lead at ESRI

Anand Ranganathan

VP of Solutions at Unscrambl

Andrea Trevino

Lead Data Scientist at DataScience

Andrew Psaltis

HDF/IoT Product Solutions Architect at Hortonworks

Andrew Waage

Co-Founder at Retention Science

Annette Martinez Novo

Director, Benefits at Entertainment Partners

Armen Donigian

Data Science Engineer at Zest Finance

Avinash Deshpande

Chief Software Architect at Logitech

Basavaraj Soppannavar

Technology & Marketing, GridDB at Toshiba America

Ben Coppersmith

Data Engineering Manager at Factual

Brendan Herger

Machine Learning Engineer at Capital One

Brian Bulkowski

CTO and Founder at Aerospike

Brian Dolan

Founder & Chief Scientist at Deep 6 AI

Brian Kursar

VP, Data Intelligence at Warner Bros

Brinkley Warren

Chief Marketing Officer & Co-Founder at Quantiacs

Chelsea Ursaner

Solutions Architect at Office of LA Mayor Eric Garcetti

Colin McCabe

Software Engineer at Confluent

David Hsieh

SVP Marketing at Qubole

Dylan Rogerson

Senior Data Scientist at Activision

Emad Hasan

Chief Operating Officer & Co-Founder at Retina AI

Eric Ruiz

LatAm Marketing Manager at Google Waze

Gene Pang

Software Engineer/ Founding Member at Alluxio

Henry Pak

Solutions Architect at Elastic

Hicham Mhanna

Vice President of Engineering at BCG Digital Ventures

Huiyu Deng

Research Assistant at USC

Ingo Mierswa

Founder & President at RapidMiner

Irina Kukuyeva

Senior Data Scientist at Dia & Co

Jason Lee

Principal, Advanced Analytics Group at Bain & Company

Jeff Weintraub

Vice President at theAmplify

Jerry Power

Executive Director, CTM at USC

Jill Dyche

Vice President, Best Practices at SAS

John De Goes

CTO at SlamData

John Sullivan

Director of Innovation at SAP NA

Josh Hemann

Director - Content Data Engineering & Analytics at Netflix

Joshua Poduska

Senior Data Scientist, Big Data Group at HPE Vertica

Jules Damji

Spark Community Evangelist at Databricks

Karthik Ramasamy

Co-Founder at Streamlio

Kivanc Yazan

Software Engineer at ZipRecruiter

Konstantin Boudnik

Chief Technologist at EPAM Systems

Kuassi Mensah

Director, Product Management at Oracle

Kurt Brown

Director, Data Platform at Netflix

Lawrence Spracklen

VP of Engineering at Alpine Data

Lilian Coral

Chief Data Officer at Office of LA Mayor Eric Garcetti

Lynn Langit

Big Data & Cloud Architect at Lynn Langit Consulting

Manas Bhat

Director - Finance and Strategy at Guitar Center

Matt Chapman

Senior Data Engineer at Tronc

Mehrdad Yazdani

Data Scientist at Open Medicine Institute & UCSD

Michael Lee Williams

Director of Research at Fast Forward Labs

Michael Limcaco

CTO at Agilisium Consulting

Michael Tiernay

R&D Data Scientist at Edmunds.com

Mohit Mehra

Lead Data Engineer at Children's Hospital Los Angeles (CHLA)

Monica Willbrand

Senior Business Consultant at Tableau Software

Neal Fultz

Principal Data Scientist, Optimization at OpenMail

Noelle Saldana

Principal Data Scientist at Pivotal

Olin Hyde

Founder & CEO at Leadcrunch AI

Peter Zaitsev

CEO at Percona

Raman Marya

Director, Data Engineering and Analytics at OpenTable

Roman Shaposhnik

VP Technology at ODPi

Roopal Garg

Data Scientist at GumGum

Roozbeh Davari

Data Scientist at The Honest Company

Sig Narvaez

Senior Solutions Architect at MongoDB

Stuart McCormick

Americas Digital Services Leader at Honeywell

Suresh Paulraj

Cloud Data Solution Manager/Architect at Microsoft

Tague Griffith

Developer Advocate at RedisLabs

Ted Malaska

Technical Group Architect at Blizzard Entertainment

Tom Webster

CEO at TONE

Vartika Singh

Senior Solutions Architect at Cloudera

Ward Bullard

Head of Product, Venues at Verizon

Yves Bergquist

Director, Data & Analytics Program at USC/ETC

Sessions / Track & Session Info

/ Big Data

Extending Analytic Reach - From The Warehouse to The Data Lake

by Michael Limcaco, CTO, Agilisium

The data marts and warehouses we work with often require us to scope our analytic questions around the finite amount of storage allocated to these enterprise components. With new innovations in the cloud space, we can leverage the near-infinite storage capacity of Data Lake object storage and use it as a foundational source that can be combined with online data in the warehouse. In this talk we present reference architecture patterns based on Amazon Redshift Spectrum, a new technology that enables you to run MPP warehouse SQL queries against exabytes of data in a backing object store. With Redshift Spectrum, customers can extend the analytic reach of their SQL interactions beyond data stored on local disks in the data warehouse to query vast amounts of unstructured data in the Amazon S3 Data Lake, without having to load or transform any data.
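
For readers who want a concrete picture of the pattern described above, here is a minimal sketch that issues Redshift Spectrum SQL from Python via psycopg2. The cluster endpoint, IAM role, catalog database, and table names are hypothetical placeholders, not details from the talk.

    import psycopg2

    # Connect to the Redshift cluster (placeholder endpoint and credentials)
    conn = psycopg2.connect(host="my-cluster.example.us-west-2.redshift.amazonaws.com",
                            port=5439, dbname="analytics", user="admin", password="...")
    cur = conn.cursor()

    # Register the S3 data lake as an external (Spectrum) schema backed by the data catalog
    cur.execute("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
        FROM DATA CATALOG DATABASE 'clickstream'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
        CREATE EXTERNAL DATABASE IF NOT EXISTS;""")

    # Describe Parquet files sitting in S3 as an external table; no loading required
    cur.execute("""
        CREATE EXTERNAL TABLE spectrum.events (
            user_id BIGINT, url VARCHAR(2048), event_time TIMESTAMP)
        STORED AS PARQUET
        LOCATION 's3://my-data-lake/events/';""")

    # Join lake-resident data with a local warehouse dimension table in a single query
    cur.execute("""
        SELECT u.segment, COUNT(*)
        FROM spectrum.events e
        JOIN dim_users u ON u.user_id = e.user_id
        GROUP BY u.segment;""")
    print(cur.fetchall())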

Spark Pipelines in the Cloud with Alluxio

by Gene Pang, Software Engineer, Alluxio

Organizations commonly use Big Data computation frameworks like Apache Hadoop MapReduce or Apache Spark to gain actionable insight from their large amounts of data. Often, these analytics take the form of data processing pipelines, where there is a series of processing stages, each stage performs a particular function, and the output of one stage is the input of the next stage. There are several examples of pipelines, such as log processing, IoT pipelines, and machine learning. The common attribute among different pipelines is the sharing of data between stages. It is also common for data pipelines to process data stored in the public cloud, such as Amazon S3, Microsoft Azure Blob Storage, or Google Cloud Storage. The global availability and cost effectiveness of these public cloud storage services make them the preferred storage for data. However, running pipeline jobs while sharing data via cloud storage can be expensive in terms of increased network traffic, and slower data sharing and job completion times. Using Alluxio, a memory-speed virtual distributed storage system, enables sharing data between different stages or jobs at memory speed. By reading and writing data in Alluxio, the data can stay in memory for the next stage of the pipeline, and this results in great performance gains. In this talk, we discuss how Alluxio can be deployed and used with a data processing pipeline in the cloud. We show how pipeline stages can share data with Alluxio memory for improved performance, and how Alluxio improves completion times and reduces performance variability for pipelines in the cloud.
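
As a rough illustration of the stage-to-stage sharing described above, here is a minimal PySpark sketch that writes an intermediate result to an Alluxio path so the next stage can read it from memory. The master hostname, the default port 19998, and the paths are assumptions for illustration (and the Alluxio client library must be on Spark's classpath); none of this is configuration from the talk.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("alluxio-pipeline").getOrCreate()

    # Stage 1: read raw logs (originally landed in cloud object storage) through Alluxio
    raw = spark.read.json("alluxio://alluxio-master:19998/raw/logs/")
    cleaned = raw.filter(raw.status == 200).select("user_id", "url", "ts")

    # Write the intermediate result back to Alluxio so it stays memory-resident
    cleaned.write.mode("overwrite").parquet("alluxio://alluxio-master:19998/stage1/cleaned/")

    # Stage 2 (possibly a separate job) picks up the shared data at memory speed
    stage2_input = spark.read.parquet("alluxio://alluxio-master:19998/stage1/cleaned/")
    stage2_input.groupBy("url").count().show()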

Big Data in Pediatric Critical Care

by Mohit Mehra, Lead Data Engineer, Children's Hospital Los Angeles

There is an urgent need in pediatric ICUs to collect, store and transform healthcare data to make accurate and timely predictions in the areas of patient outcomes and treatment recommendations. We are heavily invested in using open source big data stacks to achieve this goal and help our young ones. In this talk I will highlight how we manage the structured and unstructured high-frequency data generated by a disparate set of devices and systems, and ultimately how we have created data pipelines to process the data and make it available to data scientists and app developers.

Secure cloud environment for Data Science and Analytics

by Konstantin Boudnik, Chief Technologist, EPAM Systems

Secure concurrent access to data from on-demand compute clusters is a huge organizational and technical challenge. Few people possess the skills, experience and security expertise to create and control a clean cloud environment equipped with state-of-the-art data science, analytics and visualization technologies. Commercial products are often late with features, deliver unknown quality and security guarantees, and carry hefty license fees. Developed by EPAM engineers and publicly available under ALv2, the DLab framework addresses all of these concerns by providing an off-the-shelf, simple-to-use platform. DLab allows anyone to set up a completely secure cloud environment equipped with data science notebook software, a scalable cluster solution and a powerful compute engine based on Apache Spark.

How OpenTable uses Big Data to impact growth

by Raman Marya, Director, Data Analytics and Data Engineering, OpenTable

We have created a variety of analytics solutions combining data from our Data Lake with a traditional data warehouse: data APIs that feed into the product to improve conversions, a churn prediction algorithm that helps account managers focus on high-risk customers, and analytics used as an edge to empower the sales team to win prospective customers.

Building the modern data platform

by David Hsieh, Senior Vice President, Marketing, Qubole

The killer app for the public cloud is big data analytics. As IT evolves from a cost center to a true nexus of business innovation, data engineers, platform engineers and database admins need to build the enterprise of tomorrow: one that is scalable and built on a totally self-service infrastructure. Having an agile, open and intelligent big data platform is key to this transformation. Join Qubole for an overview of the five stages of transformation that companies need to follow.

Big Data for Good

by Jill Dyche, Vice President, Best Practices, SAS

In this keynote, SAS Vice President and non-profit founder Jill Dyche revisits the customer journey. After all, in the age of omnichannel and digital everything, your customers are taking a different path than the one your Marketing department mapped out all those years ago. But big data's reach transcends the unstructured data of big corporations. Jill will explain how a personal mission led her to the realization that big data can be applied to many different journeys. She'll tell a story of how her work in the social sector with big data and analytics not only helps animal shelters refine outreach programs, but can save lives!

Weibull Analysis: Tableau + R Integration

by Monica Willbrand, Senior Business Consultant, Tableau

Weibull reliability analysis predicts the life of products by fitting a distribution to a plot based on a population of units; multiple proprietary software applications are available to perform the analysis. The advent of Tableau + R Integration empowers data scientists and reliability experts to make inferences about a population's failure characteristics by considering the shape parameter (beta) of the distribution. With beta, we plot F(t), or unreliability over time, when leveraging Tableau + R Integration (R scripts in Tableau calculated fields, pointing to an R Server library for row-level execution). The Weibull analysis performed is superior to the Kaplan-Meier method because it enables the more accurate Maximum Likelihood Estimate (MLE) curve fitting of the plotted regression, as opposed to Least Squares Estimate (LSE), which works without R Integration but fails to precisely match the parameters (shape, slope) that sophisticated existing reliability software packages produce. Application of Weibull for reliability analysis considers failure at a given time in the lifespan (t), where t = miles, cycles, hours, etc. The two-parameter distribution performed in this analysis includes beta and eta, the shape and scale parameters, respectively. Mean Time To Failure (MTTF) calculations are derived from these parameters as well. Variable Confidence Interval (CI) bands are used and can be adjusted using the interactive Tableau visualization. Industries utilizing Weibull analysis to plot the Bathtub Curve assess the infant mortality, normal useful life, and end-of-life failures anticipated for a product (e.g. semiconductor chips, automotive parts, medical devices).
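
The abstract centers on MLE fitting of a two-parameter Weibull distribution. As a point of reference, here is a minimal Python sketch of the same calculation outside Tableau, using scipy; the simulated failure data and parameter values are purely illustrative.

    import numpy as np
    from scipy.stats import weibull_min
    from scipy.special import gamma

    # Simulated failure times (e.g. hours to failure); replace with real field data
    np.random.seed(0)
    failures = weibull_min.rvs(1.8, loc=0, scale=1000, size=200)

    # Maximum Likelihood Estimate of the two-parameter Weibull (location fixed at 0)
    shape, loc, scale = weibull_min.fit(failures, floc=0)   # shape = beta, scale = eta

    # Mean Time To Failure derived from the fitted parameters
    mttf = scale * gamma(1 + 1 / shape)

    # Unreliability F(t): probability of failure by time t
    t = np.linspace(0, 3000, 50)
    F_t = weibull_min.cdf(t, shape, loc=0, scale=scale)
    print("beta=%.2f, eta=%.1f, MTTF=%.1f" % (shape, scale, mttf))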

/ Data Science

Opening the black box: Attempts to understand the results of machine learning models

by Michael Tiernay, R&D Data Scientist, Edmunds.com

Sophisticated machine learning models (like GBMs and neural networks) produce better predictions than simpler models (like linear or logistic regression), but sophisticated models do not produce interpretable 'effects' that specify the relationship between predictors and the outcome. This is because sophisticated models can learn non-linear, interactive, or even higher-level relationships between the predictors and outcome without those relationships being explicitly specified. In many settings it is important to understand, as best as possible, how 'black box' models produce their predictions, because: (1) if users do not understand how a prediction is being made, they may not trust the model or prediction enough to act upon its suggestions; (2) significant business value can be derived from understanding what drives an outcome of interest (e.g. purchase or churn) in order to make product changes that accentuate or minimize desired effects; and (3) understanding how predictors relate to an outcome can inform subsequent feature generation that improves a model's predictive power. This talk will discuss two methods that have been proposed to better understand machine learning models: simulating changes in input variables (the R ICEbox package) and building a simpler model locally around specific predictions (the Python LIME package).
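
To make the second method concrete, here is a minimal sketch of using the Python LIME package to explain a single prediction from a black-box classifier; the dataset and model are stand-ins for illustration, not anything from the talk.

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from lime.lime_tabular import LimeTabularExplainer

    data = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

    # A "black box" model whose individual predictions we want to explain
    model = GradientBoostingClassifier().fit(X_train, y_train)

    # LIME fits a simple, interpretable model locally around one prediction
    explainer = LimeTabularExplainer(X_train, feature_names=data.feature_names,
                                     class_names=data.target_names, discretize_continuous=True)
    explanation = explainer.explain_instance(X_test[0], model.predict_proba, num_features=5)
    print(explanation.as_list())   # top local feature contributions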

Probabilistic programming products

by Michael Lee Williams, Director of Research, Fast Forward Labs

Algorithmic innovations like NUTS and ADVI, and their inclusion in end user probabilistic programming systems such as PyMC3 and Stan, have made Bayesian inference a more robust, practical and computationally affordable approach. I will review inference and the algorithmic options, before describing two prototypes that depend on these innovations: one that supports decisions about consumer loans and one that models the future of the NYC real estate market. These prototypes highlight the advantages and use cases of the Bayesian approach, which include domains where data is scarce, where prior institutional knowledge is important, and where quantifying risk is crucial. Finally I'll touch on some of the engineering and UX challenges of using PyMC3 and Stan models not only for offline tasks like natural science and business intelligence, but in live end-user products.
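
For orientation, here is a minimal PyMC3 sketch of the kind of Bayesian regression the talk refers to, showing both NUTS sampling and the ADVI alternative; the toy data and priors are illustrative only, not the loan or real estate prototypes described above.

    import numpy as np
    import pymc3 as pm

    # Toy data: a noisy linear relationship
    np.random.seed(1)
    x = np.random.randn(100)
    y = 2.5 * x + 1.0 + np.random.randn(100)

    with pm.Model() as model:
        alpha = pm.Normal("alpha", mu=0, sd=10)          # prior on intercept
        beta = pm.Normal("beta", mu=0, sd=10)            # prior on slope
        sigma = pm.HalfNormal("sigma", sd=5)             # prior on noise scale
        mu = alpha + beta * x
        pm.Normal("y_obs", mu=mu, sd=sigma, observed=y)  # likelihood

        trace = pm.sample(1000, tune=1000)               # NUTS by default
        approx = pm.fit(method="advi")                   # variational alternative (ADVI)

    print(pm.summary(trace))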

Deep Learning for Natural Language Processing

by Roopal Garg, Data Scientist, GumGum Inc

The talk will focus on how Neural Networks are applied in the field of NLP for tasks like classification. Building blocks like Word Embeddings, Recurrent NN, LSTM, GRU, Convolutional NN, Sentence Representation and how they are applied to a piece of text in Tensorflow will be covered. These building blocks can be stacked together in various ways to form deeper network architectures. We will discuss one such architecture which is used within GumGum Inc to do Sentiment Analysis on web pages using NN in Tensorflow.
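
As a rough sketch of stacking those building blocks, here is a tiny sentiment classifier written against TensorFlow's Keras API, chaining a word-embedding layer, an LSTM, and a dense output. The vocabulary size and sequence length are arbitrary placeholders, and this is not GumGum's actual architecture.

    import tensorflow as tf

    vocab_size, max_len = 20000, 200   # placeholder hyperparameters

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, 128, input_length=max_len),  # word embeddings
        tf.keras.layers.LSTM(64),                                          # recurrent encoder
        tf.keras.layers.Dense(1, activation="sigmoid"),                    # sentiment score
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.summary()
    # model.fit(padded_token_ids, labels, epochs=3, validation_split=0.1)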

Machine Learning in Healthcare

by Mehrdad Yazdani, Data Scientist, Open Medicine Institute and UC San Diego

Using Machine Learning to Identify Major Shifts in Human Gut Microbiome Protein Family Abundance in Disease: Inflammatory Bowel Disease (IBD) is an autoimmune condition that is associated with major alterations in the taxonomic composition of the gut microbiome. Here we classify major changes in microbiome protein family abundances between healthy subjects and IBD patients. We use machine learning to analyze results obtained previously from computing the relative abundance of ~10,000 KEGG orthologous protein families in the gut microbiome of a set of healthy individuals and IBD patients. We develop a machine learning pipeline, involving the Kolmogorov-Smirnov test, to identify the 100 most statistically significant entries in the KEGG database. We then use these 100 as a training set for a Random Forest classifier to determine the ~5% of KEGGs that are best at separating disease and healthy states. Lastly, we developed a Natural Language Processing classifier of the KEGG description files to predict KEGG relative over- or under-abundance. As we expand our analysis from 10,000 KEGG protein families to one million proteins identified in the gut microbiome, scalable methods for quickly identifying such anomalies between health and disease states will be increasingly valuable for biological interpretation of sequence data.
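
A highly simplified sketch of the pipeline shape described above (Kolmogorov-Smirnov screening followed by a Random Forest), using scipy and scikit-learn on made-up abundance matrices; the array names, sizes, and random data are hypothetical stand-ins for the real cohort.

    import numpy as np
    from scipy.stats import ks_2samp
    from sklearn.ensemble import RandomForestClassifier

    # Hypothetical relative-abundance matrices: rows = subjects, columns = KEGG protein families
    rng = np.random.RandomState(0)
    healthy = rng.rand(50, 10000)
    ibd = rng.rand(60, 10000)

    # 1) Kolmogorov-Smirnov test per protein family; keep the 100 most significant
    pvals = np.array([ks_2samp(healthy[:, j], ibd[:, j]).pvalue for j in range(healthy.shape[1])])
    top100 = np.argsort(pvals)[:100]

    # 2) Random Forest on the screened features to rank which KEGGs separate the classes
    X = np.vstack([healthy[:, top100], ibd[:, top100]])
    y = np.array([0] * len(healthy) + [1] * len(ibd))
    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
    best = top100[np.argsort(rf.feature_importances_)[::-1][:5]]
    print("Most discriminative KEGG columns:", best)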

How to Ruin your Business with Data Science & Machine Learning

by Ingo Mierswa, Founder & President, RapidMiner

Everyone talks about how machine learning will transform business forever and generate massive outcomes. However, it's surprisingly simple to draw completely wrong conclusions from statistical models, and 'correlation does not imply causation' is just the tip of the iceberg. The trend toward democratization of data science further increases the risk of applying models in the wrong way. This session will discuss: how highly correlated features can overshadow the patterns your machine learning model is supposed to find, which leads to models that perform worse in production than during model building; how incorrect cross-validation leads to over-optimistic estimates of your model's accuracy, with particular attention to the impact of data preprocessing on the accuracy of machine learning models; and how feature engineering can lift simple models like linear regression to the accuracy of deep learning while retaining the advantages of understandability and robustness.
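
One of the pitfalls named above, preprocessing outside cross-validation, is easy to show in code. Here is a minimal scikit-learn sketch contrasting the leaky and the correct approach on a synthetic dataset with no real signal; the dataset and parameters are illustrative, not from the talk.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline

    # Pure noise: any honest estimate of accuracy should hover around 0.5
    rng = np.random.RandomState(0)
    X, y = rng.randn(100, 5000), rng.randint(0, 2, 100)

    # WRONG: feature selection sees all labels before cross-validation -> optimistic estimate
    X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
    wrong = cross_val_score(LogisticRegression(), X_leaky, y, cv=5).mean()

    # RIGHT: keep preprocessing inside the CV loop so each fold selects on training data only
    pipe = Pipeline([("select", SelectKBest(f_classif, k=20)),
                     ("clf", LogisticRegression())])
    right = cross_val_score(pipe, X, y, cv=5).mean()

    print("leaky CV accuracy: %.2f, honest CV accuracy: %.2f" % (wrong, right))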

Data Science: Good, Bad and Ugly

by Irina Kukuyeva, Senior Data Scientist, Dia & Co

As a data scientist, I get to see a broad spectrum of the 'good', 'bad', and 'ugly' implementations of engineering and data practices. I'd be happy to share my tips and experiences with the broader community: the do's and don'ts of working with data in production, for collaboration, and for getting actionable insights.

Data Augmentation and Disaggregation

by Neal Fultz, Principal Data Scientist, Optimization at OpenMail

Machine learning models may be very powerful, but many data sets are only released in aggregated form, precluding their use directly. Various heuristics can be used to bridge the gap, but they are typically domain-specific. The data augmentation algorithm, a classic tool from Bayesian computation, can be applied more generally. We will present a brief review of DA and how to apply it to disaggregation problems. We will also discuss a case study on disaggregating daily pricing data, along with a reference implementation R package.

Deriving Conversational Insight by Learning Emoji Representations

by Jeff Weintraub, Vice President, theAmplify

It is a rare occurrence to observe the rise of a new language amongst a population. It is an even more rare occurrence to observe the adoption of such a language on a global scale. Since the introduction of the emoji keyboard on iOS in 2011, the use of emojis in textual communication has steadily grown into a common vernacular on social media. As of April 2015, Instagram reported that nearly half of all text contained emojis and, in some countries, over 60% of texts contained emoji characters. For power users of social media as well as for marketers looking for audiences on these platforms, it is becoming increasingly imperative to capture emoji data and derive insight from its use; to better understand what intent or meaning the usage carries in the conversation. Jeff Weintraub, VP of Technology at theAmplify, a creative Brandtech Influencer Service and a subsidiary of You & Mr Jones, the World's First Brandtech Group, will briefly summarize the data science behind learning emoji representations and also present recent trends in emoji usage within the context of advertising and branded marketing campaigns on social media.

/ Hadoop / Spark / Kafka

Operationalizing Data Science with Apache Spark

by Lawrence Spracklen, VP of Engineering, Alpine Data

Today, in many data science projects, the sole focus is the complexity of the algorithms being used to address the data problem. While this is a critical consideration, without attention to how the resulting insights can be disseminated through the broader enterprise, many projects end up dying on the vine. This presentation will highlight not only that a turnkey model operationalization strategy is critical to the success of enterprise data science projects, but also how this can be achieved using Spark. Today Spark enables data scientists to perform sophisticated analyses using complex machine learning algorithms. Even when the size of the datasets is measured in terabytes, Spark provides a broad selection of machine learning algorithms that scale effortlessly. However, the current process for the business to leverage the results of these analyses is far less sophisticated. Indeed, results are frequently communicated by PowerPoint presentation rather than through a turnkey solution for deploying improved models into production. In this session, we discuss the current challenges associated with operationalizing these results. We discuss the challenges associated with turnkey model operationalization, including the shortcomings of model serialization standards such as PMML for expressing the complex pre- and post-processing of data that is critical to effortless operationalization. Finally, we discuss in detail the potential for turnkey model operationalization with the emerging PFA standard, and highlight how the use of PFA can be achieved using Spark, including how PFA model scoring can be supported using Spark Streaming, and our efforts to drive support for PFA model export into MLlib.

Deep Learning Frameworks Using Spark on YARN

by Vartika Singh, Senior Solutions Architect, Cloudera

Traditional machine learning and feature engineering algorithms are not efficient enough to extract the complex and nonlinear patterns that are hallmarks of big data. Deep learning, on the other hand, helps translate the scale and complexity of the data into solutions like molecular interaction in drug design, the search for subatomic particles and automatic parsing of microscopic images. Co-locating a data processing pipeline with a deep learning framework makes data exploration and algorithm and model evolution much simpler, while streamlining data governance and lineage tracking into a more focused effort. In this talk, we will discuss and compare the different deep learning frameworks that run on Spark in a distributed mode, their ease of integration with the Hadoop ecosystem, and their relative feature parity.

Building Microservices with Apache Kafka

by Colin McCabe, Software Engineer, Confluent

Building distributed systems is challenging. Luckily, Apache Kafka provides a powerful toolkit for putting together big services as a set of scalable, decoupled components. In this talk, I'll describe some of the design tradeoffs when building microservices, and how Kafka's powerful abstractions can help. I'll also talk a little bit about what the community has been up to with Kafka Streams, Kafka Connect, and exactly-once semantics.
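
To ground the idea of decoupled services talking through Kafka, here is a minimal Python sketch using the kafka-python client, with one service publishing order events and another consuming them independently; the topic name, broker address, and consumer group are hypothetical placeholders, not examples from the talk.

    import json
    from kafka import KafkaProducer, KafkaConsumer

    # "Orders" service publishes an event; it knows nothing about downstream consumers
    producer = KafkaProducer(bootstrap_servers="localhost:9092",
                             value_serializer=lambda v: json.dumps(v).encode("utf-8"))
    producer.send("orders", {"order_id": 42, "total": 19.99})
    producer.flush()

    # "Billing" service consumes the same topic independently, at its own pace
    consumer = KafkaConsumer("orders",
                             bootstrap_servers="localhost:9092",
                             group_id="billing-service",
                             auto_offset_reset="earliest",
                             value_deserializer=lambda b: json.loads(b.decode("utf-8")))
    for message in consumer:
        print("billing order", message.value["order_id"])
        break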

A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets.

by Jules Damji, Spark Community Evangelist, Databricks Inc

Of all the developer delights, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for distributed big data processing at scale. In this talk, I will explore the evolution of three sets of APIs - RDDs, DataFrames, and Datasets - available in Apache Spark 2.x. In particular, I will emphasize why and when you should use each set as a best practice, outline their performance and optimization benefits, and underscore scenarios when to use DataFrames and Datasets instead of RDDs for your big data distributed processing. Through simple notebook demonstrations with API code examples, you'll learn how to process big data using RDDs, DataFrames, and Datasets and interoperate among them.
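
As a quick taste of the contrast the talk draws, here is a PySpark sketch of the same filter expressed against an RDD and a DataFrame (the typed Dataset API is Scala/Java only, so it is omitted here); the sample records are made up.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("three-apis").getOrCreate()

    records = [("alice", 34), ("bob", 29), ("carol", 41)]

    # RDD API: functional transformations on opaque objects; no optimizer insight
    rdd = spark.sparkContext.parallelize(records)
    adults_rdd = rdd.filter(lambda row: row[1] >= 30)

    # DataFrame API: declarative and columnar, optimized by Catalyst/Tungsten
    df = spark.createDataFrame(records, ["name", "age"])
    adults_df = df.filter(df.age >= 30)

    print(adults_rdd.collect())
    adults_df.show()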

VariantSpark a library for genomics

by Lynn Langit, Big Data & Cloud Architect, Lynn Langit Consulting

See the VariantSpark library in action on a Databricks Jupyter notebook

Turning Relational Database Tables into Hadoop Datasources

by Kuassi Mensah, Director - Product Management, Oracle Corp

This session presents a Hadoop DataSource implementation for integrating and joining Big Data with Master Data in RDBMS.

Data Science Out of The Box : Case Studies in the Telecommunications Industry

by Anand Ranganathan, Vice President of Solutions, Unscrambl

Telecommunications service providers (or telcos) have access to massive amounts of historical and streaming data about subscribers. However, it often takes them a long time to build, operationalize and gain value from various machine learning and analytic models. This is true even for relatively common use-cases like churn prediction, purchase propensity, next topup or purchase prediction, subscriber profiling, customer experience modeling, recommendation engines and fraud detection. In this talk, I shall describe our approach to tackling this problem, which involved having a pre-packaged set of analytic pipelines on a scalable Big Data architecture that work on several standard and well known telco data formats and sources, and that we were able to reuse across several different telcos. This allows the telcos to deploy the analytic pipelines on their data, out of the box, and go live in a matter of weeks, as opposed to the several months it used to take if they started from scratch. In the talk, I shall describe our experiences in deploying the pre-packaged analytic pipelines with several telcos in North America, South East Asia and the Middle East. The pipelines work on a variety of historical and streaming data, including call data records having voice, SMS and data usage information, purchase and recharge behavior, location information, browsing/clickstream data, billing and payment information, smartphone device logs, etc. The pipelines run on a combination of Spark and Unscrambl BRAINTM, which includes a real-time machine learning framework, a scalable profile store based on Redis and an aggregation engine that stores efficient summaries of time-series data. I shall describe some of the machine learning models that get trained and scored as part of these pipelines. I shall also remark on how reusable certain models are across different telcos, and how a similar set of features can be used for models like next topup or purchase prediction, churn prediction and purchase propensity across similar telcos in different geographies.

/ NoSQL

Real-Time Analytics in Transactional Applications

by Brian Bulkowski, Chief Technology Officer & Founder, Aerospike, Inc.

BI and analytics are at the top of corporate agendas. Competition is intense, and, more than ever, organizations require fast access to insights about their customers, markets, and internal operations to make better decisions, often in real time. Enterprises face challenges powering real-time business analytics and systems of engagement (SOEs). Analytic applications and SOEs need to be fast and consistent, but traditional database approaches, including RDBMS and first-generation NoSQL solutions, can be complex, a challenge to maintain, and costly. Companies should aim to simplify traditional systems and architectures while also reducing vendors. One way to do this is by embracing an emerging hybrid memory architecture, which removes an entire caching layer from your front-end application. This talk discusses real-world examples of implementing this pattern to improve application agility and reduce operational database spend.

Anomaly Detection in Time-Series Data using the Elastic Stack

by Henry Pak, Solutions Architect, Elastic

Elastic has released a commercial machine learning plugin that allows you to create a model of your time-series data using an unsupervised machine learning approach. We will walk through a few common use cases to see how this plugin can help with finding anomalies in your data.

Logitech Accelerates Cloud Analytics Using Data Virtualization

by Avinash Deshpande, Chief Software Architect, Logitech

Many firms are adopting a cloud-first strategy and are migrating their on-premises technologies to the cloud. Logitech is one of them. We have adopted the AWS platform and big data in the cloud for all of our analytical needs, including Amazon Redshift and S3. In this presentation, I will cover: the business rationale for migrating to the cloud; how data virtualization enables the migration; and running data virtualization itself in the cloud.

Serverless Architectures with AWS Lambda and MongoDB Atlas

by Sig Narvaez, Senior Solutions Architect, MongoDB

It's easier than ever to power serverless architectures with managed database services like MongoDB Atlas. In this session, we will explore the rise of serverless architectures and how they've rapidly integrated into public and private cloud offerings. We will demonstrate how to build a simple REST API using AWS Lambda functions, create a highly available cluster in MongoDB Atlas, and connect both via VPC peering. We will then simulate load, use the monitoring and scaling features of MongoDB Atlas, and use MongoDB Compass to browse our database.
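
A bare-bones sketch of the Lambda-to-Atlas piece using pymongo is shown below; the environment variable, database, and collection names are placeholders, and the real demo also involves API Gateway, VPC peering, and Atlas monitoring that this snippet does not cover.

    import os
    from pymongo import MongoClient

    # Create the client outside the handler so warm Lambda invocations reuse the connection
    client = MongoClient(os.environ["ATLAS_URI"])   # e.g. a MongoDB Atlas connection string
    restaurants = client["demo"]["restaurants"]

    def handler(event, context):
        """Minimal REST-style insert: the POST body arrives in `event`."""
        doc = {"name": event["name"], "cuisine": event.get("cuisine", "unknown")}
        result = restaurants.insert_one(doc)
        return {"statusCode": 201, "body": str(result.inserted_id)}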

NoSQL on MySQL - MySQL Document Store

by Peter Zaitsev, Chief Executive Officer, Percona

Should you use SQL or a NoSQL engine? With MySQL Document Store you can do both. In this talk we will introduce MySQL Document Store and discuss its advantages and downsides compared to purpose-built document store database engines such as MongoDB.
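
For flavor, here is a small sketch of the document-style side using the MySQL X DevAPI Python connector (mysqlx); the host, credentials, schema, and collection names are placeholders, and the exact calls should be checked against the connector version you run.

    import mysqlx

    # The X Protocol listens on port 33060 by default (separate from classic 3306)
    session = mysqlx.get_session({"host": "127.0.0.1", "port": 33060,
                                  "user": "app", "password": "secret"})
    schema = session.get_schema("store")

    # Work with schemaless JSON documents instead of rows
    products = schema.create_collection("products")
    products.add({"name": "guitar", "price": 299, "tags": ["music", "strings"]}).execute()

    # Query documents with a CRUD-style API; the same data remains reachable via SQL too
    result = products.find("price < :p").bind("p", 500).execute()
    for doc in result.fetch_all():
        print(doc["name"])

    session.close()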

Quark: A Scala DSL For Data Processing & Analytics

by John De Goes, CTO, SlamData Inc.

Quark is a new Scala DSL for performing high-performance data processing and analytics on sources of structured and semi-structured data. Built with an advanced optimizing analytics compiler, Quark can push computation down to any supported data source, offering performance typically seen only with hand-written code. Quark's type-safe API allows developers to create correct pipelines and analytics, while the flexibility provides an ability to directly manipulate semi-structured data formats like JSON and XML. Quark has native support for MongoDB, MarkLogic, Couchbase, and any data source with a Spark connector.

Purpose-built NoSQL Database for IoT

by Basavaraj Soppannavar, Technology & Marketing - GridDB, Toshiba

The talk covers: (1) properties of IoT data and database requirements for handling it; (2) an introduction to GridDB, a purpose-built NoSQL database for IoT; (3) IoT data and time-series data modeling; and (4) real-world deployed IoT use cases.

Crowd Surfing Tweets

by Kivanc Yazan, Software Engineer, ZipRecruiter

It's easy to collect millions of tweets, but not so easy to get the right ones! During my senior year at college, we built a tool that lets you "surf" through Twitter. We were able to catch on-the-fly hashtags and add them into our search query, automagically. We tested it during elections and sporting events for tweets in both Turkish and English.

/ Use Case Driven

Data is cheap; strategy still matters

by Jason Lee, Principal, Advanced Analytics Group, Bain & Company

What could a strategy consulting firm have to do with, or say about, big data? We see Big Data leading the way on new products but also disrupting our clients' business processes and business models. For many clients and big data fans, the temptation is to think big data and machine learning disrupt the need for strategy: just throw the data in the lake with a bunch of programmers holding machine learning fishing poles and we will be done. Here is a rapid-fire review of what really happens. Use case 1: Use case 2: Use case 3: What did we learn working with these clients? Strategy still matters. Data is cheap; attention is not. While data and computational power are increasingly plentiful, people have limited attention and energy. Complexity can kill, not so much in the model itself but in how it affects processes and decisions. Data is not so cheap after all: we continue to underappreciate data architecture, governance, and engineering, which frequently take up most of the effort required for analytics success. Winning with Big Data is often less about the latest technology platform than about our strategy, culture, organizational capabilities, the way we implement algorithms, how we make decisions with data, and the impacts these have on employees and customers.

Delivering Quality Open Data

by Chelsea Ursaner, Solutions Architect, Office Of LA Mayor Eric Garcetti

The value of data is exponentially related to the number of people and applications that have access to it. The City of Los Angeles embraces this philosophy and is committed to opening as much of its data as it can in order to stimulate innovation, collaboration, and informed discourse. This presentation will be a review of what you can find and do on our open data portals as well as our strategy for delivering the best open data program in the nation.

Optimizing Online Advertising With Data Science

by Andrea Trevino, Lead Data Scientist, DataScience.com

Optimizing online display advertising is a complicated task that often requires a combination of domain knowledge and intuition on the part of a marketer. But with so much advertising occurring within an algorithmic marketplace, at large scale, and under deadline, getting the most out of your advertising dollars can be optimized with data science. In this talk, we delve into a custom data science solution for optimizing Facebook advertising campaigns, and how that solution ultimately saves time and boosts profits far beyond what intuition alone can achieve.

Artisanal Data

by Ben Coppersmith, Data Engineering Manager, Factual, Inc.

We have lots of data at Factual. But to solve some of our harder problems, we need to get down and dirty with our data -- to examine, evaluate, and experience it (sometimes even smell it). This talk attempts to re-brand this kind of work with the new, alternative buzzword "Artisanal Data". I review the Artisanal Data technologies and techniques we use at Factual, including how we document experiments so that they get read, evaluate failure modes and judge successes, and keep our annotation data as accurate as possible. With the right statistical precautions, Artisanal Data can be used to communicate the impact of our data more effectively and emotionally.

Big Data on The Rise: Views of Emerging Trends & Predictions from real life end-users

by Roman Shaposhnik, VP Technology, ODPi

There are some key trends emerging in 2017 within the Hadoop and Big Data ecosystem, which center around the increasing use of the cloud. These trends are the underpinning for a larger shift toward purpose-driven products positioned as the core of an organization's data strategy. Certainly there are examples of this in early adopters that are mature in their deployments, but what about those more traditional end-user organizations in the midst of a digital transformation? How does the relationship between IT and the needs of a growing data science practice align with business development? What are their views on these trends? How are their organizations reconciling the needs and desires to pave a path forward? In this session, Roman will present ODPi's findings and end-user views of Big Data trends based on data from the ODPi End User Advisory Board (TAB). Audiences will get real end-user perspectives from companies such as GE about how they are using Big Data tools, the challenges they face and where they are looking to focus investments - all from a vendor-neutral viewpoint.

Diversity in Data Science: why it's important and challenging

by Noelle Saldana, Principal Data Scientist, Pivotal

Data Scientists come in all shapes, sizes, and personalities, from perhaps a more diverse set of academic and industrial backgrounds than other jobs in tech. This talk explores ways to hire a team with complementary skill sets and backgrounds, the obvious and not-so-obvious benefits of diversity, and challenges teams face when learning to work together.

Democratizing Hedge Funds

by Brinkley Warren, Chief Marketing Officer & Co-Founder, Quantiacs.com

Learn how to leverage your data science skills to earn a fortune as a freelance quant in your spare time. At Quantiacs, we run the world's largest quant finance algorithm competition and host the world's only marketplace for quantitative trading algorithms. We provide 26 years of free financial data, an open-source toolkit in multiple languages, and access to investment capital. The audience will learn about the future of quantitative finance and how to cash in.

A Gentle Introduction to GPU Computing

by Armen Donigian, Data Science Engineer, ZestFinance

As data science continues to mature and evolve, the demand for more computationally powerful machines is rising. GPU computing provides the core capabilities that data scientists today are looking for, and when implemented effectively, it accelerates deep learning, analytics and other sophisticated engineering applications. During this talk, Armen Donigian, Data Science Engineer at ZestFinance, will introduce the GPU programming model and parallel computing patterns, as well as practical implications of GPU computing, such as how to accelerate applications on a GPU with CUDA (C++/Python), GPU memory optimizations, and multi-GPU programming with MPI and OpenACC. As an example of how GPU programming can be implemented in real-life business models, Armen will present how ZestFinance has successfully tapped into the power of GPU computing for the deep learning algorithm behind its new platform, the Zest Automated Machine Learning platform (ZAML). Currently, ZAML is used by major tech, credit and auto companies to successfully apply cutting-edge machine learning models to their toughest credit decisioning problems. ZAML leverages GPU computing for data parallelism, model parallelism and training parallelism.
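
To make the GPU programming model concrete without CUDA C++, here is a minimal data-parallel kernel written in Python with Numba's CUDA support; it is a generic vector-add illustration under the assumption that a CUDA-capable GPU and Numba are available, not anything from the ZAML platform.

    import numpy as np
    from numba import cuda

    @cuda.jit
    def vector_add(a, b, out):
        i = cuda.grid(1)            # global thread index
        if i < out.size:            # guard against out-of-range threads
            out[i] = a[i] + b[i]

    n = 1000000
    a = np.random.rand(n).astype(np.float32)
    b = np.random.rand(n).astype(np.float32)
    out = np.zeros_like(a)

    threads_per_block = 256
    blocks = (n + threads_per_block - 1) // threads_per_block
    vector_add[blocks, threads_per_block](a, b, out)   # Numba copies arrays to/from the device

    assert np.allclose(out, a + b)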

/ Entertainment

Big Video Game Data: Leveraging Large Scale Datasets to Make the Most of In-game Decisions

by Dylan Rogerson, Senior Data Scientist, Activision Publishing Inc

A colleague of mine once asked, "Why would you ever need over 100k observations for ...?" It's surprising that many models we develop don't need much data to reach a good level of accuracy. In this talk we'll discuss how Activision leverages large datasets and the feature space from the Call of Duty series to build complex models. We'll also talk about how to transfer these learnings to more digestible simple models and how accuracy in these models translates to usable in-game action. Finally we'll showcase some of our model development pipeline and thoughts on when you really do need millions or billions of data points to make substantial improvements to the game.

Big Data at Blizzard

by Ted Malaska, Group Architect of Engineering Systems, Blizzard

Blizzard, the creator of leading games such as Hearthstone, World of Warcraft, and Overwatch, has BIG data. In this session we will give a glimpse into some of the awesome things we are doing at Blizzard to make our games better through big data.

How EP used Big data to Solve the Entertainment Industry's ACA Compliance Requirement

by Annette Novo, Director, Benefit Solutions, Entertainment Partners

Imagine an industry that does not have an HR record for its employees. How do you comply with the Affordable Care Act's (ACA) health insurance eligibility determination when you don't know when someone started or stopped working for a particular company? That is the situation the entertainment industry faced in 2013 as the ACA loomed on the horizon. Entertainment Partners, the largest provider of payroll and other related services to the entertainment industry for its production workforce, set out to solve the problem. We coordinated across all of the industry's payroll providers and created a data analytics engine that ingests, aggregates and analyzes millions of transactions and determines which of their production workers meet the ACA eligibility criteria. We help the industry stay in compliance and avoid costly government penalties, and we used Big Data to solve the problem.

The AI Takeover in Hollywood

by Yves Bergquist, Director, Data & Analytics, Entertainment Technology Center at USC

As the entertainment industry faces a landscape of exponential opportunities and threats, it is quietly turning to artificial intelligence to manage risk, develop operational efficiencies, and make more data-driven decisions. From developing cognitive solutions to assess why we think certain films and characters are more interesting than others, to isolating granular, scene-level story and character mechanics that drive better box office returns, Hollywood has fully caught up with other industries in leveraging high-end analytics methods and tools. As the director of the Data & Analytics Project at USC's prestigious Entertainment Technology Center (created by George Lucas in 1993), Yves Bergquist sits at the center of this revolution. He and his team are developing next-generation AI tools and methods that are being deployed throughout the entertainment industry. Because his research is funded by all six Hollywood studios, and he personally answers to the CTOs of those studios, Yves has unique and powerful insight into how Hollywood is quietly using machine intelligence to take its hit-making game to the next level. What the audience will learn: the audience will go behind the scenes to discover how precisely Hollywood studios are using data, analytics and AI to make better development, production and distribution decisions. Yves will draw from his and his team's research and use case studies to lift the veil on how AI, game theory, and neuroscience are transforming audience intelligence, film development, and distribution strategies.

Application of Data Science in Specialty Retail

by Manas Bhat, Director, Finance and Strategy, Guitar Center

Guitar Center is the largest retailer of musical instruments in the US, with over 275 stores and an annual revenue of over $2.1B. My talk will focus on how data is leveraged here to drive optimal decision making. We have a data warehouse that collects information on transactions, traffic, products, inventory, customers and much more from stores and online. During the talk, I will walk through various insights that we have derived by analyzing the data. In one project, we blended Experian-provided household-level data with our internal data to build customer profiles of purchasers of high-end guitars. We used this information with drive-time analysis to pick stores in which to build Platinum rooms exclusively dedicated to high-end guitars. I will also run through how we use extensive experimentation to test strategies before a chain-wide rollout. We pick a few stores to pilot, use KNN to select similar stores, and perform a pre/post analysis to evaluate lift and its statistical significance.
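
The "use KNN to select similar stores" step might look roughly like the scikit-learn sketch below; the feature matrix, feature columns, and store identifiers are hypothetical placeholders, not Guitar Center data.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from sklearn.preprocessing import StandardScaler

    # Hypothetical store feature matrix: rows = stores; columns = traffic, revenue mix, demographics...
    rng = np.random.RandomState(0)
    store_features = rng.rand(275, 12)
    pilot_store_ids = [3, 57, 120]          # stores chosen for the pilot

    # Standardize features so no single metric dominates the distance
    X = StandardScaler().fit_transform(store_features)

    # For each pilot store, find the 5 most similar stores to serve as controls
    nn = NearestNeighbors(n_neighbors=6).fit(X)
    _, idx = nn.kneighbors(X[pilot_store_ids])
    control_ids = idx[:, 1:]                # drop column 0 (each pilot store matches itself)
    print(control_ids)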

The Netflix data platform: Now and in the future

by Kurt Brown, Director, Data Platform, Netflix

The Netflix data platform is constantly evolving, but at its core, it's an all-cloud platform at a massive scale (60+ PB and over 700 billion new events per day), focused on enabling developers. In this talk, we'll dive into the current (data) technology landscape at Netflix, as well as what's in the works. We'll cover key technologies, such as Spark, Presto, Docker, and Jupyter, along with many broader data ecosystem facets (metadata, insights into jobs run, visualizing big data, etc.). Beyond just tech, we'll also dive a bit into our data platform philosophy. You'll leave with insights into how things work at Netflix, along with some ideas for re-envisioning your data platform.

Engineering a Flexible Recommender System for The L.A. Times

by Matt Chapman, Senior Data Engineer, Tronc, Inc.

A walkthrough of the architecture designed at Tronc for A/B testing multiple algorithms that deliver personalized content recommendations to 60 million readers a month.

Spark, ElasticSearch, and Murmur3 Hash

by Brian Kursar, Vice President, Data Intelligence, Warner Bros.

How Warner Bros. is using Elastic to solve entertainment and media problems at scale. Warner Bros. processes billions of records each day globally across its web assets, digital content distribution, OTT streaming services, online and mobile games, technical operations, anti-piracy programs, social media, and retail point-of-sale transactions. Despite having large MPP clusters, a significant amount of dark data remained trapped in web logs. In this presentation, we will discuss how Warner Bros. leveraged the new Elastic 5 stack, coupled with Apache Spark, to deliver scalable insights and new capabilities to support business needs.
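
As a hedged illustration of the kind of pipeline described (not Warner Bros.' implementation), indexing web log records from Spark into Elasticsearch 5 via the elasticsearch-hadoop connector might look roughly like this; the bucket, index and field names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumes the elasticsearch-spark connector is on the classpath, e.g.
# spark-submit --packages org.elasticsearch:elasticsearch-spark-20_2.11:5.6.0
spark = SparkSession.builder.appName("weblogs-to-es").getOrCreate()

# Hypothetical raw web logs landed in cloud storage.
logs = spark.read.json("s3://example-bucket/weblogs/2017/08/05/")

# Light enrichment before indexing: cast the timestamp and keep a few fields.
enriched = (logs
            .withColumn("event_time", F.col("timestamp").cast("timestamp"))
            .select("event_time", "user_id", "url", "status", "user_agent"))

# Write to Elasticsearch; documents are routed to shards by a murmur3 hash
# of the document id, which is what keeps a large index balanced.
(enriched.write
 .format("org.elasticsearch.spark.sql")
 .option("es.nodes", "es-cluster.example.com")
 .option("es.port", "9200")
 .option("es.resource", "weblogs-2017.08.05/events")
 .mode("append")
 .save())
```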

/ AI/ Machine Learning

Disrupting Corporates with AI

by Hicham Mhanna, Vice President, Engineering, BCG Digital Ventures

A discussion of applied artificial intelligence solutions stemming from our venture work across verticals, including consumer, financial services, and media.

Automating Legal Fulfillment with SparkML

by Brendan Herger, Machine Learning Engineer, Capital One

Capital One receives thousands of legal requests every year, often as physical mail. During this talk, we'll dive into how the Center for Machine Learning at Capital One has built a self-contained platform for summarizing, filtering and triaging these legal documents, utilizing Apache projects.
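
A minimal sketch of how Spark ML can be used for this kind of document triage (an assumption-laden example, not Capital One's platform; the labels, text and column names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("legal-triage").getOrCreate()

# Hypothetical training data: OCR'd document text plus a manually assigned
# request-type label (e.g. 0 = subpoena, 1 = garnishment, 2 = other).
train = spark.createDataFrame(
    [("please produce records for account holder jane doe", 0),
     ("wage garnishment order for employee john smith", 1),
     ("general correspondence regarding branch hours", 2)],
    ["text", "label"])

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="tokens"),
    StopWordsRemover(inputCol="tokens", outputCol="filtered"),
    HashingTF(inputCol="filtered", outputCol="tf", numFeatures=1 << 18),
    IDF(inputCol="tf", outputCol="features"),
    LogisticRegression(maxIter=20),
])

model = pipeline.fit(train)

# Score incoming mail; predictions can then route documents to the right queue.
model.transform(train).select("text", "prediction").show(truncate=40)
```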

A Practical Use of Artificial Intelligence in the Fight Against Cancer

by Brian Dolan, Founder & Chief Scientist, Deep 6 AI

Artificial intelligence is an important topic in the fight against cancer, and clinical trials are at the frontier of innovation. I will discuss the techniques, data sets and platforms we use at Deep 6 to bring patients to clinical trials. The focus will be on practical, repeatable methods I've developed at MySpace, Greenplum, UCLA and the US Intelligence Community.

How AI Is Transforming B2B Sales & Marketing

by Olin Hyde, Chief Executive Officer & Founder, LeadCrunch.ai

Artificial intelligence blurs the lines between sales and marketing by enabling humans to leverage the power of big data to command and control every customer's journey. A new category of technology called "intelligent demand generation" enables marketers to explain and predict buyer behavior with unprecedented precision and speed. These capabilities reframe how companies go to market by enabling microtargeting of customers with context-specific content marketing. Early adopters of intelligent demand generation technologies are realizing more than 500% return on investment within two months. Olin Hyde, a three-time AI startup founder and CEO of LeadCrunch, describes how his company developed military targeting technology and then modified it to make commercial sales teams more efficient.

Generalized B2B Machine Learning

by Andrew Waage, Co-Founder, Retention Science

In this talk, we propose a generalized machine learning framework for e-commerce businesses. The framework is responsible for over 30 different user-level predictions, including lifetime value, recommendations, churn, engagement and lead scoring. These predictions provide a vital layer of intelligence for a digital marketer. Kinesis is used to capture browsing information from over 120M users across 100 companies (both in-app and web). A data processing and feature engineering layer is built on Apache Spark. These features provide the inputs to predictive models for business applications. Separate models for churn, lifetime value, product recommendations and search are written on Spark. These models can be plugged into any marketing campaign for any integrated e-commerce company, leading to a generalized system. We finally present a monitoring system for machine learning called RS Sauron. This system provides more than 200 objective metrics measuring the health of predictive models, and tracks KPIs for model accuracy on a continual basis.
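
To make the feature-engineering layer concrete, here is a small, hypothetical sketch (not Retention Science's code) of rolling raw event data up into user-level features on Spark that could feed churn and lifetime-value models; the paths and column names are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("user-features").getOrCreate()

# Hypothetical clickstream/purchase events captured via Kinesis and landed as
# JSON: one row per event with user_id, session_id, event_type, amount, event_date.
events = spark.read.json("s3://example-bucket/events/")

# Roll events up to one row per user: recency / frequency / monetary style
# features plus engagement counts.
features = (events.groupBy("user_id").agg(
    F.max("event_date").alias("last_seen"),
    F.count("*").alias("n_events"),
    F.countDistinct("session_id").alias("n_sessions"),
    F.sum(F.when(F.col("event_type") == "purchase", F.col("amount"))
          .otherwise(0.0)).alias("total_spend"),
))

# These user-level features become the inputs to the separate Spark models
# for churn, lifetime value, recommendations, and so on.
features.write.mode("overwrite").parquet("s3://example-bucket/features/user/")
```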

Building Autopilots for Business: Leveraging Flight Science to create new Data Science Frameworks

by Emad Hasan, Chief Operating Officer & Co-Founder, Retina.ai

We will discuss how leveraging control system theory, which led to the advent of flight control systems, is ushering in a new data-science framework for bringing automated insights to the enterprise. As organizations grow, managers start relying on data from several sources to make decisions concerning the organization and its customers. Organizations with a growing gap between desired results and actual results need a control system to better manage and predict results using their existing data. Control theory is typically applied in two forms: (1) the Goal-Seeking Model, for example an autopilot guiding an aircraft from A to B; we will show how an organization looking to increase profits can be modeled as a goal-seeking organization. (2) The Disturbance-Rejection Model, sometimes seen in temperature control systems in buildings; in the enterprise, an organization seeking to minimize costs can be modeled using the disturbance-rejection model. Both of these concepts are based on sound scientific principles that can be used to model and control businesses. The talk will include two real-world cases of how Emad Hasan translated these ideas into data science applications at Facebook and PayPal, which powered executive decision-making, as well as how they can now be used by data scientists and managers around the world to improve insights and cues for business executives.
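
As a toy illustration of the goal-seeking idea only (a conceptual sketch, not Retina.ai's framework; all numbers and the response model are made up), a proportional feedback loop nudging a control input toward a business target might look like:

```python
# Toy goal-seeking loop: steer weekly marketing spend so that observed profit
# converges on a target, in the spirit of an autopilot holding a heading.

target_profit = 60_000.0    # desired weekly profit (the "destination")
spend = 15_000.0            # current control input (weekly marketing spend)
gain = 0.5                  # proportional gain: how aggressively we correct

def observed_profit(spend: float) -> float:
    """Hypothetical stand-in for the measured business response to spend."""
    return 4.0 * spend - 0.00005 * spend ** 2 - 5_000.0

for week in range(1, 11):
    profit = observed_profit(spend)
    error = target_profit - profit   # distance from the goal
    spend += gain * error            # correct the input in proportion to the error
    print(f"week {week}: spend={spend:,.0f} profit={profit:,.0f}")
```

A disturbance-rejection version would instead hold an output (e.g. cost) steady against external shocks rather than steering it toward a new target.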

Using machine learning to optimize marketing ROI at Honest Company

by Roozbeh Davari, Data Scientist, The Honest Company

The Honest Company is one of the leading national e-commerce brands, focused on promoting healthy and happy lives. One core challenge we face is measuring ROI across our numerous marketing acquisition channels and, in turn, optimizing our budget allocation. Luckily, we have large amounts of data, both structured and unstructured, which we use to learn patterns and insights about how we acquire new customers. We leverage this data to build machine learning models for smart customer segmentation, which helps the acquisition team derive maximum ROI out of every dollar spent. In this talk, we will touch on some of our machine learning approaches and how we leverage data to predict attributes like customer lifetime value and churn rates, which are then used to optimize spend allocation.

Big data and health sciences: Machine learning applications in chronic and acute upper respiratory illness

by Huiyu Deng, Research Assistant, USC

Big data has become a hot topic in recent years: it advances our understanding of how data can be exploited and guides decision-making in many sectors, and the health sciences are also being shaped by big data applications. Our study group in the Department of Preventive Medicine at the Keck School of Medicine of the University of Southern California aims to build a big data architecture that combines and analyzes data about people from different sources and provides health-related assessments back to them. Specifically, ecological momentary assessments (EMAs), electronic medical records (EMRs), and real-time air quality monitor data for children with a pre-existing asthma diagnosis are collected and fed into machine learning models, and asthma exacerbation alerts are generated and delivered back to the children before an exacerbation occurs. The machine learning model was built and tested in a similar study. The study population consists of children from a cohort of the prospective, population-based Children's Health Study, followed from 2003-2012 in 13 Southern California communities. Potential risk factors were grouped into five broad categories: sociodemographic factors, indoor/home exposures, traffic/air pollution exposures, symptoms/medication use, and asthma/allergy status. The outcome of interest, assessed via annual questionnaire, was the presence of bronchitic symptoms over the prior 12 months. A gradient boosting model (GBM) was trained on data consisting of one observation per participant in a random study year, for a randomly selected half of the study participants. The model was validated using hold-out test data obtained in two complementary ways: (within-participant) a random later year for the same participants, and (across-participant) a random year for participants not included in the training data. The predictive ability of risk factor groupings was evaluated using the area under the receiver operating characteristic curve (AUC) and accuracy, and the predictive ability of individual risk factors was evaluated using relative variable importance. The predictor-outcome relationship was visualized using partial dependence plots, and interaction effects were identified using the H-statistic. Gradient boosting offers a novel approach to better understanding predictive factors for chronic upper respiratory illness such as bronchitic symptoms.
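
A compact sketch of the modeling workflow described above, using scikit-learn on synthetic stand-in data (the study's actual data, features and tooling are not shown here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data: in the study, each row would be one observation per participant
# in a random year, with risk factors spanning sociodemographic, indoor/home,
# traffic/air pollution, symptom/medication, and asthma/allergy categories, and
# the outcome = bronchitic symptoms in the prior 12 months.
X, y = make_classification(n_samples=2_000, n_features=25, n_informative=8,
                           weights=[0.8, 0.2], random_state=0)

# Hold-out split standing in for the across-participant validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

gbm = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05,
                                 max_depth=3, random_state=0)
gbm.fit(X_train, y_train)

# Predictive ability on the hold-out data: AUC and accuracy.
proba = gbm.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, proba))
print("accuracy:", accuracy_score(y_test, gbm.predict(X_test)))

# Relative variable importance, plus partial dependence for the top predictor
# (the grid/average values here are what a partial dependence plot would show).
top = np.argsort(gbm.feature_importances_)[::-1][:5]
print("top features by relative importance:", top)
pd_result = partial_dependence(gbm, X_train, features=[int(top[0])])
```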

/ IoT

Data as a Strategic Asset

by Lillian Coral, Chief Data Officer, Office Of LA Mayor Eric Garcetti

The City of Los Angeles, with 4 million residents and nearly 50 million visitors annually moving across 469 square miles, is not only one of the most densely populated cities, it also hosts one of the largest, most complex city infrastructures in the world. Some 6,000 miles of sewer underlie 22,000 miles of paved streets that connect over 4,500 intersections, 50,000 city-connected street lights and 2,000,000 Google/Waze-connected sensors. This network of people and infrastructure is connected through data and the systems that support it. As data transforms from an unstructured asset into the organizational wisdom that can drive this Smart City, the City of Los Angeles and the Office of Mayor Eric Garcetti work to identify new technologies and strategies for managing and harnessing the growing amount of data available to inform decision-making.

Enabling Scalable IOT Applications

by Adam Mollenkopf, Real-Time & Big Data GIS Capability Lead, ESRI

This session will explore how DC/OS and Mesos are being used at Esri to establish a foundational operating environment that enables the consumption of high-velocity IoT data using Apache Kafka, streaming analytics using Apache Spark, high-volume storage and querying of spatiotemporal data using Elasticsearch, and recurring batch analytics using Apache Spark & Metronome. Additionally, Esri will share their experience in making their application for DC/OS portable so that it can easily be deployed across public cloud providers (Microsoft Azure, Amazon EC2), private clouds and on-premises environments. Demonstrations will be performed throughout the presentation to cement these concepts for the attendees. All demos will be available in a public GitHub repo.
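
For readers unfamiliar with the Kafka-to-Spark leg of such a pipeline, here is a rough, hypothetical sketch using Spark Structured Streaming (not Esri's code; topic, broker and field names are assumptions, and the spark-sql-kafka package must be on the classpath):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("iot-stream").getOrCreate()

# Hypothetical schema for GPS/sensor observations arriving on a Kafka topic.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("lat", DoubleType()),
    StructField("lon", DoubleType()),
    StructField("observed_at", TimestampType()),
])

# Consume high-velocity IoT events from Kafka.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka-1:9092")
       .option("subscribe", "vehicle-positions")
       .load())

# Kafka delivers key/value bytes; parse the JSON payload into typed columns.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", schema).alias("e"))
          .select("e.*"))

# A simple streaming analytic: observations per device per one-minute window.
counts = (events
          .withWatermark("observed_at", "5 minutes")
          .groupBy(F.window("observed_at", "1 minute"), "device_id")
          .count())

# In a real deployment the sink could be Elasticsearch for spatiotemporal
# queries; here we simply print to the console.
query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```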

Waze Carpool: A Little Selfless and a Little Selfish

by Eric Ruiz, LatAm Marketing Manager, Google Waze

Waze Carpool is the evolution of the Waze mission. If at first we wanted to help you save time by finding the fastest route to your destination, now we want to help you avoid traffic by eliminating it altogether. Traffic is a simple problem: there are too many cars on the road with too many empty seats.

Panel - Using IOT to Drive Productivity

by Stuart McCormick, Americas Digital Services Leader, Honeywell - Moderator; John Sullivan, Director of Innovation, SAP NA Center of Excellence, SAP - Panelist; Suresh Paulraj, Cloud Data Solution Manager/Architect, Microsoft - Panelist

Real Time Processing Using Twitter Heron

by Karthik Ramasamy, Co-Founder, Streamlio

Today's enterprises are not only producing data in high volume but also at high velocity, and with velocity comes the need to process the data in real time. To meet those real-time needs, we developed and deployed Heron, the next-generation streaming engine at Twitter. Heron processes billions and billions of events per day at Twitter and has been in production for nearly 3 years. Heron provides unparalleled performance at large scale and has been successfully meeting Twitter's strict performance requirements for various streaming and IoT applications. Heron is an open source project with several major contributors from various institutions. As the project matured, we identified and implemented several optimizations that improved throughput by an additional 5x and further reduced latency by 50-60%. In this talk, we will describe Heron in detail and show how detailed profiling pointed to performance bottlenecks such as repeated serialization/deserialization and immutable data structures. After mitigating these costs, we were able to achieve much higher throughput and latencies as low as 12ms.

Audio Beacons - The InAudible Bridge from Big Data and Content to Mobile Smartphone Consumers

by Tom Webster, CEO, TONE

The Tone Knows is an Internet of Things marketing and advertising vendor and platform. We implement a next-generation, patent-pending Internet of Things advertising technology using audio beacons. These audio beacons, unlike proximity beacons, do not use Bluetooth or Wi-Fi to transmit to the mobile phone, and no hardware is required. Here is a short demonstration of an audio beacon program we did with music artist Ariana Grande: https://www.youtube.com/watch?v=rP36bCuA4kM In the demonstration, a high-frequency tone is embedded inside the music video playing on the laptop. When the audio tone goes off, the mobile phone (which has opted in and has its microphone turned on) receives the tone and is sent a hyperlink to e-commerce, a promotion, a contest, or anywhere else we direct it. This is a way to connect TV and TV advertising, radio and radio advertising, and YouTube video and banner advertising from content to mobile smartphone consumers.

IOT: The Evolving World of Realtime BigData

by Jerry Power, Executive Director, USC Marshall CTM

IoT technology will allow big data structures to evolve from static, offline repositories of digital knowledge into online representations of our current world. IoT will allow the techniques used with big data to identify trends and forecast the future to become operationally enabled data structures that let us manage our digital environment for maximal advantage. The road to this reality has several hurdles that must first be overcome, among them trust, privacy, discovery, and behavioral economics. These issues will be discussed in the context of a large city operations network, and potential options for overcoming these hurdles will be offered.

The Infrastructure under IOT networks

by Ward Bullard, Head of Product, Venues, Verizon

Big data will evolve from being an offline analytic process to one where the analytics have to run in real time as data arrives. That means the network has to provide complete connectivity, because many IoT devices are mobile and you cannot afford to wait for a device to come into coverage to enable the data to flow. It also means the network has to be reliable, allowing primary and secondary path connectivity. Performance (latency, call blocking, etc.) has to go beyond meeting the needs of current applications in order to meet the needs of an evolved application environment with increased performance requirements. Once 5G deployments begin, we can anticipate exponential performance improvements that will allow the network to transform from supporting thousands of IoT connections in the background of a full-time internet network to supporting billions of connections across a complex and heterogeneous infrastructure. In the end, IoT will serve to make big data live, and service providers like Verizon will provide the infrastructure that allows that to happen.

Organizers / 2017 Organizers

Subash D’Souza

Organizer, Sponsors & Sessions Chair

Organizers / 2017 Committee Leaders

Abraham Elmahrek

Sessions Co-Chair

Arti Annaswamy

Marketing Co-Chair

Frank Solomon

Marketing Co-Chair

Jerry Power

Sessions (IoT) Chair

Rich Ung

Technology Chair

Subash D’Souza

Organizer, Sponsors & Sessions Chair

Szilard Pafka

Sessions (Data Science) Chair

USC Marshall MS In Business Analytics

Location Sponsor

Volunteers / 2017 Volunteers

Abbass Sharif

Academic Director, MS in Business Analytics Program at USC

Abraham Elmahrek

First Employee at FOSSA, Inc.

Alain Mbuku

Member Technology Specialist at WeWork

Aman Mathur

Computer Science/Data Science Graduate Student at USC

Amee Lord

Talent Acquisition Manager at iSpace

Arjun Mahajan

Client Principal at Bitwise Inc

Arnold Borres

Sr Systems Analyst at UST Global

Arti Annaswamy

Manager, Global Ops at Warner Bros

Asha Dasi

Senior ETL Engineer at UC Irvine

Austin Clements

Senior Associate at TenOneTen Ventures

Bob Newstadt

Analytics and Tableau Expert at Bob Newstadt Consulting

Brandon Brooks

Technical Recruiter at Teradata

Charalampos (Harry) Papadimitriou

Data Scientist at DataScience.com

Chulhee Lee

Student, MA Applied Math at CSU, Fullerton

Clarissa Pinto Ribeiro

Technical Recruiter at Crescent Solutions

Diane Craig

System Business Analyst

Eric Lui

VP, Engineering at Second Spectrum

Frank Solomon

Chief Blogger at bitvectors.blogspot.com

Gina Escobar

Data Science Intern at Valor Water Analytics, Inc.

Jason Brancazio

Software Engineer, Data Services at Red Bull Media House

Jason Flittner

Analytics Manager at Netflix

Jerry Power

Executive Director, CTM at USC

Jerry Tsai

Data Scientist

Jimmy Kim

Enterprise Account Executive at Spectrum Enterprise

John Kim

VP, Operations at Captive Eight

Justin Javier

ACO Specialist at UCLA Health

Kaloyan Todorov

AVP, Credit Risk at Bank of America

Marc Wadell

Owner at TM Artists

Matti Siltanen

Dir Operations/IT | Sales Engineer

Michael Chiang

Executive Director at Crescent Solutions

Oszie Tarula

Programmer / Analyst III (Lead Web UI/UX Developer) at UCLA

Ravin Kumar

Supply Chain Systems Engineer at SpaceX

Rich Ung

Data Engineer at Disney ABC Television Group

Ruben Barrios

Datawarehouse Analyst at Tigo Guatemala

Saritha Ivaturi

Director of Data Systems at Dollar Shave Club

Spencer Huang

Director, Strategic Accounts at Qubole

Subash D’Souza

Director of Big Data at Warner Bros

We are currently accepting volunteers for BDDLA 2017.

Big Data Day LA is a fully volunteer-supported and organized event, and we have a proud history of volunteers who have helped organize the event in past years and stayed on as alumni volunteers, friends, cheerleaders, and mentors. As a volunteer, you will have the opportunity to help organize an event with 30+ sessions, speakers and sponsors from some of the biggest and best data companies in the country, and 1,500+ attendees expected this year.

You can participate wherever your interests fit best – we are accepting volunteers in the following teams:

– Marketing
– Technology
– Track / Sessions
– Location
– Food / Beverage
– Registration

Complete the form below to sign up to be a volunteer. We look forward to welcoming you to the team!