Channel: David Dietrich – InFocus Blog | Dell EMC Services

Why Teach Data Science?


Recently, I’ve been asked why we developed a Data Science training and certification program, so I wanted to share a few thoughts about this.

Over the past year, the media has simultaneously highlighted the value of mining big data, and the lack of analytical talent for exploiting its potential.  We’ve certainly seen a huge growth in interest about Big Data, and everyone seems to be talking about it, from IDC studies  to McKinsey and many others including Forbes and the Wall Street Journal.  In fact, I’ve even read excerpts from comedians making jokes about big data – so now I know it’s really gone mainstream.

Along with the discussion about the growth of Big Data is a discussion about the need for skilled professionals to cultivate and transform this big data into something useful that drives new value.  These people are what McKinsey calls “deep analytical talent.”  These individuals are referred to by different titles, but “data scientist” seems to be sticking the most: people who are able to really dig into large amounts of data, with varying structures, and draw insights from it that drive new business value.

It feels a little bit like the dot com era all over again, in the sense that I see new companies popping up every single week coming up with some new creative way to monetize big data or offer some niche service related to analytics.  I’ve seen this phenomenon especially occur around the analytics space, and I see a lot of organizations, both at the very small end with startups, and at the very high end with EMC, IBM and others, getting in the game.  The difference is that each is taking a different approach.

IBM is taking a product-based approach to analytics, as seen from their Watson product and division.  What EMC is doing is helping to define a new discipline, a new profession, the same way we did 10 or 20 years ago with storage administrators or data center architects.  Before that time, technical people tended to be grouped under the broad heading of “IT.”  So now what we’re doing is furthering this Data Science role and formalizing it, and helping to drive its growth in the industry.  But instead of trying to teach someone how to operate or manage a storage device, we’re teaching them how to draw insight from disparate pieces of information.

Realizing that the demand for Data Scientists far exceeds the supply at this stage, EMC felt a call to action.  Here is a graphic from indeed.com showing the number of job postings containing the words “Data Scientist,” which is rising sharply, although the supply of Data Scientists remains low.

[Graphic: indeed.com job postings containing “Data Scientist”]

In pursuing this track, EMC is the first to offer “open,” vendor-neutral education programs for data science to educate not just the industry, but also to help shape what is becoming an entirely new profession.

If you want to learn more about EMC data science training, I invite you to attend my session on “Disruptive Data Science” at EMC World, where I will discuss how to build a data science team and will share examples of organizations that have done this successfully.

In my future posts for InFocus, I’ll talk more about the varying backgrounds and experiences of people trying to make the transition to data scientists.


What is the Profile of a Data Scientist?


At the recent EMC World event in Las Vegas, I presented a breakout session entitled “Disruptive Data Science,” which focused on how people are making the move to becoming Data Scientists.

As I discussed in the session, on a basic level, Data Scientists possess and exhibit the five main qualities shown in my graphic below.  Among the people I’ve spoken with, this profile of a Data Scientist really seems to be resonating. 

[Graphic: five main qualities of a Data Scientist]

In addition to possessing the “hard skills” of being quantitative and technically focused, Data Scientists also have versatile communication and collaboration skills, and an innate curiosity for exploring and experimenting with data.  They also tend to be skeptical people, in that they are likely to ask a lot of questions around the viability of a given solution and whether it will really work.  These behavioral traits are what separate someone who can work with others to use data to drive change, from someone merely building an interesting algorithm in their basement.

After attending the recent Data Science Summit in Las Vegas, I would say that these traits described many of the participants.  The range of speakers was striking, spanning leading academic institutions that have thought leadership in Data Science (such as UC Berkeley @joehellerstein; Rice @hadleywickham; Stanford @jure; and Columbia @chrishwiggins), Data Scientists who have driven innovation in sports (@tarekkamil) and health care (@johnbrownstein), and others who are masters at telling stories with data (@jjhnumber27 and his site http://www.wefeelfine.org/).  These people exemplify how these skills and behavioral traits together have the power to drive change and innovation across many disciplines.

In a brief EMCTV segment, I talked about my experience developing EMC’s Data Science course, the typical traits of Data Scientists, and people who seem to be coming out of the woodwork to change their careers and realign themselves toward careers in data science.

In my next post, I’ll highlight 2-3 typical profiles of people who are making the transition toward data science and steps they are taking to get there.  Do you agree with this list of Data Scientist traits, or see other additional characteristics as critical? Let me know in the comments section.

 

How Do You Make the Move From Business Intelligence to Data Scientist?


In my last blog post, I described five main characteristics of Data Scientists: quantitative, technical, skeptical, communicative, and curious.  

Although we have had people of many different backgrounds attend EMC’s Data Science and Big Data Analytics course over the past few months, the attendees tend to fall into four main categories.  These aspiring data scientists can most simply be described by their place along the two axes of quantitative skills and technical ability:

[Graphic: skills matrix of quantitative skills vs. technical ability]

Generally, when I teach people how to design graphics and dashboards, I convey the ideas that Stephen Few espouses in his book “Information Dashboard Design”: namely, that you have to lay out visual information the way people actually scan it.  In Western culture, people read from left to right and from top to bottom.  Therefore, they will view the upper left quadrant of a page of information or an information dashboard first (and pre-attentively), and the lower right quadrant last.  As such, people are taught to organize their information this way too, with their most important points in the upper left and the least important points in the lower right.

However, I’m going to violate the rule for the purpose of this narrative and begin by discussing the group in the lower right quadrant of this matrix. 


This group, with strong technical ability and some quantitative skills, generally describes Business Intelligence professionals, IT specialists, and data warehousing professionals.  This group tends to possess strong skills with SQL and even with manipulating large-scale databases, but their challenge to becoming Data Scientists is their need to further develop their quantitative skills.  This requires adding more rigor in mathematics and statistics, and extending their comfort zone from Business Intelligence and data warehouse tools to more math-oriented and statistical tools, such as R, SAS, SPSS, MATLAB, and Weka.

As a student in one of my recent Data Science & Big Data Analytics classes told me, “When I learned programming, I was taught to solve specific problems.  Therefore, I learned to write code in a given structure…first write chunk A, then build on that piece by creating part B, then C.  I’m realizing after this week that Data Science is different.  Many times you may begin a Data Science project without such a clear picture of the end goal; it’s much more exploratory.”  This student articulated well the change in mindset necessary for someone in this quadrant to think more like a Data Scientist: embrace more quantitative training to supplement the skill set, and view projects in a more exploratory way.
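To make that contrast concrete, here is a minimal sketch of an exploratory first pass in Python (the dataset is synthetic and stands in for whatever a real project would load from a file or database): rather than coding toward a fixed deliverable, you simply characterize the data to decide which questions are worth asking.

```python
import random
import statistics

# Synthetic stand-in for a real dataset (e.g., daily order values);
# in practice this would come from a file or database.
random.seed(42)
orders = [random.gauss(100, 20) for _ in range(1000)]

# Exploratory pass: no fixed end goal yet -- just summarize the
# data's shape before deciding what to model.
summary = {
    "n": len(orders),
    "mean": statistics.mean(orders),
    "stdev": statistics.stdev(orders),
    "min": min(orders),
    "max": max(orders),
}

for key, value in summary.items():
    print(key, round(value, 2) if isinstance(value, float) else value)
```

Only after a pass like this would you decide whether the next step is a model, a visualization, or more data collection, which is exactly the open-ended workflow the student described.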

Although the group that fits in the lower right quadrant has strong technical aptitude, for whatever reason, the respondents in EMC’s 2011 survey on Data Scientists do not appear hopeful that today’s Business Intelligence professionals will step up and make the transition effectively to become the next generation of Data Scientists. EMC’s Chuck Hollis wrote an insightful blog post on this subject.  From my experience, a lot depends on the drive of the individual and how self-motivated they are to learn new techniques and push themselves.  As the Mass Technology Leadership Council (MassTLC) states in its paper on Big Data skills in Massachusetts, we face a shortage of Data Scientists and also a shortage of users and developers of Big Data applications.  I believe that Business Intelligence professionals have growth opportunities in both of these areas.  Even if they opt not to expand their quantitative skills and enter the realm of Data Scientists, they could still make significant contributions to this area by developing new Big Data tools.  They could also carve out a niche as users of sophisticated analytical tools and applications being developed which provide a simpler, better experience to most business users.  These are people who may be tasked with analyzing Big Data problems, but who may not be actual Data Scientists. 

Consider applications such as Splunk, Socialcast, Datameer, and others that serve to bridge the gap and allow savvy business users to analyze unstructured social media and machine data.  These tools enable users to analyze messy, complex data without requiring the end user to be a full-fledged Data Scientist.  I speculate that many Business Intelligence people could carve out a niche by contributing to this growing area.

In my next InFocus post, I’ll talk about the next quadrant in the skills matrix.

Moving from STEM Grad to Data Scientist


In my last InFocus post, I described how Business Intelligence professionals and other technical people are taking steps to become Data Scientists. In this post, I’d like to discuss the next quadrant in the Skills Matrix:  people with a limited or moderate background in technology and/or quantitative skills who would like to consider a path toward becoming a Data Scientist (quadrant “B” below).

[Graphic: skills matrix, quadrant “B”]

Based on the Data Science and Big Data Analytics classes we’ve held so far, I would say this quadrant tends to describe recent STEM graduates.  To be fair, some young people completing a degree in science, technology, engineering or math may have rich academic backgrounds in quantitative disciplines and may actually belong more properly in quadrant “A” or “C” depending on their area of specialization. However, I’m using this matrix as a tool to broadly categorize skills, and also to reflect that while these recent grads I am placing in this quadrant may have good foundational knowledge, they do not yet have much work experience to cultivate their skills.

The group in this quadrant tends to be eager to learn.  They strike me as up to the challenge of learning about data science, and from their recent experiences at school, are used to pushing themselves hard.   I recently told someone who I would consider to be in this quadrant that in order to continue his path to data scientist, he would need to practice analytics methods, read a lot and experiment.  His answer, which was said with a smile, was “OK, no problem.  I’m a grad student, I’m used to working hard and having no life.” 

Reading difficult texts and trying challenging new projects is similar to this group’s recent school experience, so learning data science skills is simply a continuation of that habit.  Adding to this, I find that people in this group have good foundational knowledge, but perhaps have not had much opportunity to apply this knowledge to real-world problems.  Data science provides them with this opportunity.  With some on-the-job experience and interactions with real-life problem sets, those in this group are good candidates for the next generation of data scientists.   

I’m highlighting this quadrant because one of the most challenging aspects I’ve observed in moving people along the path to data scientist is understanding what people need to unlearn to take this path.  For longtime Business Intelligence people or other midcareer professionals, this means changing established habits and ways of thinking, which can be very difficult.  Recent STEM grads, however, do not have this mental baggage or such well-ingrained habits, and are more open minded to taking on new challenges and pushing themselves to learn new things.  As an example, we wanted our recent summer intern to write some scripts in Python so we could automate several processes.   He didn’t know Python, but despite being only 19 years old, he taught himself Python in one week, and the following week wrote the scripts for us.  This open-mindedness and willingness to learn are key attributes for people who want to become Data Scientists, and because the domain area is so vast, I think these traits are even more critical for recent STEM grads.  My counsel to people in this quadrant is:

  • Seek out projects where you can try out your skills. This could be within your company, as part of an internship, in your community on a pro bono basis, such as with DataKind (formerly Data Without Borders), or in a contest, such as Kaggle.com.  Hands-on practice will tell you very quickly where you are strong and where you need to focus on improvement.
  • Consider continuing your education with EMC’s one-week Data Science & Big Data Analytics course, or with one of the free semester-long courses online, at places such as Udacity, Coursera, edX, or Codecademy.

 In my next InFocus post, I’ll talk about the people in quadrant “A” of the Skills Matrix, those with quantitative background, but limited-to-moderate technical skills.

Making the Move from Business Analyst to Data Scientist


In previous InFocus posts, I’ve discussed how Business Intelligence professionals and STEM graduates can make the move toward becoming Data Scientists. Now I’d like to discuss the people who fit in the upper left quadrant of the skills matrix (Quadrant “C” below): how business analysts and quantitative analysts can take steps to become Data Scientists.

[Graphic: skills matrix, quadrant “C”]

In the Data Science and Big Data Analytics classes we’ve held so far, I would say those with the skills/ability mix of this quadrant tend to be Quantitative Analysts, Statisticians, Business Analysts, and Data Analysts.  This group generally has some background in quantitative methods, whether it is statistics, mathematical finance, or other training in quantitative methods. They also have some experience with technology, although this may be more limited.

From my experience, this group generally embraces data science and the teachings from the course.  I believe this is because they are in roles that require them to solve various kinds of data problems, although they may not always have the experience or training in advanced analytical methods to solve these problems.

EMC’s Data Science and Big Data Analytics course introduces them to more advanced analytical methods and technologies, which enable them to make leaps forward.  For instance, people who are pricing analysts may have the responsibility of analyzing discount rates for products, creating dashboards, and looking at trends of pricing degradation over time.   One business analyst who completed the course really took to these new methods from machine learning and statistics, and now has significantly raised the level of his analytics.  With this new training, he is able to use different kinds of regression models to predict pricing for products and product sets, and can determine optimal pricing for specific situations. He can use a much more rigorous approach to establish market pricing, thus increasing his value as an analyst.
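As a rough illustration of the kind of leap described above (the numbers are made up, and a real pricing model would use richer features and a proper statistics library), a simple least-squares regression of unit price against order volume might look like this:

```python
# Hypothetical data: unit price tends to fall as order volume rises.
volumes = [10, 50, 100, 200, 500]        # units per order
prices  = [95.0, 90.0, 84.0, 76.0, 55.0] # realized unit price

n = len(volumes)
mean_x = sum(volumes) / n
mean_y = sum(prices) / n

# Ordinary least-squares slope and intercept.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(volumes, prices)) \
        / sum((x - mean_x) ** 2 for x in volumes)
intercept = mean_y - slope * mean_x

def predict_price(volume):
    """Predicted unit price at a given order volume."""
    return intercept + slope * volume

print(round(predict_price(300), 2))  # predicted unit price at 300 units
```

Even this toy model supports the kind of question the analyst could not answer before: given a proposed volume, what price does the historical discount pattern imply?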

Ellis Kriesberg, Manager of Business Operations at EMC, shared his perspective with me after completing the Data Science course: “The course helped me in three key ways. First, it helped me better understand the different data science techniques, and which are best for particular problems.  Second, it made me realize that my interest in applying predictive analytics is shared by many others at EMC and is part of a growing movement. Finally, it provided a very useful methodology for summarizing and presenting the results of data science projects.”  Ellis is a great example of someone who already had strong analytical skills, and was able to leverage his newly acquired data science techniques to build his expertise and apply these methods to his job immediately.

Other people have become more proficient at data cleansing, enabling them to now leverage more data, and “messier” data, than before, as well as use some of the power of open source tools such as R or PostgreSQL to manipulate the data and prepare it for a more rigorous analysis than they may have done in the past.  In general, the profile of a person in Quadrant C is someone used to doing analysis, but who may not have had the tools in their tool set to fully solve some of the problems they are encountering.  Once we give them more tools, I find they readily try them out and are hungry for more.

Some of the people in this quadrant have become strong advocates for the Data Science course, and evangelists of data science and machine learning within their departments.  They are enthusiastic about trying out the new methods they learned, and have ample projects on which to try them.  This becomes a virtuous, reinforcing cycle – they try new methods on problems, they learn from the experience, and they build on their skills when they move onto the next problem set.

For business analysts in this quadrant, learning just a few new methods and a bit of technology enables them to be much more proficient with the day-to-day problem sets that they face.  I recommend that people in this quadrant try some light programming or a few hands-on introductions to analytics to further their skills.  Nathan Yau’s book “Visualize This” is a good introduction to visualization and storytelling using open source tools, and can serve as a gentle introduction to improving how to portray data, as well as to programming in Python, R, and other languages.  Similarly, “The Art of R Programming” by Norman Matloff provides many hands-on examples for using the R software and programming language to solve analytical problems.

Ellis also suggested the following to continue learning:

[Graphic: Ellis’s suggested learning resources]

In my next InFocus post, I’ll discuss the people in Quadrant “D” of the Skills Matrix, those with strong quantitative and technical backgrounds.

 

A Data Scientist View of the World, or, the World is Your Petri Dish


 In an earlier InFocus post, I discussed five attributes of Data Scientists.

In developing the EMC Data Science & Big Data Analytics course, we collaborated with the Greenplum Data Scientist team.  One member of that team, Kaushik Das, likened a Data Scientist to a sculptor, in that master sculptors see the world differently than most people.  Where most people would just see a block of marble, a master sculptor can see a statue hiding within the raw material, and views their job as chiseling away the exterior and the pieces of marble to reveal the work of art.


Likewise, Data Scientists have the ability to see hidden possibilities.  Where most people look at data and see unrelated information, it is the job of the Data Scientist to look for the insights lurking within the data.  Like the example of chiseling away excess marble to reveal art, data must sometimes be reshaped, cleaned, or formatted in the right ways in order to produce unexpected insights.

 More and more, I’m encountering researchers and Data Scientists who are not just doing data science projects in controlled ways, but are using the world around them as a sandbox in which to experiment with data and test ideas on a large scale.   Recently, I had the privilege of attending an MIT Computer Science & Artificial Intelligence Lab (CSAIL) lecture as a result of EMC’s participation in the bigdata@csail initiative.   The lecture was delivered by Jeffrey Dean and Sanjay Ghemawat, who together developed the MapReduce computational framework, and are co-designers and co-implementers of heavily used distributed storage systems, including Bigtable and Spanner at Google.

Dean spoke of using MapReduce, not in a controlled environment within a small IT shop, but rather on a very large scale, on publicly available data.  One example Dean discussed was using MapReduce to tag and classify sets of images.   Doing data experiments on the world around them, the team is tapping into large scale, publicly available data stores to test their algorithms.  Most of the work they are doing in R&D is about finding ways to develop algorithms that can mimic human thinking, but with data.   For example, to test out an algorithm to classify images, they performed unsupervised learning, trying to find hidden structures in unlabeled data, on one frame from each of 10 million YouTube videos.   Rather than create a training set to train an algorithm (a more traditional supervised learning method in this case), they conducted this experiment at a large scale, to test how neural networks would work to classify video frames as images on their own.  The resulting classifier was much, much better than most other existing methods, when they used ImageNet, which contains 16 million images in 21,000 categories.  This is an example of testing robust algorithms in the wild, and using the world as a petri dish in which to test hypotheses.   Another great example is the project EMC is sponsoring, called The Human Face of Big Data.     
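To make the idea of finding hidden structure in unlabeled data concrete, here is a toy sketch, nothing like the scale described above and with synthetic data: a tiny k-means clustering (k=2) that recovers two hidden groups without ever being told the labels.

```python
import random

random.seed(0)

# Unlabeled 1-D data drawn from two hidden groups -- the algorithm is
# never told which point came from which group.
data = [random.gauss(0, 1) for _ in range(100)] + \
       [random.gauss(10, 1) for _ in range(100)]

# Tiny k-means: alternate assigning points to the nearest centroid
# and recomputing centroids until they stop moving.
centroids = [min(data), max(data)]
for _ in range(20):
    clusters = [[], []]
    for x in data:
        idx = 0 if abs(x - centroids[0]) < abs(x - centroids[1]) else 1
        clusters[idx].append(x)
    new_centroids = [sum(c) / len(c) for c in clusters]
    if new_centroids == centroids:
        break
    centroids = new_centroids

print([round(c, 1) for c in sorted(centroids)])  # recovers the two hidden means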

Rick Smolan, a former Time, Life, and National Geographic photographer, has authored numerous books, and is perhaps best known for his “Day in the Life” book series.  He is spearheading The Human Face of Big Data project, which is designed to demonstrate “how real-time sensing and visualization of data has the potential to change every aspect of life on earth. It may represent one of the most powerful toolsets humanity has ever created in addressing some of our biggest challenges.”

This is part of the Quantified Self movement, in which the intent is to show how Big Data is touching our lives and those of our families.  As part of this, Smolan encouraged people worldwide to download an app that, for one week, turned their smartphones into sensors to record and share data.  In other words, the apps on the smartphones anonymously tracked and shared users’ habits and preferences, and compared them to others around the world.  These images are taken from Rick Smolan’s video of the project.

Here is a list of the industries the project touched, covering most major areas of people’s lives:

[Graphic: industries touched by The Human Face of Big Data]

 One of the great things about Smolan’s project is that he highlights some of the success stories and the benefits of becoming a data-driven society.  For instance, he tells the story of people who developed cheap early warning systems, based on sensors in a common laptop computer, to sense earthquakes in Japan.  As a result, one minute before the 2011 Japan earthquake hit, all of the bullet trains and transport systems were halted, which prevented further casualties. 


Another example cited by Smolan analyzed crime data in New York City.  Rather than plot data on a map, someone analyzed all the data and found the home addresses of convicts before they went to jail.  They then used this information to target locations for career counseling services and crime prevention and education programs.

One final example is a Prius dashboard.   Not only does it track gas mileage, but it provides feedback to the driver, who then adjusts their driving habits to be more fuel efficient based on the feedback.

These are common-sense examples, but they show how Big Data gives us the ability to analyze everyday problems in innovative ways.  I would encourage you to consider what you do each day that you could optimize by having more data, or at least try to think a little bit like a Data Scientist and test ideas more quantitatively.   For further reading:

  • If you want to learn how to reshape data, check out Professor Hadley Wickham’s paper, “Tidy Data,” which teaches people straightforward techniques for reshaping data in the R programming language, and uses his very popular libraries for this, such as Reshape.
  • Tom Davenport & DJ Patil recently published an excellent article about Data Scientists in Harvard Business Review, “Data Scientist: The Sexiest Job of the 21st Century,” highlighting Data Scientist skills, the future of this profession, and how universities and companies such as EMC fit into this mix.
  • For more examples of work that Data Scientists are doing, see these videos from the May 2012 Data Science Summit.
  • To learn more about the Human Face of Big Data project, see this short video by Rick Smolan.
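Wickham’s “Tidy Data” techniques are written for R, but the core reshaping idea translates to any language. As a minimal sketch in Python (the table and its values are hypothetical), converting a “wide” table to tidy “long” form looks like this:

```python
# Hypothetical "wide" table: one row per city, one column per year.
wide = [
    {"city": "Boston",  "2010": 14, "2011": 17},
    {"city": "Seattle", "2010": 11, "2011": 13},
]

# Tidy "long" form: one row per (city, year) observation, so each
# variable is a column and each observation is a row.
tidy = [
    {"city": row["city"], "year": year, "value": row[year]}
    for row in wide
    for year in ("2010", "2011")
]

for record in tidy:
    print(record)
```

Once data is in this long form, grouping, filtering, and plotting by any variable becomes mechanical, which is exactly why the reshape step is worth learning.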

 

Top 10 Observations from Strata & Hadoop World


Recently, I was in New York for the 2012 Strata & Hadoop World conference and the New York Data Week.   Strata is a somewhat young conference, and this year the conference organizers decided to combine the New York Strata conference with Hadoop World.  I give the conference organizers tremendous credit—they built a great agenda and had many excellent speakers.

As one might guess, there was intense interest in using data at every opportunity. For efficient conference check-ins, laptop kiosks were set up to assign attendees to one of three dedicated lines to print registration badges (I hope they used proper queuing theory for this…). Data visualizations and data art lined the entryways to the exhibit halls, showing everything from animated maps of U.S. wind patterns to radial charts of MLB baseball playoffs over the last century.

As someone who lives and breathes Big Data and data science, there were many, many interesting things to see and hear at Strata.   So, since the event was held in New York, I’ll offer these observations as a Letterman-style Top 10 List:

TOP 10 OBSERVATIONS FROM STRATA 

10.  Demand for Data Scientists remains very strong.   As you can see from these photos of the job boards, job openings are plentiful.

9.  Many big data tech companies are focusing on making it easier for people to interact with Hadoop. Many of these tools offer people the ability to read data from Hive or HDFS, and allow business users to manipulate data visually, instead of from a command line.

8.  A smaller subset of companies are developing tools to provide analytical intelligence on top of the data, and help automate decisions.  To really take advantage of what Big Data has to offer, companies need both Big Data Business Intelligence tools and the ability to do complex analytics at scale.

7.  Greater adoption of Big Data concepts in more industries: Healthcare, government, criminal justice, education, and transportation are just some of the industries leveraging big data.

6.  Big cities are driving innovation by investing in data-driven decision making and crowdsourcing. Many cities (notably Chicago and New York) are hiring Chief Data Officers, and driving change at state and municipal levels for a variety of public services, ranging from police and crime prevention, to fire fighting, and mass transit improvements.

5.  Data transparency via more open APIs and data sharing.  In order to solve large scale analytical problems, many groups are turning to crowdsourcing.  To do this, more groups are promoting data sharing and data availability via APIs.

4.  Increased data transparency is raising data privacy concerns.  As data becomes more widely available and data sharing increases, so do concerns related to the need for data privacy.

3.  Rise of Philosopher Entrepreneurs.   Two millennia ago we had the Philosopher Kings, now we are seeing more Philosopher Entrepreneurs, who are lateral thinkers with creative minds. These people are good at connecting disparate ideas and finding opportunities to start new companies in the big data space.

2.  The Big Data revolution is still young and we are nowhere near the end.

1.  Don’t try to impress Data Scientists by wearing a suit.  One speaker began his talk like this:  “Hi, I am from a vendor, and after listening to the other speakers, I know all of you in the audience must think I’m the Antichrist.  For some reason I opted to wear a suit today, so you must think I’m a double Antichrist.”  To be clear, at this conference, a “power suit” consists of a black t-shirt or untucked dress shirt with jeans.   Think closer to Sheldon Cooper’s wardrobe than Don Draper’s.

If you are interested in seeing videos of presentations from the conference, many are available on the Strata Conference YouTube Channel. They will expose you to many of the new ideas in the Big Data space and also to many of the active Big Data technology providers.

In my upcoming posts, I’ll expand on several of the concepts mentioned above, and provide more concrete examples.

Digital Universe Study Underscores Skills Gap for Big Data


IDC’s Digital Universe study, “Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East,” monitors the growth of digital data, and shows that data continues to grow, both in raw volume and in rate of growth.  Even more so than a year ago, we are seeing the explosion of digital information, driven by social data, sensor data, and the Internet of Things, accelerate at an ever-increasing pace.

This proliferation of data represents a huge opportunity for change.  As Anthony Goldbloom, the CEO of Kaggle, recently said in Forbes, “I’ll go into a company and say, ‘What data problems can we solve?’ We get blank looks. [When he asks, instead, what things can help a company lose money and make money, usually two out of three are] problems that data can solve.”

The explosion and availability of data demonstrated in the Digital Universe study is not just about having more and more information; it is an opportunity to drive change.  Here are three quick takeaways on this point:

1) Data is growing fast.  The digital universe is growing more quickly than expected and will hit 40 zettabytes – or 40 trillion gigabytes – by 2020.  This growth is heavily driven by the increasingly huge amounts of data produced by the Internet of Things, sensor data, and social media data.  These three things are pervading many, many areas of our business and our lives, as evidenced by the recent Human Face of Big Data project. This has the potential to revolutionize everything from the way we drive cars (think of Progressive Insurance’s real-time sensors that gauge how you drive and give discounts to safe drivers) to healthcare and personalized medicine (the Human Genome Project and crowdsourced medicine), or even education (crowdsourced education with Coursera, Udacity, and others) and government.

2) Big Data = Big Opportunity.  As mentioned in the Digital Universe study, “Massive amounts of useful data are getting lost in the Digital Universe. For instance, only 0.5% of the Digital Universe was analyzed in 2010.” If Big Data is the new gold that people are mining for, this means that most of the data out there is waiting to be mined.  According to the study, only 3% of the potentially useful data is tagged, and even less is analyzed. This is the Big Data Gap: untapped information ready to have its hidden value extracted.  Consider for a minute how this breaks down for 2012 (data shown in exabytes):

This means that nearly 895 exabytes this year were not analyzed, leaving plenty of opportunity for people to find creative ways to extract new value from it.
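As a quick sketch of that arithmetic (the figures below are assumed for illustration, not exact study numbers), even a generous analyzed fraction leaves almost everything untouched:

```python
# Illustrative arithmetic for the Big Data Gap; both inputs are assumptions.
useful_data_2012_eb = 905   # assumed: potentially useful data in 2012, in exabytes
analyzed_fraction = 0.01    # assumed: share of that data actually analyzed

analyzed_eb = useful_data_2012_eb * analyzed_fraction
unanalyzed_eb = useful_data_2012_eb - analyzed_eb

print(f"Analyzed:   {analyzed_eb:.0f} EB")
print(f"Unanalyzed: {unanalyzed_eb:.0f} EB")  # nearly all of it goes unexamined
```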

3)    As the data deluge grows, so does the skills gap.  This year’s Digital Universe study projects that by 2020, the number of servers will grow 10x and information managed by enterprise data centers will grow 14x, yet the number of IT professionals will grow by less than 1.5x, creating a huge technology skills gap.    As Paul Barth showed in the Harvard Business Review article “There’s No Panacea for the Big Data Talent Gap,” it continues to be difficult to find Data Scientists:

 

These findings underscore that the real challenge is not just where to put all of the data, but what to do with it.  Extracting hidden insights out of a massive amount of information requires better tools, powerful infrastructure, and talented people, such as Data Scientists.  In the same issue of HBR, DJ Patil and Tom Davenport discuss how to find and retain Data Scientists, who are a scarce commodity these days.  In fact, as I mentioned in my blog post after the Strata/Hadoop World conference, I believe we are still early in this adoption cycle; as more people wake up to the opportunity provided by Big Data, the demand for Data Scientists will only continue to increase.  Last year at Strata, every speaker ended their talk by saying that if anyone was looking for work as a Data Scientist, they were hiring.  Because of this shortage of available Data Scientists, most companies are instead trying to grow their own data science teams. EMC recognized this emerging trend and created the industry’s first “open” course focused on Data Science & Big Data Analytics to help fill the gap.  It is also critical for corporations to join with academia to fill the gap and skill the next generation of Data Scientists.  Programs like the EMC Academic Alliance, which recently announced more than 1,000 partner colleges and universities to date, are a great example of this.

In our Data Science & Big Data Analytics course (also mentioned in Davenport and Patil’s HBR article referenced above, “Data Scientist:  The Sexiest Job of the 21st Century”), we teach people a structured, lifecycle approach to Big Data projects, how to perform advanced analytics, and how to communicate findings.  My point is that it’s not enough to help people become strong at writing algorithms; they also need to be good at communicating – verbally, in writing, or with data visualizations.  As someone commented in a recent class I taught, “Data science is a team sport.”  It’s not that the need for reporting and a solid process goes away, but you need everyone working together to do the things that need to get done every day.  You also need vision to see the world differently and to find hidden opportunities in data.  This is the real value provided by Data Scientists—they help extract value that enables organizations to do things they never could before.

 In other words, this is not just about the shortage of Data Scientists, which is growing more challenging all the time. It is also about the need to enrich the skills of more traditional IT roles, such as Database Administrators (DBA), Business Intelligence Analysts, and Data Engineers, as well.


How to Make Better Decisions Using Big Data


In past posts, I’ve written about the profile of a Data Scientist, especially the skills needed for people to grow into this new role. I’ve also written about the opportunities that Big Data provides, drawn in part from the Human Face of Big Data book, written by Rick Smolan.

Despite all of the terrific possibilities that Big Data can enable, people ultimately need to realize that the value of the data lies in people and in driving change. About two years ago, Harvard Business Review published an article in which the authors conducted research studies to pinpoint the key drivers of productivity within an organization, including decision-making ability, access to social media, and many other elements. Ultimately, the most valuable productivity driver proved to be access to information. The implication of this finding is that if the people making important decisions have access to more or better information than others, they will make better decisions.

Now fast forward to a more recent HBR article from January/February 2013 in which the authors take a somewhat different perspective: it is not enough to have better information; you also need to act on it.  The reality is that, for many people, data is foreign, and making decisions based on it can be counterintuitive or confusing. That is, people may get a report on certain information, new markets, or flagging products, but choose to ignore the information it provides because they don’t know how to interpret the data or what to do with it.

Recently, I was at Babson College participating in an advisory roundtable discussion about analytics and Big Data.  The advisory group in which I was participating was asked “What are the biggest impediments to adopting analytics or Big Data, and what can be done about it?” A panelist seated to my right, Chief Scientist at a local Big Data company, commented that “People need to learn to trust their data.  Trust the math.”

 

He explained that he has a PhD in aeronautical engineering, and in his former life he had to trust what the data told him, because it had a life-or-death impact. Putting 500 people on a plane at 35,000 feet meant that the data had to be correct, and the people building and controlling the planes had to trust it.

I think this is a critical truth that organizations must embrace in order to make the leap to using Big Data in the confines of their organizations. It’s not enough to just buy a new piece of analytics software, or store much more data than before. Unless they believe data analytics will work and they are willing to try it, I don’t think people have a chance to drive change. And isn’t that what all this is about? If we are collecting mountains of data, but no one is analyzing it, then we are not deriving any new value. If people are analyzing data, but the analysis is not driving any decisions, then it’s a lot of work for nothing. 
 

At a previous employer, I worked as an advisor to banks examining portfolios of mortgages and other types of consumer loans to assess their level of compliance with Federal and State regulatory laws. In case after case, the lenders who fared best were, of course, the ones who did rigorous analysis and took the appropriate actions. Among the rest, it was clear that it was worse to do an analysis and know a problem existed and ignore the data, than it was to never do an analysis at all.  The reality, however, is that it is very difficult to change habits. If people are used to making decisions a certain way, then introducing more data may not influence their decision process. For these reasons, I think it is incumbent on leaders who want to implement Big Data strategies to be open-minded – take pains to identify and challenge operating assumptions. As Charles Duhigg argues in his book “The Power of Habit,” any habit can be changed, but one must be conscious of the habits that drive behavior and decisions. 

 

Duhigg’s book discusses a now-famous example of how the retailer Target used Big Data and advanced analytical methods to drive new revenue. As I mentioned in a recent talk I gave at Tufts University, which is available as a podcast, Target realized that it made chunks of money from three main life-event situations:

  1. Marriage (people are buying a lot of new products)
  2. Divorce (people are changing their spending habits)
  3. Pregnancy (people have lots of new things to buy and have an urgency to buy them)

By far the most lucrative is the third situation, pregnancy.  Using data collected from shoppers, Target was able to identify this fact and predict which of its shoppers were pregnant, in some cases even before their families knew. (See this article in Forbes for more details.) My point in sharing this is that I believe for Big Data Analytics to really take off, organizations must rethink how they make decisions. As Professor Tom Davenport notes in his book “Enterprise Analytics,” one reason that “Analytical approaches to decision-making are on the rise is… the movement of quantitatively trained managers into positions of responsibility within organizations.” In other words, the people making the decisions now understand data and are willing to make decisions based upon it.

There are many “prescriptive statistical methods” for improving decision making, but at a basic level the questions to ask when making a decision are:

  1. What does the data say?
  2. Do you trust the data?  Is it of sufficient quality to make this decision?

For these reasons, I believe one of the other phenomena we’ll see in the era of Big Data is the need for change management. Because habits and existing processes are difficult to change, help may be needed to modify them in order to operationalize Big Data analytics. If you’d like to learn more about these topics, I’d suggest the following books:

  1. “The Power of Habit” by Charles Duhigg.  A great discussion of how to change any habit–personal, organizational, or otherwise–and what the drivers may be. Duhigg talks about the triggers, routines and rewards of habits.  What if we changed the triggers and rewards to promote data-driven decisions?
  2. “Thinking, Fast and Slow” by Daniel Kahneman. Provides many examples of how people make decisions and how we slip into automated reactions without really digesting the facts of a situation. (The book builds on his Nobel Prize-winning research in economics.) Part of this is teaching people how to think statistically, and to recognize data when making decisions.
  3. “How to Measure Anything” by Doug Hubbard. Provides very actionable and useful ways to measure the value of analytical and data projects, many of which involve simple methods to assess the value of information and decisions.

From Data Driven to Data Science Driven


In my last InFocus post, I talked about the need to trust the data when making decisions. Of course, trusting data implies that you have data, and many organizations begin without realizing this critical fact.

MOVING TOWARD BEING DATA DRIVEN

Not long ago, I spoke with someone who told me their company had a lot of recent, unexpected product sales. They wanted my help to understand who these customers were, and what could be learned about them. After some discussion, it was clear that there were no real systems in place to track the company’s product sales, and little or no data being gathered to track details about the buyers and customers.  So, one of the obvious first steps to fix this situation was to understand what kind of information needed to be collected and what customer data would be useful. Greenplum’s Data Stream blog offers some good suggestions on ways to improve this process, as it pertains to Data Availability.

MAKING DECISIONS IN THE FACE OF UNCERTAINTY

I’ve written in the past about the need for better data-driven decision making. One challenge is that people are creatures of habit and they thrive on pattern recognition. In other words, they know how they’ve made decisions in the past, and use that information to make similar decisions the same way in the future. Consider the Ellsberg Paradox: 

Adapted from the book “Iconoclast: A Neuroscientist Reveals How to Think Differently” by Gregory Berns, Ph.D.

The urn on the left contains ten black marbles and ten white marbles.  The urn on the right contains twenty marbles of an unknown ratio of black to white.  Draw a black marble to win $100.  Which urn do you choose from?

Most people will choose the urn on the left, since it has a known proportion of black and white marbles.  But if the left-hand urn offers the best chance of drawing a black marble, then when the question is reversed, shouldn’t the right-hand urn be the better choice for drawing a white marble?  The reality is that most people will still choose the left-hand urn, regardless of the desired marble’s color, simply because they do not like choices with a high degree of uncertainty, even though this means one of their two choices must be sub-optimal.
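A quick simulation makes the point that ambiguity, not probability, drives this preference. This is a hedged sketch assuming a uniform prior over the unknown urn’s composition; under that assumption, both urns give the same chance of drawing a black marble:

```python
import random

random.seed(7)
TRIALS = 200_000

def draw_known_urn():
    # 10 black and 10 white marbles: P(black) is exactly 0.5
    return random.random() < 0.5

def draw_unknown_urn():
    # Unknown ratio: assume each composition of 0..20 black marbles
    # is equally likely (a uniform prior), then draw one marble
    black = random.randint(0, 20)
    return random.random() < black / 20

p_known = sum(draw_known_urn() for _ in range(TRIALS)) / TRIALS
p_unknown = sum(draw_unknown_urn() for _ in range(TRIALS)) / TRIALS
print(f"P(black), known urn:   {p_known:.2f}")
print(f"P(black), unknown urn: {p_unknown:.2f}")
```

Both estimates come out near 0.50, so preferring the known urn for both colors cannot be justified by the odds alone.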

Aside from the math involved with this, I’d like you to consider the decisions you make every day and ask yourself this question: how many decisions do you make because you are more familiar with a given outcome and the options, even if the decision might not be the best in that scenario?

MAKING DATA SCIENCE-DRIVEN DECISIONS

Although it’s great to emphasize better “decision support” to guide complex or uncertain situations, as one of my colleagues, Annika Jimenez, says in her recent blog post “It is now no longer enough to be a ‘data-driven enterprise.’ Instead, you must build a data science-driven enterprise, a.k.a. the predictive enterprise.”

In other words, it doesn’t scale to make better decisions manually, on a one-off basis; you also have to introduce technology to automate some of the decision making and insights. This is where new technologies such as Greenplum, Hadoop, and Pivotal HD come into play.  These tools are making large-scale analytics accessible to more people so that decisions (the right kind of decisions, of course) can be automated and more rapidly deployed and spread.  This is one of the key reasons for Pivotal HD: it provides a way to make analytics more accessible to Database Administrators (DBAs) and Data Engineers.  Thinking back to the 2×2 matrix I provided in earlier posts, Pivotal HD should appeal to those people in the lower right quadrant (“A”), of which there are many.

If this kind of technology improvement provides access for people with strong SQL skills, it means that DBAs, IT, and Business Intelligence professionals can do more to contribute to data science and to tackling Big Data-related problems.

Recently, I was asked by a friend at another company, “What do people need to do to ‘do Big Data’?”  As you can see, part of the answer is to improve human judgment and educate people on how to make better, data science-driven decisions. The other part is to get scalable technology that can rapidly automate complex decisions.

Recognizing that many business leaders struggle with the related question, “How do you ‘do Data Science’?”, I am working with EMC to develop a new course designed to help them answer it. Coming in early Q2, “Data Science and Big Data Analytics for Business Transformation” will provide business leaders with the knowledge to approach projects in more data-driven and data science-driven ways. It will give them the skills to look at common business problems as opportunities to apply more advanced analytical methods and will teach them how to develop analytics teams.  I’ll be sharing more details about this in future posts.

Do You Need a Chief Data Officer?


In his book “Competing on Analytics: The New Science of Winning,” management and analytics expert Tom Davenport says “Executive sponsorship is so vital to analytical competition…,” and talks about the need for data-driven CEOs such as Gary Loveman (Harrah’s casino) and the role of the Chief Analytics Officer (CAO) or Chief Data Officer (CDO).      

In this diagram adapted from Davenport, the difference between those companies at the bottom of the pyramid and those in the middle or at the top is executive sponsorship. Ask yourself, “Who is the executive in charge of analytics at my company?” If you can answer this clearly, then your company is at least in the top third of this pyramid, if not higher. If you can’t name an executive in charge of analytics, then it is likely that you will continue to have localized pockets of people doing analytical work, but no widespread, predictive analytics initiatives informing the company’s strategy at the highest levels of the organization.

THE DATA-SAVVY CEO

This, of course, raises the issue about the need for a Chief Data Officer (CDO) or Chief Analytics Officer (CAO).  Much has been written on this topic, in terms of the need for these roles and also the scope of the position. There are a few disagreements on this subject, but there is consensus on the main aspects:

  • There is a clear need for executive sponsorship, whether the role is filled by a defined CDO/CAO or someone else
  • The person filling this role should have a strong background in analytics, an entrepreneurial mindset, and a solid understanding of technology

I believe the CDO or CAO are useful and needed roles for an organization.  I’m not fully convinced that every company needs one, but I am convinced that companies need senior executive support to ensure that analytics (or probably any big initiative for that matter) is widely adopted and made integral to their business. In some cases, a data-driven CEO may even be preferable to having someone in a CDO/CAO role.

Take the case of Gary Loveman at Harrah’s casino.

Loveman was COO at Harrah’s, and in that role he intimately understood the operations of the business. He knew how much was spent on various activities, what the key drivers were for many different parts of the business, and how to optimize them.  He could figure out how much time people spend on the casino floor, what kinds of incentives could be offered to encourage people to stay and continue gambling (perhaps a free drink, then a free breakfast, then a free hotel upgrade to entice gamblers to stay longer?), and at what point people throw up their hands and leave no matter what is offered to them. In a role like his, one develops an appreciation for data and an understanding that getting answers to questions like these allows you to interpret behavior and gain from it.  Not everyone is in the gaming or entertainment business, but many of the same principles apply in other industries. When Loveman became CEO of Harrah’s, he already had a data-oriented mindset that he could bring to running a large enterprise, and as CEO, he had the formal authority to institutionalize these principles.

In a talk at Strata conference 2012, Diego Saenz highlighted three skills of a data-driven CEO:

  • Strategic Data Planning – Data is now the new raw material for businesses, along with capital, people, and labor.
  • Analytic Understanding – It is important to ask the right questions when presented with information. CEOs who fail to develop their analytic skills may draw the wrong conclusions when presented with data and make ill-fated business decisions.
  • Technology Awareness – In order to be successful, it is critical that CEOs embrace technology, and make it a key component of their skill set. They don’t need to become technology experts, but they should have at least a fundamental understanding of the capabilities available to them, and what they mean for the organization.

THE ROLE OF THE CDO/CAO

Now, back to the question of CEO vs. CDO/CAO.  I am in favor of the CDO/CAO role; however, I think in practice it can come with a few complications. Let’s imagine that instead of having a data-oriented CEO, you hire a new CDO/CAO to take care of the analytics responsibilities. You should quickly see some improvements resulting from having someone on staff who can look at business problems as analytics challenges, and who knows which may be possible to address and resolve via data and advanced analytics.  For example, what if this person could view the problems facing your executive team as something like this:

This mindset would enable you to inject analytics into many areas of your organization where they may be lacking today.  You may be able to improve many areas of your business rapidly by infusing it with this mindset.  As I see it, however, there are some potential challenges:

  • How does this person gain the trust, credibility and formal authority to drive analytics through the organization?  This person has the opportunity to influence the way much of the business will be run, but unlike the CEO, does not have the same formal authority.
  • What is really the charter of a CDO/CAO?   Is it to serve as a business advisor on all matters analytical? Or is it to get data governance and data collection in shape across the business?

My perspective is that the CDO/CAO should provide deep expertise with analytics and the understanding of which business problems are solvable and could be addressed by data and analytics. In addition, I think it is critical that they oversee and coach the people who are doing the actual analytics work, as well as foresee and advise on some of the ancillary issues that will arise from this work, such as ethics and privacy.

Some have argued that the CDO/CAO role is not needed, and instead it should be part of the charter of the CIO.  I think this would be challenging. From my perspective, the CIO role is already responsible for a great many things.  Adding oversight for analytics would be difficult, and I don’t know how much background or interest in analytics most CIOs have.  I’m concerned that the specialized skills needed differ from those of their current roles and project set (e.g., upgrading corporate systems, overseeing global IT teams and developing IT strategy). Analytics could also be given short shrift due to many other projects and competing demands.

With this said, there may be a need for the role of a CDO to drive data governance, data collection, and change management, and a CAO role to drive awareness of analytics and bring that mindset to the executive table. However, I will qualify this by saying that it may not be necessary to have two distinct people performing these roles, although this set of tasks needs to be done – and it needs to be done by someone in the executive suite who has the domain background, interest, and authority to drive it.

In fact, I recently had the opportunity to speak with Tom Davenport (mentioned above), who in addition to being an author is an expert on analytics and business, a professor at Babson College, and a co-founder of the International Institute for Analytics (IIA).  His perspective is that the CDO/CAO may work better as a transitional role, perhaps warranted to drive rapid change in an organization.

The last point I want to raise is that sometimes when a new role is introduced with the intent of driving a particular initiative, those in other areas of the organization may feel it is no longer in their charter to cover those responsibilities. So, a CDO/CAO must take care to do the job well without alienating others or absolving them of the responsibility for doing analytics within their functional organizations.

OK, have I scared you off yet? Many times people ask me how to get started “doing Big Data.”  I think executive sponsorship is a key step on this path. Whether you are fortunate enough to have a data-driven CEO or want to appoint a CDO/CAO-type of role, either option is a key step on this journey.  I’ll talk more about some of the other drivers in upcoming posts.

Hopefully, you are now aware of some of the considerations involved in creating executive-level sponsorship for analytics. Coming in early Q2, EMC will be offering new courses that will delve into this and other similar issues facing business leaders implementing an analytics strategy.  Watch for news of the new “Data Science and Big Data Analytics for Business Transformation” courses coming soon.

Also, please join me in Las Vegas at EMC World May 6-9, where I will be discussing these issues in more detail in my breakout session, “Building Data Science Teams,” on May 8 from 10-11 am. Hope to see you there!

The Dirty Little Secret of Big Data Projects


Over the past year, I’ve had the privilege of being involved with an initiative at MIT called bigdata@csail; CSAIL is the Computer Science and Artificial Intelligence Lab at MIT.

Massachusetts Governor Deval Patrick kicked off this initiative in May 2012 to show the state’s interest in partnering with academic institutions to drive innovation in technology, specifically related to Big Data. The idea is that Big Data provides many opportunities to drive change, so why not bring experts together to share ideas, research, and problems, and find ways to use Big Data technologies to improve what we do.

About ten companies, including EMC, are participating in this consortium, which is focused on advancing research on Big Data, and helping each other learn and solve problems related to this area. (BT recently also joined the consortium, but is not included in this graphic.)

There have been some outstanding sessions at bigdata@csail, covering a variety of topics. Everything from faster and more efficient scientific database architectures based on matrix-based table structures (SciDB), to integrating sampling techniques for lightning fast queries (BlinkDB), developing new open source collaborative analytical software (Julia), and the use of machine learning and Big Data to interpret body language and non-verbal communication between people (see Sandy Pentland’s work on sociometric badging and Living Labs).

Although I find these projects innovative and exciting, it occurs to me that many times people look to improve and push boundaries for things that we are already pretty good at, while spending less time on improving areas that are important, but may be difficult or less sexy.

The Data Analytics Lifecycle that we developed and teach in EMC’s data science classes is an example of this:

Although the third (Model Planning) and fourth (Model Execution) phases tend to get most of the attention, since these are where algorithms and predictive models come into play, the place where people spend the most time by far is in Phase 2, Data Prep. From my experience, and from input I’ve received from others who are experienced Data Scientists, Data Prep can easily absorb 80% of the time of a project. But there has been a real lag in the development of tools for data prep. Many times I see leaders who want to get their data science projects going quickly, so their teams jump right into making models, only to slide back a few phases, because they are dealing with messy or dirty data. They must then try to regroup and create predictive models.

Dealing with data cleansing and conditioning can be a very unsexy part of a project. It can be painful, tedious, time consuming, and sometimes thankless to clean, integrate, and normalize data sets so that you can get the data into a shape and structure to analyze later on. Rarely do people pound their chests at the end of a project and talk about all of the fabulous data transformations they performed in order to get the data into the right structure and format for analysis. This is not where the sizzle is, but, like many things, it’s what separates the novices from the masters. In fact, because of the amount of thought and decision-making involved in how data is merged, integrated, and filtered, I believe more and more that data prep cannot be separated from the analytics; it is intrinsically part of the Data Analytics Lifecycle and process. The reality is, if you give Data Prep short shrift, everything that comes after it is a waste of time.

From my perspective, many new tools have emerged to help simplify analytics, dashboards, reporting, handling streaming data, and even improving database architectures, but I’ve seen very little in my career that truly improves the Data Prep phase of the project. It seems to be the dirty little secret of every data science or analytical project that the time and attention spent on Data Prep is intensive. As a consequence of it being tedious and labor intensive, most organizations use only a handful of their datasets. Research presented during a recent talk at MIT indicated that large organizations have roughly 5,000 data sources, but only about 1-2% of these make it into their Enterprise Data Warehouse. This means that 98% of an organization’s data sources may be unused, inaccessible, or not cleaned up and made useful for people to analyze and use to make better business decisions.

For these reasons, I’m glad to see that people are starting to create tools to address this need. As all of the marketing hype, newspapers, media, and legitimate research tell us, 80% of new data growth is unstructured. To take advantage of it, we need to get a lot better at preparing, conditioning and integrating data.

The theme for the most recent bigdata@csail session in early April focused on Data Integration. Many researchers presented their projects and research on Big Data, and how they are trying to solve these problems. If data integration has been a problem for years, shouldn’t this turn into a much bigger problem when integrating Big Data?

The thesis is that rather than using brute force techniques to merge data together, we can use more intelligent techniques to make inferences about different kinds of data and automate some of the decision making. Take a project like Data Tamer, which strives to inject algorithmic intelligence into the data cleaning stages of the Data Analytics Lifecycle to make our lives easier and better. To give a very simple example, if Data Tamer detects that two columns of data are named differently but contain data that are highly similar above a certain threshold, then we can infer it is likely the same data, and that someone has renamed the column. Data Tamer will suggest likely columns to combine, and ask the human to choose how and when to merge data. This means some of the brute force-level work can happen with machine learning, and some of the more difficult decisions about merging data can be left to humans, who can exercise higher-level judgments based on their deep domain knowledge and experience.
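As a toy illustration of this idea (not Data Tamer’s actual algorithm), a simple similarity score over column values can flag candidate merges and leave the final call to a person; the column names and threshold below are invented for the example:

```python
def jaccard(col_a, col_b):
    """Overlap between two columns' value sets, from 0.0 to 1.0."""
    a, b = set(col_a), set(col_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def suggest_merges(columns, threshold=0.8):
    """Flag column pairs whose values overlap above the threshold,
    leaving the actual merge decision to a human reviewer."""
    names = list(columns)
    suggestions = []
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            score = jaccard(columns[x], columns[y])
            if score >= threshold:
                suggestions.append((x, y, score))
    return suggestions

data = {
    "cust_id": ["C001", "C002", "C003", "C004"],
    "customer_number": ["C001", "C002", "C003", "C005"],  # a renamed near-copy
    "region": ["east", "west", "north", "south"],
}
print(suggest_merges(data, threshold=0.5))
# → [('cust_id', 'customer_number', 0.6)]
```

The machine does the brute-force comparisons; the human uses domain knowledge to decide whether `cust_id` and `customer_number` really are the same field.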

Instead of using only 1-2% of the data in an organization, what decisions could we improve if we could access 10% or 20% of an organization’s datasets? MIT is doing good things to advance data integration and conditioning, but there are also other tools emerging in this area (even if they are less sophisticated) to make the data conditioning and prep easier for most people. Here are a few free tools:

1)      Open Refine (formerly Google Refine) has a simple user interface to help people clean up and manipulate datasets.

2)      Similarly, Data Wrangler, which emerged from Stanford, does some of the same things.  Both are great tools with graphical user interfaces.

3)      Certainly, you can also use R if you are feeling a bit more ambitious, starting with the reshape2 and plyr packages.  These packages will enable you to perform many, many data transformations to help in data science projects (though R has more of a learning curve).
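For a sense of what these reshaping packages do, here is a plain-Python analogue of a “melt” operation, turning wide records into long ones, one row per id/variable pair (the field names are invented for the example):

```python
def melt(rows, id_var, value_vars):
    """Reshape wide records into long form, similar in spirit
    to reshape2's melt in R: one output row per (id, variable) pair."""
    long_rows = []
    for row in rows:
        for var in value_vars:
            long_rows.append({id_var: row[id_var],
                              "variable": var,
                              "value": row[var]})
    return long_rows

wide = [
    {"store": "A", "q1_sales": 100, "q2_sales": 120},
    {"store": "B", "q1_sales": 90,  "q2_sales": 95},
]
for row in melt(wide, "store", ["q1_sales", "q2_sales"]):
    print(row)
```

Long-form data like this is often much easier to filter, aggregate, and plot, which is why so much prep work boils down to reshaping.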

I encourage you to explore these tools. Getting even a little bit more conversant with data management and handling means you can greatly expand the universe of data you can explore and there will be more data out there that you can use for analysis. This will become even more critical as Big Data continues to evolve.

Please add your comments and feedback on your favorite tools and methods for data preparation. Also, for those in the Boston area, I’d like to invite you to a session I’m presenting at MIT Sloan on Big Data on April 29. Hope to see you there.

What Business Leaders Need to Know to Start “Doing Big Data”


In my past few posts, I’ve discussed topics that managers and business leaders must understand to become conversant with Big Data and take advantage of it in their businesses.  I’ve covered everything from challenging the assumptions and habits behind decision making, to future trends in this area, to the need for and role of executive engagement.

After this build up, I’m happy to announce that the big day is now here, and we are launching two new data science and Big Data analytics courses for executives and business leaders. 

About a year ago, we launched our first data science education offering, Data Science and Big Data Analytics, which was geared to aspiring Data Scientists. That course was developed for people with the requisite technical background who wanted to take the next steps to learn more about machine learning, data mining, in-database analytics and other data science skills.

Now we are offering two new courses focused on giving business leaders and their teams the skills and knowledge to implement a Big Data strategy and lead an analytics team to success. The first new course is called Introducing Data Science and Big Data Analytics for Business Transformation and is a 90+ minute Executive Module on Big Data and data science, and the key business benefits of introducing Big Data into an organization.  The four main benefits an organization can garner from successfully implementing data science projects are:

1)      Financial: recent research shows very high ROI on data science projects, with an average payback of 11 to 1.

2)      Decision quality: improved decision quality by using data to drive the right decisions and challenge underlying assumptions.

3)      Data quality: data gets better when people actively begin using data in their organizations, discovering anomalies in data that has been previously unused or unexamined.

4)      Collaboration: better collaboration due to cross-functional teams needing to work with each other on challenging business problems. This paves the way for better understanding about data problems, and these improved relationships make these projects easier in the future.

The second new course is Data Science and Big Data Analytics for Business Transformation. Many companies decide they want to “start doing Big Data,” but don’t know how to start.  Or sometimes they appoint someone to be in charge of this new important-sounding initiative, but that person needs a roadmap in order to be successful.  Our aim in developing Data Science and Big Data Analytics for Business Transformation is to help these individuals. This one-day course provides case studies, a Data Analytics Lifecycle to guide them on managing projects, overviews of several analytical methods with real-world business examples, and also some understanding of how to use analytics to drive change and innovation.

We were fortunate to have Patricia Florissi, EMC’s CTO for Global Sales, join us and provide input on the course development.  Here is her perspective on why Big Data matters and why leaders need to learn some of these key points:

With these new courses, we now have data science and Big Data analytics courses for people at different levels within the organization and also who may be at different stages in their adoption of Big Data.  Executives and others new to Big Data may want to take the new 90+ minute Executive Module; new leaders of data science teams will learn the key points for what they need to know from the one-day course; and aspiring data scientists can take the first steps in their role with the week-long technical course.  The launch of these two new courses on business transformation is a good step toward helping people at different levels of the organization be successful with Big Data.

Lastly, I wanted to share this video of John Smits, who oversees a Sales Operations & Analytics team at EMC.  John talks about how he realigned his team and how he’s taken steps to pragmatically implement data science methods within Sales Operations at EMC.

Also, please join me in Las Vegas at EMC World May 6-9, where I will be discussing these issues in more detail in my breakout session, “Building Science Teams” on May 8 from 10-11 am. Hope to see you there.

Big Data: Still Early Days? A View From EMC World


Because my area of expertise and interest is Big Data and Analytics, my experience at EMC World was viewed through this lens. I think this year’s EMC World introduced a number of new ideas related to Cloud, Big Data, Trust, and Virtualization, and featured a tighter coupling of these ideas.

If we turn the calendar back to EMC World 2012, there was a clearer dichotomy between some of the session tracks related to Big Data. Although the main EMC World conference did have some sessions related to Big Data, for the most part, Big Data topics were offered via separate events: the Data Science Summit and Greenplum Connect. The Data Science Summit was a one-day conference, treated as a marquee event. There were top-name speakers, such as Nate Silver, and others from the who’s who of Data Science, including Hadley Wickham, Chris Wiggins, and many other notables. Greenplum Connect was a one-day mini-conference geared more for Greenplum customers. It had many sessions on analytics and also some on Greenplum strategy. These complementary conferences featured pure-play sessions on Data Science and Big Data, in addition to the other sessions related to these topics integrated into the main EMC World program.

Part of what changed the dynamic at EMC World 2013 for Big Data was the lack of a concurrent Data Science-focused event. However, there were still sessions at EMC World 2013 that focused on Big Data and Data Science, and I found that the people attending them, although fewer in number because there was not a data science-focused event to draw them to Las Vegas, were genuinely interested in learning about the topics.

While delivering EMC’s Data Science & Big Data Analytics course to people who want to become practitioners of Data Science, we realized there is a gap in the industry. There are now people who have responsibility for building and leading Data Science teams, but may not know how to get started. To help answer this question I delivered an EMC World session entitled “Building Data Science Teams.” I’ll share more about the material I covered in my presentation in a future InFocus post, but for now I thought I would share a few observations about the session in general.

Nearly 100 people attended my session, and the attendees seemed quite interested in the material, with the attendees ranking the presentation very highly in the post-session evaluations. As is often the case, people seemed bashful about asking questions during the session, but after we turned off the microphones, 10-15 people approached me with questions, many of whom then followed up with me during the remainder of EMC World with subsequent questions. Many of the questions and comments I received showed some emerging trends:

  • Potential entrepreneurs.  There are a growing number of people who have new ideas and want to become consultants, advising organizations on some facet of managing data or on starting and deploying analytics teams. In addition, there are a number of people who want to start brand new businesses built around Big Data.
  • Growing concerns around security and privacy.  From my perspective, these issues are really the sleeping giants related to Big Data. Once people realize and begin to understand these issues, these elements will really start to come under the microscope.
  • Still early days.  Despite people’s best efforts and the amount of media hype, these are still the early days of Big Data, and people are still figuring out this whole space. I think there is still a misperception that Big Data is only for the smart super-geeks in IT, rather than a phenomenon that has the power to influence nearly every discipline and field. In just the past month, I’ve spoken with people trying to use Big Data to figure out how people learn most effectively; how librarians can mine scientific literature; and how recent business school graduates can start companies related to Big Data. In other words, these implementations of Big Data span psychology, education, learning theory, entrepreneurship and many more – not just IT.

Again, I mention the dynamic where people did not want to ask many questions during my session, but did want to talk offline. Perhaps it takes time to digest new ideas related to Big Data, but it also seems people think they may be the only ones lacking an understanding of Big Data, or they fear looking naïve in front of their peers. The reality is we are all in the same boat. Realize that if you are getting into Big Data now, it is not too late, as people are still trying to figure out what this thing is and how to best take advantage of it. So, I encourage you to reach out to other people, ask questions, explore, collaborate, and look for more resources to learn about Big Data. It really is still early days.

Big Data vs. Big Brother


EMC is a founding member of an initiative at MIT called bigdata@CSAIL, which I wrote about in a previous InFocus post. CSAIL is the Computer Science and Artificial Intelligence Lab at MIT, and bigdata@CSAIL, kicked off over a year ago by Massachusetts Governor Deval Patrick, demonstrates the state’s interest in partnering with academic institutions to drive innovation in technology, specifically related to Big Data.

bigdata@CSAIL members

I am fortunate to participate in bigdata@CSAIL’s ongoing series of workshops, lectures, and research on this emerging field, including the challenges, opportunities, and ecosystem related to Big Data. One such session, on June 19, focused on the intersection of Big Data with Security and Data Privacy.

Security and privacy are becoming bigger and bigger concerns as we are able to generate more intelligent and sophisticated analytics with Big Data. This session was attended by a senior-level group of thought leaders, many of whom were regulators (David Vladeck, Law Professor at Georgetown, formerly of the FTC), legal experts (Daniel Weitzner), faculty (Sam Madden and others), or doctoral research students at MIT.

Although there was a tremendous amount of outstanding content at the session, which I will discuss in future posts, there were a few key takeaways and recurring themes from the day that I want to highlight immediately:

1)     Security is complex, and many times you still have to ask ‘who guards the guards?’ It’s not enough to secure a database somewhere; sometimes you need to secure the data from your own people, such as system administrators who may have root-level access to databases and may inadvertently see data that they shouldn’t. Among the solutions offered to address this problem were performing queries against encrypted databases, and storing decryption keys elsewhere, away from the databases themselves.
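As a toy illustration of that first suggestion (real encrypted-query systems are far more sophisticated; this simplified sketch is my own, not a production design): deterministic keyed hashing lets a database answer equality queries over data it cannot read, while the key lives with the application rather than the database administrator.

```python
import hashlib
import hmac

def token(key: bytes, value: str) -> str:
    """Deterministic keyed digest: equal plaintexts yield equal tokens."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()

# The key is held by the application, away from the database and its admins.
key = b"application-held-key"

# What the database stores: tokens, never the plaintext SSNs.
encrypted_rows = [token(key, ssn) for ssn in ["111-22-3333", "444-55-6666"]]

# Only a key holder can form a query token; an administrator browsing the
# table sees opaque digests.
query = token(key, "111-22-3333")
print(query in encrypted_rows)  # True
```

The trade-off is typical of this space: determinism is what makes equality queries possible, and it is also what leaks information (repeated values produce repeated tokens), so real systems layer on additional protections.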

2)     As tricky as security can be, privacy is a completely different ball game. Many times people think of data security as having strong algorithms and ensuring that sensitive data is secured. One participant at the workshop remarked that the biggest threat to privacy is actually cryptographers. Some people believe that if they just have stronger and stronger encryption, their sensitive information will be kept private. This may be oversimplifying, but the nuance with privacy is that it’s not just about securing personally identifiable information (PII), it’s also about considering what other related pieces of information could be used together to infer PII. Therefore, you need to be very careful about the information you share, and the intentions of the people who collect it.

For example, if you give Amazon.com your address to send you a shirt you ordered, you are giving them your permission to use your address information for the purpose of sending you that shirt. However, in doing this, you should not also be giving permission to resell that address information or to sell you other things from other vendors. Therefore, there need to be shifts in the way people and organizations think about their private information: instead of sharing their information freely, people must consider whether they are leasing their information to a specific person or entity under certain conditions for a certain period of time. Providing this control on the part of the individual was a core theme of the privacy workshop.

3)     Privacy = Technology + Social + Regulation. What I mean by this equation is that technology alone will not solve this issue. Privacy is a much more subtle problem, which requires a blend of these three components. Technology and new algorithms are certainly part of the solution, as there are techniques to have intelligent systems answer questions without sharing sensitive data. By “social,” I mean that we need to incentivize accountability regarding privacy. Back to my Amazon example, how do you provide the right incentives (and disincentives) for a retailer or other organization not to use, sell or share your data inappropriately or without your permission? Remember the Google Buzz privacy problem a while back? Google decided to do you the favor of sharing lots of information about you with your friends or people unknown to you in order to jumpstart social circles. This turned into a disaster and it was quickly shut down.

This leads me to the last piece of the equation: regulation. This is a very tricky element. In the US, the Privacy Bill of Rights has gotten attention, and in Europe there are laws governing the use of personal data. None of these are perfect, but they are a start and they help. As David Vladeck, former Director at the Federal Trade Commission and a Professor at Georgetown Law, put it: “the FTC is a bit like the gym teacher at the high school prom. The gym teacher doesn’t want to be there, and the students don’t want him there, but he needs to be there just to make sure things don’t get too out of hand.” This is to say that regulation is important, but mostly as a guideline to remind people how to act appropriately and ethically in fuzzy situations.

As we train people to become Data Scientists and design sophisticated algorithms, I encourage practitioners and leaders to consider the trade-offs between utility and privacy. This is a new, emerging and complex area, and I’ll explore it further in future posts.


Building Data Science Teams


I delivered a session called “Building Data Science Teams” (viewable on YouTube) at EMC World. As I mentioned during the presentation, as much as the media is focusing on the shortage of Data Scientists, the reality is that to do it well, you must consider data science as a team sport. The success of a project is not just on the shoulders of Data Scientists; it also requires a number of other roles within a team, such as Data Engineers, BI Analysts, and strong stakeholders. See the image below for an overview of the seven roles that are common in these projects.

In addition to a shortage of Data Scientists, we are also experiencing a shortage of data savvy managers. These are the people who understand how to make better decisions with data, and who also work with their teams to design insightful ways to use data to test ideas.

In my talk at EMC World, I covered many of the elements that people need to consider as they build data science teams and architect this capability within an organization.  My session focused on these four main areas:

  1. Data Science Team.  The roles and competencies required for a high-performing data science team.
  2. Developing Data Science Capabilities.  Deciding which model to choose for developing data science capabilities, because not everyone needs to build their own team. In other words, you should consider whether it is best to transform an existing team, build a brand new one, outsource, or crowdsource specific data science problems.
  3. Organizational Model.  Organizations may choose to have a centralized data science team, a de-centralized one, or take a hybrid approach.  The key is to understand the trade-offs of each path and be thoughtful in the path you pursue.
  4. Executive Engagement.  Getting executive engagement and support is critical.  This is what separates a company with localized analytical teams from one that uses analytics to really inform its strategy. Think of Netflix, Amazon, or the Oakland A’s (Moneyball) in this latter category.

During EMC World, I also did a short presentation in the EMC Global Services Booth that focused on implementing Big Data projects and the shortage of people with the necessary data science skills. You can watch a short video from my presentation in which I break down the key roles that are needed as part of a data science team and the skills that are necessary at each level.

Much of the content for these sessions was taken from the data science and Big Data analytics courses that I helped develop at EMC. The newest course, designed specifically for business leaders, Data Science and Big Data Analytics for Business Transformation, is offered as a free, 90 minute executive module, as well as a more comprehensive one-day class on DVD.  For those wishing to develop greater skills as a data science practitioner, Data Science and Big Data Analytics is a five-day course offered via DVD or in classroom.

The initial feedback from these courses has been very positive. I’ve received emails from Directors of Analytics at companies who have attended the class or watched it on video, and feel it was on the mark and important to train people on these topics.  I expect as more organizations figure out that Big Data is critical to their future success, they will tap more of their leaders to get trained so they can make an impact with Big Data.

 

The Four Main Benefits of Data Science Projects


My last InFocus blog post, “Building Data Science Teams,” included video footage from a presentation I did at EMC World about the key roles and skills needed to implement a Big Data project.

Here is a second short video from the same presentation at EMC World, which describes the four main benefits of implementing data science projects: Financial, Decision Quality, Data Quality and Collaboration. I also discuss EMC’s data science curriculum and how it helps business leaders and IT professionals develop the Big Data analytics skills needed to contribute to a data science team.

For more details on this topic, I wrote a previous post for InFocus, entitled “What Business Leaders Need to Know to Start ‘Doing Big Data’” which covers this topic more extensively.

 

Big Data vs. Big Dollars


Every day, people are trading privacy as a commodity in exchange for goods and services, although they may not realize such an exchange is actually taking place. I touched on this topic a bit in a previous InFocus post, “Big Data vs. Big Brother.” Google’s free services, such as Gmail, are a prime example of this trade off. In this context, people are getting “free” email usage, but in exchange they are giving Google permission to scan their emails and provide targeted advertising back to them based on the content and habits of their email messages.

On a larger scale, many people have mixed reactions to the recent NSA PRISM project and Edward Snowden. Here is a brief overview of PRISM, from Wikipedia, in which federal agencies explored available data to determine who was a potential security risk or threat.

PRISM

As you can see, Google was one of many providers sharing data that became an input for the surveillance monitoring.

In the case of PRISM, data monitoring, and the mining of Big Data was conducted under the auspices of the public good. I’ve seen many different reactions to the NSA PRISM articles and publicity. Many people seemed shocked that this activity was taking place. For some, it raises concerns that others are reading their emails, Facebook updates and Likes, tweets, and other things in the name of keeping us safe or otherwise monitoring suspicious behavior. Others have commented to me…

“Well, of course the NSA is doing this. What do people think they are paid to do?”

“They can go ahead and read my tweets. I don’t care. I’m doing nothing wrong.” 

This is all true – most people are not doing anything wrong, and are not running large scale surveillance projects from their basements. However, similar data analysis can have a more direct impact. Consider another instance that generates data, such as buying a roto-fryer online for a relative and sending it as a holiday gift. People mining your clickstream data may infer that it is an item for your own use and offer you similar products.

What if health insurance carriers decided to analyze clickstream data to evaluate risk levels based on your online behavior? For instance, would it be reasonable to assume that people frequenting sites to purchase roto-fryers or downloading recipes from Paula Deen may have less healthy habits than those who frequent webmd.com or sites about exercise and healthy eating? What if insurance prices were influenced by this online browsing or shopping? Is that fair or reasonable?  Should you only browse healthy food sites?

Whatever you think of the PRISM case, it does raise some provocative questions when you begin to think about the implications of privacy and Big Data in regard to economic problems and markets that are predicated on shielding information from other participants.

Consider the health insurance industry again. Typically, insurance carriers estimate the amount of risk an individual represents. They can analyze a huge amount of information about someone to ascertain their risk level — Do they smoke? Are they overweight? What is their family history? — and based on their perceived level of risk, the individual is offered a price for their health insurance. For group insurance plans, companies will benchmark the overall perceived level of risk for the group, and then offer a group rate for the insurance. Although companies can gauge the aggregate amount of risk for a group, a company will not know for sure which specific individual is the most likely to contract an illness, and individual risk profiles are not disclosed.

Imagine what would happen if an individual’s profile were known in this case? If the company could determine who had the highest likelihood of contracting certain illnesses or diseases, they might choose to only insure certain people, and not cover those at high risk. This individual data suddenly becomes a very valuable dataset, which allows insurers to understand in a very specific way those with a very high likelihood of contracting an expensive illness and those at low risk. Using this data they could opt to adjust pricing accordingly. In other words, without a certain level of privacy (and ethics) and protection of sensitive data, the health insurance market would drastically change or break down.
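A back-of-the-envelope calculation (with invented numbers) shows why this matters: pooled pricing depends on the insurer not knowing exactly who the high-risk individuals are.

```python
# Expected annual claim cost per person (numbers invented for illustration).
expected_costs = {"Alice": 1000, "Bob": 1200, "Carol": 9000}

# Without individual risk profiles, everyone pays the pooled average.
group_rate = sum(expected_costs.values()) / len(expected_costs)
print(round(group_rate, 2))  # 3733.33

# With individual profiles, the insurer can price (or decline) each person,
# and the cross-subsidy that makes the pool work disappears.
individual_rates = {name: round(cost * 1.1, 2)
                    for name, cost in expected_costs.items()}
print(individual_rates["Carol"])  # 9900.0
```

Once Carol's individual risk is visible, her price nearly triples relative to the group rate, illustrating how de-anonymized data can break a market built on shielded information.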

Big Data has also been used to infer human behavior at a granular level. Based on the geospatial data in your phone, researchers are able to determine how active you are, how much exercise you get, how often you leave your home, and the social circles in which you travel. An interesting example of this is the Data for Development (D4D) Challenge, which analyzed aggregated cell phone data in Ivory Coast to improve standards of living. If this information were shared in certain contexts, it could increase the likelihood that predictions are made about specific people contracting certain medical conditions. The World Economic Forum refers to this as “Reality Mining,” and has a great (and short) report on this topic, which discusses a “new deal” around privacy and data ownership.

These are just a few examples that have sparked my interest in understanding the importance of privacy in the era of Big Data. I am in favor of data sharing because it is what enables Data Scientists to run experiments on data that derive new value and insights. I also believe there are tremendous opportunities to curate and analyze Big Data for the public good. However, we also need to be thoughtful and ethical about how data is shared in order to protect identifying information, and we need to be mindful of the disruptive effect it may have on various markets.

Building Data Science Teams at the Data Scientist Meetup


Big Data today is much like the World Wide Web was in the early 1990s. Back then, the Internet was so new, so different, and so revolutionary, it required completely new operating models, and people struggled to understand what it was, how to use it and how to best take advantage of it.

And because the World Wide Web was so new, there were few formal education programs available. To learn about the Internet and the mysterious World Wide Web, many groups formed organically to learn from each other and help create a community of people who understood and could figure out this new phenomenon together.

Now fast forward to our current day and time. Today we are dealing with this messy, hazy, and tantalizing thing called Big Data. There are many organizations popping up to help educate people (mine included, as I work in EMC Education Services), and many new university programs emerging to meet the demand and fill the skills gap for Data Scientists. As in the early days of the Internet, there are also many self-forming, informal communities that have emerged for people to help each other make sense of this space.  One such community is the Meetup.


Meetups are just what they sound like — groups of people meeting up to talk about all kinds of topics. Much like Big Data itself, the variety and volume of Meetups is simply staggering. In the Boston area alone, there are more than 3,000 Meetups nearby, ranging from groups about hiking, Frisbee contests, food and technology entrepreneurship, venture capitalists, toy puppy-owner affinity groups, education and data mining, technology and of course, Big Data.

I have found very active Meetups where people are learning about Big Data, Predictive Analytics, Data Scientists, and other related topics such as Hadoop, Cassandra, and Django. One interesting thing about Meetups is that the people attend on their own time, often after a full day of work, sometimes paying attendance fees out of their own pocket, or volunteering to help out. In other words, their behavior shows they are genuinely interested in the subject matter.

I volunteered one recent evening to speak at a local Data Scientist Meetup in Cambridge, MA.

The Meetup was held at the Microsoft New England Research and Development Center, affectionately called “The NERD Center.”


The NERD Center is a terrific facility, with large open meeting rooms and a view of the Charles River. My Meetup talk focused on “Building Data Science Teams,” a topic that I also presented at EMC World in Las Vegas several months ago. Although my Las Vegas audience was somewhat quiet, generating only a few questions, my Cambridge audience was very vocal, asking 30-40 questions over a span of 90 minutes.

Here is a link to my presentation and to video footage of the Meetup (thanks to Kate Hutchinson for the video recording).

As I speak on this topic, some common themes emerge in the audiences’ questions:

  • How do I learn about Big Data and get started with it?
  • How do I get a Big Data job?
  • What are the roles on a data science team? How does a Data Scientist differ from a Data Engineer or a Database Administrator, and how does this distinction change with new tools, such as Hadoop?
  • What organizational models are there for Big Data and Analytics in an organization?  How do I choose the right model, and what can I infer from each option?
  • How do I deal with sensitive information when the data requires privacy and security?

In future InFocus posts, I will explore these questions. Are these issues relevant to you? If there are other questions you would like to see answered, please feel free to post them in the comments section.

The Genesis of EMC’s Data Analytics Lifecycle


When I developed a new Data Analytics Lifecycle for EMC’s Data Science & Big Data Analytics course in 2011, I had no idea how much attention it would receive. Although I have been doing analytical work for most of my career, I needed to do considerable research to create a solid process for others to follow. After some preliminary research, I realized that there were surprisingly few existing frameworks for conducting data analytics.

The best sources that I came across were these:

  • CRISP-DM, which provides useful inputs on ways to frame analytics problems and is probably the most popular approach for data mining that I found.
  • Tom Davenport’s DELTA framework from his text “Analytics at Work.”
  • “MAD Skills: New Analysis Practices for Big Data” provided inputs for several of the techniques mentioned in Phases three to five of my Data Analytics Lifecycle, which focus on model planning, execution, and key findings.
  • Doug Hubbard’s Applied Information Economics (AIE) approach from his work “How to Measure Anything.” The focus of this work differs a bit from a classic data mining approach. Hubbard’s approach emphasizes estimating and measuring for the purpose of making better decisions. It has some very useful ideas, and helps one understand how to approach analytics challenges from a unique angle and treat them more like decision science problems.
  • The Scientific Method. Although it has been in use for centuries, it still provides a solid framework for thinking about and deconstructing problems into their principal parts. One of the most valuable ideas of the scientific method relates to forming hypotheses and finding ways to test ideas.

After reading these other approaches to problem solving, I read additional industry articles, and also interviewed multiple data scientists, including several now at Pivotal Data Science Labs, as well as Nina Zumel, a Data Scientist at an independent company, Win-Vector.

This research fueled the creation of a new model for approaching and solving data science or Big Data problems, which is portrayed in this diagram:

Data Analytics Lifecycle

This diagram was designed to convey several key points:

1)      Data science projects are iterative. Each phase does not represent static stage gates, but reflects the cyclical nature of real-world projects.

2)      The best gauge of advancing to the next phase is to ask key questions to test whether the team has accomplished enough to move forward.

3)      Ensure teams do the appropriate work both up front and at the end of projects in order to succeed. Too often teams focus on Phases two through four and want to jump into modeling work before they are ready.
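The phase-gate idea behind these points can be sketched as a tiny loop; the phase names follow the lifecycle's six phases, while the gating questions here are my own illustrative paraphrases, not the course's official checklists.

```python
# Each phase pairs with a gate question; the team advances only when it can
# answer 'yes', and otherwise stays put (or revisits earlier work), which is
# what makes the lifecycle iterative rather than a set of static stage gates.
PHASES = [
    ("Discovery", "Do we have enough information to frame the problem?"),
    ("Data Preparation", "Is the data conditioned well enough to model?"),
    ("Model Planning", "Do we know which methods and variables to try?"),
    ("Model Building", "Is the model robust enough to evaluate?"),
    ("Communicate Results", "Can we state the findings for stakeholders?"),
    ("Operationalize", "Is the model ready to pilot in production?"),
]

def next_phase(current: int, gate_passed: bool) -> int:
    """Advance one phase on a 'yes'; otherwise stay in the current phase."""
    return min(current + 1, len(PHASES) - 1) if gate_passed else current

phase = 0
phase = next_phase(phase, gate_passed=True)
print(PHASES[phase][0])  # Data Preparation
```

In practice the loop-backs matter as much as the forward moves: a failed gate in Model Building often sends a team back to Data Preparation rather than simply holding it in place.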

I’ve seen people get excited about this approach when taking our data science classes, and they have talked about it online, in blogs and even in books. Last year, I co-authored a blog series with EMC Fellow Steve Todd describing how to apply the Data Analytics Lifecycle approach to measure innovation at EMC. This work has been cited many times, both in terms of the project itself (which was mentioned in Business Week) and the methodology, which was highlighted in CRN magazine. It was also recently featured in Bill Schmarzo’s new book on Big Data.

This Data Analytics Lifecycle was originally developed for EMC’s Data Science & Big Data Analytics course, which was released in early 2012. Since then, I’ve had people tell me they keep a copy of the course book on their desks as a reference to ensure they are approaching data science projects in a holistic way.

I’m glad that practitioners, theorists, and readers have found this methodology useful. If you would like to learn more about frameworks for approaching Big Data projects, I’d suggest you check out our EMC Education course materials on Data Science and also review some of the resources mentioned above.
