data science vs statistics talk at the RSS


Tonight I attended a debate held at the Royal Statistical Society. It was an entertaining and wide-ranging discussion with a great panel, and fairly intelligent questions from the floor.

Two fun facts:

So, here are the notes. They are rough and ready, and I missed a lot in my note taking, but I hope I get a flavour of the discussion across.

Data Science and Statistics: different worlds?

Introduction

Data science needs to define itself, perhaps in order to survive. There is a shortage of people in these roles, and the feeling is that statistics and computer science are both core skills.

We also need to develop some new methodologies: some aspects of statistics need to change when you work at scale. We need an automated and large-scale approach to the problem.

Every company knows that they need a data scientist, but many of them don’t know what skills a data scientist needs.

Chris Wiggins (Chief Data Scientist, New York Times)

Chris has a nice story about the moment that biology got a big interest in data. Biology became awash in data as a result of genetic sequencing. Many fields are now experiencing similar issues. At the NYT he has been building a data science group to help understand how people interact with the NYT product.

Abundant data that is challenging to learn from

They have created a certification in data science at Columbia University.

David Hand (Emeritus Professor of Mathematics, Imperial College)

Was professor of statistics for 25 years at Imperial College, and then got made an emeritus professor of mathematics.

His statistical interests overlap with topics such as machine learning and data science (ML is clearly a sub-discipline of statistics (tongue in cheek)).

Has published rather a lot of books.

Wants to make a few fundamental points:

people don’t want data, they want answers

data aren’t the same as information (e.g. with pixels in digital cameras, cameras keep creating more data, but the information in the picture remains the same)

two key areas of challenge. The first is data manipulation: automatic route finders, recommendations; the challenges there are mathematical and counting. The other kind of challenge relates to inference: what will the data say in the future, not what has happened in the past. That is statistics.

all data sets have problems; data quality is a key issue. It’s potentially even more serious for large data sets than it is for small ones, as the computer is an intermediary between the person and the data.

the assertion that we no longer need theories, that we can just look at the data, more often than not leads to mistaken conclusions

Big data doesn’t mean the end of small data. There are more small data sets than big data sets.

Francine Bennett (Founder, Mastodon C)

Founder of Mastodon C. Was a maths undergraduate, pure mathematics, started a PhD, dropped out, became a strategy consultant, got bored drawing PowerPoints, moved to a Google strategy team, tried to solve problems with tools other than Excel.

There is now a niche for companies to apply the tools that have come from online advertising to non-advertising and non-banking problems, e.g. the built environment.

A lot of what they do is a combination of software engineering and data science. It’s sometimes used as a putdown that a data scientist is someone who is better at engineering than a statistician, and better at stats than a programmer, but that breadth is critical to making things work.

Patrick Wolfe (Professor of Statistics, UCL / Executive Director, UCL Big Data Institute)

He has seen data science from different perspectives. He studied electrical engineering and music, and did a PhD that combined these, looking at systems to restore old audio recordings; his interests were very statistical in nature. Has always maintained one foot in electrical engineering and one foot in statistics. Is executive director of the Big Data Institute at UCL.

His personal research interests revolve around networks. His mathematical interests at the moment are about how we understand the structure of large networks.

Thinks that this is a once in a lifetime opportunity for all of us. There is an opportunity for statistics to participate in a dramatic change in how we understand and collect data. The paradigms in how we collect data are clearly changing.

What will it mean to shape the future of data science? We need to create an intellectual core.

That core is related to mathematics, statistics and computer science.

For statistics we must recognise the following paradigm shift: we all learnt about designed small scale experiments, and data was expensive. Now everyone wants to work with found data. It’s the responsibility of statisticians to help people do that. Statisticians have a responsibility to teach people how to draw inferences from found data. They can’t be the community that is always telling people what they can’t do, or how they should have collected the data.

Zoubin Ghahramani (Professor of Machine Learning, University of Cambridge)

Is an accidental statistician. Started out interested in AI and in neuroscience. Couldn’t make his mind up between these two fields, got a PhD in neuroscience, but got interested in making computers adapt to difficult problems.

An old AI paradigm was that you got intelligence in computers when you combined them with data, and that when you move towards machine learning you are more successful. At this point these people needed to learn more statistics. His program now covers Bayesian non-parametric solutions (coincidentally, something my father-in-law worked on).

What is data science? He thinks that the answer depends on whether you talk to academics or people in industry.

People in industry have a very different view from the view from academia. People in industry just have problems that they want to solve. Their buzz phrases are big data, business intelligence, predictive analytics, etc. In industry the largest part of the problem is the collection and curation of the data. Then there is a small part that is the fun part, the machine learning, and then there is a large part of presenting and interpreting the data.

Let’s not overestimate the role of the statistician in this pipeline. There are interesting research challenges in thinking about the whole pipeline.

Academic fields are social constructs; they are networks. This is something we need to overcome: we need fewer barriers between these disciplines and we need more people with these skills.

This is an opportunity, and also a threat to statistics. If statistics does not move quickly enough there are many other disciplines that will want to jump into this space. Is talking to other departments, setting up cross-departmental projects.

Discussion

The point is made that statistics can’t be everything; it can’t get to the point where the definition is too broad.

Statistics has remained a relatively small discipline (I’m not so sure about this). Contrasts stats with electrical engineering. One of the interesting things to watch will be how a small and traditional discipline goes through a growth phase where not everyone will understand what is going on.

The point is made that this happened a while ago with the branching out to things like biometrics - this happened 50 years ago.

The call is made to broaden, not merely shift, the skill sets of statisticians.

The point is also made that the level of communication needs work (across the board). If you are told that you are stupid if you don’t know what a t-test is, then you might, as a computer scientist, just choose to try to run a neural network and do machine learning. It might not be easier in principle as a technique, however the routes to using this technique are easier.

As computation becomes more and more personal, data science is shaping our personal relationships and that is going to draw more people into those fields.

Questions from the floor

More or less - How do you learn to be a data scientist?

The discussion focuses on how it is much easier to learn programming than it is to learn stats. The reasons for that are historical. As a profession, statisticians are professionally trained skeptics. There is almost a visceral reaction to seeing people trying a bunch of things out.

The role as a statistician should not be to nay-say but to help people figure out what they can reasonably say with the data that they have.

Observation - questioner is really concerned with what is going to happen with the A-level mathematics curriculum. The good mathematicians in the class are going to have a bad experience with this new course, and get driven away from the field.

A view of statistics from a machine learning point of view is that ML might be a good way of tricking people into learning statistics!

It kind of depends on whether you view statistics as a mathematical, scientific or engineering discipline.

If you are a mathematician you want to make rigorous statements, if you are a scientist you want to understand the world, and if you are an engineer you want to build things. Historically statistics in some cultures has been thought of as a mathematical discipline, with definite, and not always good, consequences. For example, in the US to get funding for statistics you have to prove your mathematical worth. However, there are many areas of engineering where you can use statistics to help you build things, and where it is harder to get funded as a result of its classification as a mathematical discipline.

The point is made that we need both the adventurous and the conservative. It’s quite important that we retain the ability to be critical (would be nice if we could propagate that ability out to the general public).

It’s also agreed that stats really needs to graduate beyond just being a mathematical discipline.

The rise of data journalism is referenced as a source of hope, a way to convert/communicate to non-specialists the power of statistics and data to help understand the world.

Nice question about the polls in the UK ahead of the election: were these polls found data, or designed data?

The answers are not very conclusive.

Point: industry has lots of problems - 90% computing and maybe 10% stats. People who come from a CS background have a much better chance of succeeding than people who come from a stats background.

The panel discusses. Sometimes recruiters at companies think of machine learning as equivalent to Hadoop or Java. It’s not quite like that. You can gain a basic understanding of these tools, but to go beyond just downloading some software and running it is much harder. There is now demand in the market for people with PhDs who have years of experience.

As the years go by you will start to see a refinement in the job descriptions, e.g. data dev ops, data engineering, data science.

There is a call to inject algorithmic thinking earlier into the statistical curriculum. (There is a two cultures paper looking at algorithmic vs inferential thinking). There is a discussion of the new kinds of trade-offs that will need to be navigated with found data. Teaching algorithms, stats and visualisations at the same time can help.

What is the role for a body such as the RSS? What could they most usefully do to take this agenda forward?

Attract outsiders. There is a large appetite to learn the knowledge that the RSS have, but the apparent barrier to entry is a bit too high.

It’s as if statistics wants to make things as difficult as possible, rather than starting with showing people the joy of statistics, showing people the joy of discovery.

A controversial point is made: one of the non-statisticians on the panel finds classical statistics very hard, but finds Bayesian statistics very common-sensical. People might find it more intuitive to learn Bayesian statistics first, and learn classical statistics later.
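
As an aside of my own (this was not shown at the debate), here is a minimal sketch of why a Bayesian update can feel like common sense. It assumes a coin-flip style problem with a uniform Beta(1, 1) prior; the function names and the 7 heads / 3 tails data are made up for illustration.

```python
from fractions import Fraction


def update_beta(alpha, beta, successes, failures):
    """Conjugate update: a Beta(alpha, beta) prior plus binomial data
    gives a Beta(alpha + successes, beta + failures) posterior."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution, kept as an exact fraction."""
    return Fraction(alpha, alpha + beta)


# Hypothetical example: start with no opinion (uniform prior), then see 7 heads and 3 tails.
prior = (1, 1)
posterior = update_beta(*prior, successes=7, failures=3)

print(posterior)              # (8, 4)
print(beta_mean(*posterior))  # 2/3
```

The belief simply moves from the prior towards the observed proportion as counts are added, which is arguably easier to explain to a newcomer than a sampling distribution.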

It could be great to figure out a way to make academia speed up, and industry slow down (more discussion on this point).

How do you make industry take up what is happening in academia?

The view is that there is not a gap here right now.

Question on the science part of the question from an ecologist. When it comes to asking the question of why, how do you get to an understanding of what are the appropriate questions to ask (I think that was at the heart of the question).

It’s suggested that the service model for this domain still needs to be worked out.

This whole discussion has been about data technology, we are really talking about engineering rather than the creation of new tools.

Often the gap is in data science talking to the business. Especially if the data disagrees with the opinion of an important person in the business.

You can teach students how to communicate by getting them to participate in a few different kinds of projects with different types of audiences.

It’s a strong tradition in statistics to work with someone from a different domain (e.g. natural scientists). Training people to help with the creation of experimental design.

Data science has the opportunity to broker between the people who have the questions and the data that they have to deal with.

John Pullinger - UK National Statistician - summary of the discussion

What they want to do as a society is to welcome everyone.

As the debate went on John was thinking that the RSS was started when old fields were faced with new techniques. One of the founders of the society was Charles Babbage. Another early member was Florence Nightingale. The statistical profession that John knows is mainly in the business of making the world a better place.

Today’s revolution is a once in a lifetime opportunity.

How do you help with the demand side? How do you help to educate people to make use of these tools? We need to educate the problem solvers.

How do you help to solve the supply side? John picked up four themes:

There is a lot of data; it’s just starting, and it’s going to ramp up very fast. But you have to care about what it is and where it comes from. We have to care about method: there is a new methodology emerging at the boundary of these areas, and that deep thought will come from our universities. Technology is driving change. Finally we need skills, the back-to-the-future skill set; we really have to grasp how we teach people these skills.

What are these skills? Some of them are new, but some of them are old standards.

The defence of data skepticism is important, but you can’t be a naysayer; it’s about curiosity. This data skepticism has to be at the heart of this toolset.

Bias is also core: every dataset has bias, and understanding this bias is where you get creativity from statisticians and computer scientists.

Uncertainty is also fundamental.

Discerning which patterns have some meaning in contrast to the random fluctuations of the universe.
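
To make that last point concrete, here is a rough sketch of my own (not something presented on the night) of one common way to ask whether a difference between two groups is more than random fluctuation: a permutation test. The function name, defaults and numbers are hypothetical, and it uses only the Python standard library.

```python
import random


def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Estimate how often a random relabelling of the pooled data gives a
    difference in group means at least as large as the observed one."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        group_a, group_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical measurements from two groups.
group_one = [10, 11, 12, 13, 14, 15]
group_two = [0, 1, 2, 3, 4, 5]
print(permutation_p_value(group_one, group_two, n_permutations=2000))
```

A small value suggests the pattern would rarely arise from shuffling alone; a value near 1 suggests it is indistinguishable from noise.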