Gloria Lau likes to say that her career in data science started “before there was even a thing called data science.” She was a PhD candidate at Stanford University, researching legal informatics – that is, using machine learning and data mining techniques to understand the law. “The law is massive, and massively complicated, so that was the beginning of big data for me – sorting through the linkages in citations and comparing and contrasting them,” Lau says. “In current times, I suppose that’s called data science.”
Out of Stanford she went to work for Thomson Reuters, which has one of the world’s largest legal datasets, then moved to LinkedIn, where she was senior manager of data science. She recently moved to Timeful, where she leads a team threading together data on time management – a project Lau calls “a hugely challenging and important data problem that I am passionate about.” She’s also a consulting faculty member at Stanford – still trying to crack legal informatics – and is a keynote speaker at Big Data TechCon on Tuesday, October 28. Between it all, she found time to answer three questions for Actuate.
Actuate: In your Big Data TechCon keynote, you’ll discuss the “right order of doing data science with limited resources.” What are some common data science challenges organizations face, and what advice do you have for organizations trying to overcome them?
Lau: At a 50,000-foot level, the common data science challenges organizations face are similar to the challenges faced by any other division of a company: how to effectively deploy your limited workforce on the highest priority tasks in the optimal order. In bigger organizations, where workers tend to be (and can afford to be) more specialized, oftentimes the challenge is to assimilate data scientists into vertical teams so that both sides understand the value-add the other provides.
As for advice, in small startups – where resources are exceedingly constrained – the best strategy often is to understand the one metric that needs to be moved. Then you can focus all data science energy – be it an analytics platform, a recommendation product, or an experimentation layer – on delivering on that one metric.
Actuate: Talk about some favorite data science projects you’ve done professionally. What were the objectives, the data sources, and your methods, and what did you find?
Lau: The higher education initiative is by far my favorite project at LinkedIn. LinkedIn has awesome data on your career trajectory, from where you go to school to where you go to work, including what position at what company. The data science behind the project was massive, ranging from standardizing and normalizing the data, to computing similarity between schools, to ranking universities.
One key lesson we learned is that you should standardize your data as soon as you can, and build that layer into the UI as typeaheads [i.e., input fields that suggest standardized entries as the user types] as soon as you can. Slides from my presentation on the project, given at QCon in 2013, are here: http://www.slideshare.net/GloriaLau1/qconsf-applied-machine-learning-and-data-science-track
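To make the typeahead idea concrete, here is a minimal Python sketch of how free-text school names might be mapped to standardized entries as a user types. It is an illustration only, not LinkedIn's implementation: the school list, the aliases, and the function names `normalize`, `build_index`, and `suggest` are all hypothetical, and a production system would likely use a prefix tree or a search index rather than a linear scan.

```python
import unicodedata

# Hypothetical canonical school names mapped to free-text aliases.
CANONICAL_SCHOOLS = {
    "Stanford University": ["stanford", "stanford univ"],
    "Massachusetts Institute of Technology": ["mit", "mass inst of tech"],
    "University of California, Berkeley": ["uc berkeley", "berkeley", "cal"],
}


def normalize(text):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    cleaned = "".join(ch if ch.isalnum() else " " for ch in ascii_only.lower())
    return " ".join(cleaned.split())


def build_index(schools):
    """Map every normalized name or alias to its canonical school name."""
    index = {}
    for canonical, aliases in schools.items():
        for name in [canonical, *aliases]:
            index[normalize(name)] = canonical
    return index


def suggest(prefix, index, limit=5):
    """Return canonical names whose name or alias starts with the typed prefix."""
    key = normalize(prefix)
    if not key:
        return []
    matches = []
    for name in sorted(index):
        canonical = index[name]
        if name.startswith(key) and canonical not in matches:
            matches.append(canonical)
    return matches[:limit]


INDEX = build_index(CANONICAL_SCHOOLS)
```

Because the field only ever stores a canonical name, whatever the user typed ("mit", "Berk", "stanford univ") arrives already standardized, so no separate cleanup pass is needed for those entries later.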
At Timeful, we are building an intelligent calendar to make smart suggestions on when you should do what, given the tasks and habits that you want to complete. It is a super-challenging, multi-objective optimization problem, and we have a really interesting data set at the population level. We then personalize our algorithm given how much usage we see from an individual, and that requires some heavy machine learning algorithms to classify users, tasks, etc.
The key lesson here is to home in on the most important metric and tie that to user groups. You do this because your algorithm is never perfect, but you want it to perform the best for your key user groups.
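One way to picture tying a single metric to user groups is the short Python sketch below. Everything in it is an assumption made for illustration: the metric (the share of suggestions a user accepts), the group labels, and the function names are hypothetical and do not describe Timeful's system.

```python
from collections import defaultdict


def acceptance_rate_by_group(events):
    """Compute the share of accepted suggestions per user group.

    `events` is an iterable of (user_group, was_accepted) pairs,
    a hypothetical log format chosen for this sketch.
    """
    totals = defaultdict(int)
    accepted = defaultdict(int)
    for group, was_accepted in events:
        totals[group] += 1
        accepted[group] += int(was_accepted)
    return {group: accepted[group] / totals[group] for group in totals}


def key_group_score(rates, key_groups):
    """Return the lowest rate among the key groups that appear in the data."""
    present = [rates[group] for group in key_groups if group in rates]
    return min(present) if present else None
```

Tracking the weakest key group, rather than an overall average, keeps a high score in a low-priority group from hiding poor results where they matter most.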
Actuate: In addition to your work at Timeful, you’re a consulting associate professor at Stanford. Compare data science in academic and business settings – what’s different, and where is there overlap?
Lau: Data science in academia and industry are quite different. The biggest problem academic research faces is the lack of data, while the biggest problem industry faces is the lack of trained data scientists. I say marry the two! In all seriousness, there are lots of programs to bridge the two, and I have hired recent grads as well as more seasoned professionals. I advise students to code as much as they can, contribute to open source projects that resonate with them, and not shy away from datasets that need scrubbing and cleaning.
Gloria Lau will give her keynote talk, “Building Data Products: The Right Order of Things,” on Tuesday, October 28 at 11:45 a.m. at Big Data TechCon in San Francisco. Also on the schedule: Mark Gamble, Director of Technical Marketing at Actuate, will present “The Art of Possible: from Raw Data to Real Decisions in 10 Minutes” at a Lightning Talk Session on Tuesday, October 28, at 3:15 p.m. Visit Actuate at Booth #206 to view data visualization tools, dashboards and interactive reports.
Photo credit: Timeful, used with permission.