Thoughts on CS7646: Machine Learning for Trading

The 2019 spring term ended a week ago and I’ve been procrastinating on writing about how ML4T (and IHI) went. I’ve known all along that writing is DIFFICULT, but recently it seems significantly more so.

Perhaps it’s because I’ve noticed this blog has been getting a lot more traffic recently. This includes Prof Thad Starner commenting on my post about his course on Artificial Intelligence. This has raised my expectations of my own writing, making it harder for me to start putting pen to paper.

To tackle this, I turned to the Stoic techniques of (i) deciding whether something is within my locus of control, and (ii) internalising my goals. Is it within my control how much traffic my writing receives? No. Is it within my control how much feedback I get on my writing? No.

Instead, what is within my control is writing in a simple and concise way to share my views on the classes, so others can learn from them and be better prepared when they take their own classes. This has been the goal from the start; I guess I lost track of it over time and got distracted by other metrics.

With that preamble, let’s dive into how the ML4T course went.

Why take the course?

My personal interest in data science and machine learning is sequential data, especially on people and behaviour. I believe sequential data will help us understand people better as it includes the time dimension.

In my past roles in human resources and e-commerce, I worked with sequential data to identify the best notifications to send a person. For example, you would suggest a phone case after a person buys a phone, but not a phone after a person buys a phone case. Similarly, in my current role in healthcare, a great way to model a patient’s medical journey and health is via sequential models (e.g., RNNs, GRUs, transformers). I’ve found that these achieve superior results in predicting hospital admissions and disease diagnoses with minimal feature engineering.
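The buy-order asymmetry above (a phone case after a phone, but not the reverse) can be captured even by a simple first-order transition model, before reaching for RNNs. Here’s a minimal sketch; the item names and purchase histories are made up for illustration, not from any real dataset:

```python
from collections import Counter, defaultdict

def build_transitions(purchase_sequences):
    """Count how often each item follows another across customer histories."""
    transitions = defaultdict(Counter)
    for seq in purchase_sequences:
        for prev_item, next_item in zip(seq, seq[1:]):
            transitions[prev_item][next_item] += 1
    return transitions

def suggest(transitions, last_item):
    """Suggest the item most frequently bought right after `last_item`."""
    followers = transitions.get(last_item)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Toy purchase histories: order matters.
histories = [
    ["phone", "phone case", "screen protector"],
    ["phone", "phone case"],
    ["phone case", "screen protector"],
]
transitions = build_transitions(histories)
print(suggest(transitions, "phone"))       # phone case
print(suggest(transitions, "phone case"))  # screen protector
```

Because the counts are directional, the model suggests a case after a phone but never a phone after a case; sequence models like GRUs extend this same idea to longer histories.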

Thus, when I heard about the ML4T course, I was excited to take it to learn more about sequential modelling: stock market data is full of sequences, especially where technical analysis is concerned. In addition, framing the problem and data from a machine learning and reinforcement learning perspective should provide useful lessons that can be applied to other datasets as well (e.g., healthcare).

Continue reading

What does a Data Scientist really do?

As a data scientist, I sometimes get approached with questions about data science. This could be at work, at the meetups I organise and attend, or via my blog or LinkedIn. Through these interactions, I’ve realised there is significant misunderstanding about data science, both about the skills needed to practice it and about what data scientists actually do.

Perception of what is needed and done

Many people perceive that deep technical and programming abilities, olympiad-level math skills, and a PhD are the minimum requirements, and that having such skills and qualifications will guarantee success in the field. This is unrealistic and misleading, and does not help to mitigate the scarcity of data science talent, such as that described here and here.

Similarly, based on my interactions with people, as well as comments online, many perceive that a data scientist’s main job is machine learning, or researching the latest neural network architectures: essentially, Kaggle as a full-time job. However, machine learning is just a slice of what data scientists actually do (personally, I find it constitutes less than 20% of my day-to-day work).

How do these perceptions come about?

One hypothesis is the statistical fallacy of availability (i.e., the availability heuristic). The average person would probably know about data scientists based on what they’ve seen or heard in the news and articles, or perhaps a course or two on Coursera.

What’s likely to be the background of these data scientists? If it’s from this article on the recent Turing Award for contributions in AI, you’ll find three very distinguished gentlemen who have amazing publication records and introduced the world to neural networks, backpropagation, CNNs, and RNNs. Or perhaps you read the recent article about how neural networks and reinforcement learning achieved human-expert-level performance, and found that the team was largely composed of PhDs. If it’s from a course, the instructor is likely to have a PhD and to walk through deep mathematical proofs of machine learning techniques. Thus, based on what comes to mind, or what is available in memory, many people have a skewed perception of what background a data scientist should have.

The same goes for what data scientists actually do. Most of the sexy headlines on data science involve using machine learning to solve (currently) unsolvable problems, everything from research-based (computer games) to very much applied (self-driving cars). In addition, given that the majority of data science courses are on machine learning, it’s no wonder that the statistical fallacy of availability skews people towards thinking that machine learning is the be-all and end-all.

Continue reading

A Primer on Electronic Health Records (EHRs)

My team recently had a brownbag on the types of healthcare data available and I took the opportunity to share a bit about electronic health records. Other types of data shared included the MIMIC dataset and imaging data (e.g., X-rays, CTs, MRIs). I received feedback that it was a useful “EHR 101” and thought I’d share it here too.

Healthcare has a problem

In most places around the world, primary care and hospitals maintain their own, distinct systems for electronic medical record (EMR) data. As a result, patient and medical data across different providers are incompatible with each other, leading to a lack of interoperability.

Providers want to control all digital records of their patients, ensuring patient retention. This leads to data being siloed at each institution. Patients’ prescriptions, lab tests, diagnoses, etc. are not visible across institutions, contributing to significant wastage.

The other problem is that of poor usability. Often, these systems don’t account for human-computer interaction principles. Thus, clinicians often spend more time talking to their laptops than to their patients, contributing to clinician burnout. Furthermore, while the systems work and data gets dumped in, the data is often in such a mess that it is practically unusable.

Enter the electronic health record (EHR).

Continue reading

Data Science and Agile (Frameworks for effectiveness)

This is the second post in a 2-part sharing on Data Science and Agile. In the last post, we discussed the aspects of Agile that work, and don’t work, in the data science process. You can find the previous post here.

A quick recap of what works well

Periodic planning and prioritisation: This ensures that sprints and tasks are aligned with organisational needs, allows stakeholders to contribute their perspectives and expertise, and enables quick iterations and feedback.

Clearly defined tasks with timelines: This helps keep the data science team productive, on track, and able to deliver on the given timelines; the market moves fast and doesn’t wait.

Retrospectives and demos: Retrospectives help the team to improve with each sprint, and provide feedback and insight into pain points that should be improved on. Demos help the team to learn and get feedback from one another. If stakeholders are involved, demos also provide a view into what the data science team is working on.

What about aspects that don’t work well? And how can we get around them?

Difficulty with estimations: Data science problems tend to be more ill-defined, with a larger search space for solutions. Thus, estimations tend to be trickier, with a larger variance in error. One way around this is to set budgets for story points / man-days, and to time-box the experiments.

Rapidly changing scope and requirements: The rapidly evolving business environment may bring with it constantly changing organisational priorities. To mitigate this, hold periodic prioritisations with stakeholders to ensure alignment. This also helps stakeholders better understand the overhead cost of frequent context switching.

Expectations for engineering-like deliverables after each sprint: Project managers and senior executives with an engineering background might expect working software with each sprint. This may require some engagement and education to bring about a mindset change. While the outcome from each sprint may not be working code, it is still valuable (e.g., experimental results, research findings, learnings, next steps).

Being too disciplined with timelines: A happy problem is being too efficient and aligned with business priorities. Nonetheless, a data science team should be working on innovation. To take a leaf out of Google’s book, a team can build in 20% innovation time. Innovation is essential for 10x improvements.

How to adapt Agile for Data Science

In light of the points discussed above, how can we more effectively apply agile/scrum to data science?

Here, I’ll share some frameworks/processes/ideas that worked well for my teams and me; hopefully, they’ll be useful for you too. Namely, they are:

  • Time-boxed iterations
  • Starting with Planning and Prioritisation, Ending with Demo and Retrospective
  • Writing up projects before starting
  • Updated mindset to include innovation

Continue reading

Data Science and Agile (What works, and what doesn’t)

Since I last posted on moderating a panel on Data Science and Agile, some have reached out for my views on this. This topic is also much discussed in the data science community, with questions on how agile can be incorporated into a data science team, and how to realise the gains in productivity.

Can agile work well with data science? (Hint: If it can’t, this post, and the next, won’t exist.)

In this post, we’ll discuss the strengths and weaknesses of Agile in the context of Data Science. At the risk of irritating agile practitioners, I may refer to Agile and Scrum interchangeably. Nonetheless, do note that Scrum is an agile process framework, and there are others such as Kanban, etc. In the next post, I’ll share some agile adjustments and practices that have proven to be useful, at least in the teams I’ve led. Stay tuned!

Data science is part software engineering, part research and innovation, and fully about using data to create impact and value. Aspects of data science that work well with agile tend to be of a more engineering nature, while those more closely related to research tend not to fit as well.

Continue reading

Thoughts on CS6601: Artificial Intelligence

Happy holidays! I’ve just completed the exceptionally difficult and rewarding course on artificial intelligence, just as my new role had me putting a healthcare data product into production (press release here). The timing could not have been better. The combination of both led to late nights (due to work) and weekends spent completely at home (due to study).

Why take this course?

I was curious about how artificial intelligence would be defined in a formal education syllabus. In my line of work, the term “Artificial Intelligence” is greatly overhyped, with snake oil salesmen painting pictures of machines that learn on their own, even without any new data, or sometimes without data at all. There are also plenty of online courses on “How to do AI in 3 hours” (okay, maybe I’m exaggerating a bit; it’s “How to do AI in 5 hours”).

Against this backdrop, I was interested to know how a top CS and Engineering college taught AI. To my surprise, it included topics such as search, adversarial search (i.e., game playing), constraint satisfaction, logic, optimization, and Bayes networks, to name a few. This increased my excitement about learning the fundamentals of using math and logic to solve difficult problems.
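To give a flavour of one of those topics, adversarial search: the classic minimax algorithm scores a game tree by assuming both players play optimally, the maximizer picking the best child and the minimizer the worst. A toy sketch with a hypothetical hand-made game tree (this is my illustration, not material from the course):

```python
def minimax(node, maximizing):
    """Score a game tree where leaves are numeric payoffs and
    internal nodes are lists of child subtrees."""
    if isinstance(node, (int, float)):
        return node  # leaf: the payoff itself
    scores = [minimax(child, not maximizing) for child in node]
    return max(scores) if maximizing else min(scores)

# Maximizer moves first, minimizer replies; each inner list is one subtree.
tree = [[3, 12], [2, 4], [14, 1]]
print(minimax(tree, True))  # 3
```

The minimizer turns the subtrees into values 3, 2, and 1, so the maximizer’s best guaranteed outcome is 3; techniques like alpha-beta pruning (also covered in the course) avoid exploring branches that cannot change this answer.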

In addition, the course had very good reviews (4.2 / 5, one of the highest), with a difficulty of 4.3 / 5 and an average workload of 23 hours a week. Based on these three metrics, AI was rated better, more difficult, and more time-consuming than Machine Learning, Reinforcement Learning, and Computer Vision. Challenge accepted! It was one hell of a ride, and I learnt a lot. However, if I had to go back in time, I’m not sure if I would want to put myself through it again.

What’s the course like?

The course is pretty close to the real deal on AI education. Readings are based on the quintessential AI textbook “Artificial Intelligence”, co-authored by Stuart Russell and Peter Norvig. The latter is a former Google Search Director who also guest lectures on Search and Bayes Nets. Another guest lecturer is Sebastian Thrun, founder of Udacity and Google X’s self-driving car program. The main lecturer, Thad Starner, is an entrance examiner for the AI PhD program and draws from his industry experience at Google (where he led the Google Glass development) when structuring assignments.

Continue reading

Data Science and Agile—can or not?

Recently, I was invited to moderate a panel on the topic “Data Science and Agile–can or not?” It’s a Singlish way of asking if Agile can be applied in the domain of data science. The panel was held in conjunction with GovTech’s inaugural STACK conference for developers, programmers, and technologists from the private sector.

[Photo: the panel at GovTech’s STACK conference]

Who was in the panel?

The panel involved the following guests, from right to left in the photo above:

  • Ivan Zimine: Physicist and neuroscientist who works on complex systems while applying open source and open practices.
  • Adam Drake: Formerly Chief Data Officer at Skyscanner and Redmart, with an exemplary record in the design, development, and delivery of cost-effective, high-performance tech teams and systems.
  • Steven Koh: Director of Government Digital Services at GovTech leading the Agile Consulting and Engineering team and evangelising agile development in the government.
  • Eugene Yan (that’s me as moderator): Formerly VP of Data Science at Lazada (acquired by Alibaba), currently Senior Data Scientist at uCare.ai.

Continue reading