What progress did I make?
Welcome to chapter 6 of the machine learning scholar adventure. I’ve been hella busy levelling up my machine learning skills since the last edition and time has flashed by thanks to all the fun I’ve been having! I’m also happy to report that I made my 100th tweet for the #100DaysOfMLCode challenge 🔥 Although there were periods where I didn’t participate and days where I did some ML but forgot to tweet, it still feels good to have started and completed the challenge. My 99th and 100th challenge tweets:
For anyone learning ML or any other skill for that matter, I would highly recommend doing a 100DaysofX challenge where X is the skill you want to develop. Having a written record like this really helped maintain my enthusiasm, enabled me to appreciate the small wins along the way and also helped me become more consistent. I’m sure it can do the same for you! Given all the benefits mentioned, I’m going to keep going as it’s now become an enjoyable habit! On that note, here’s what I’ve been up to recently:
DataCamp: Data Scientist with Python
I’m now 86% of the way through the track and can’t wait to finish! My ability to work with data has definitely improved and this exposure to various methodologies and tools has really broadened my awareness of the data science stack. The certificates for the courses I’ve completed recently are linked below:
- Working with Dates and Times in Python
- Writing Functions in Python
- Exploratory Data Analysis
- Analysing Police Activity with pandas
- Introduction to Statistics in Python
- Introduction to Regression with statsmodels in Python
- Sampling in Python
- Supervised Learning with scikit-learn
I also finished the Data Literacy Fundamentals track, which gave me a high-level overview of related specialties like data engineering and cloud computing. In hindsight, I wish I’d done the entirety of this before beginning the Data Scientist with Python track as it provides basic context which helps bridge the gap to more advanced tracks.
I’m immensely grateful to the FB Flames Foundation for accepting me on their data analytics program, which gives me 1 year’s access to DataCamp Premium for free (normally £250/year). Thanks to this, I’m able to work through tracks more gradually, leaving more time and energy to focus on projects where truly accelerated learning happens! The program currently has a waitlist and is super popular, so I encourage you to apply now for a chance at this amazing opportunity.
I made my first mini-project with GPT-3. Inspired by attending parts of the Deep Learning Labs GPT-3 online hackathon, I made a prototype summarisation tool for legal terms and conditions. Why? Well, I don’t know about you, but I hardly ever read them for online services…so I thought it would be great if GPT-3 could do the reading and explain them to me in simple terms. I used Streamlit to serve the generated model summary. The code is available here and if you have an OpenAI GPT-3 key (it’s free), you can also interact with it. See the README in the GitHub repo for how to get a key.
This project allowed me to explore the fundamental capabilities of GPT-3 and I also learned how useful Streamlit is for showcasing data-centric applications. I’m now working on a more advanced project which involves fine-tuning GPT-3 on a customised data set. I can’t say any more for now because I’d rather build, then tell 😉
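To give a flavour of the approach (this is a sketch rather than the repo’s actual code — the function name and prompt wording are my own illustration), the core of such a tool is just wrapping the raw T&Cs in an instruction before sending them to GPT-3:

```python
def build_prompt(terms_text: str) -> str:
    """Wrap raw terms & conditions in a plain-English summarisation instruction."""
    return (
        "Summarise the following terms and conditions in simple language "
        "that a non-lawyer can understand:\n\n"
        f"{terms_text}\n\nSummary:"
    )

# In the app, the prompt goes to GPT-3 and Streamlit renders the result, e.g.:
# import openai, streamlit as st
# response = openai.Completion.create(
#     engine="text-davinci-002",
#     prompt=build_prompt(raw_terms),
#     max_tokens=256,
# )
# st.write(response.choices[0].text)
```

Keeping the prompt construction in its own function makes it easy to iterate on the wording without touching the app plumbing.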
Last time, I managed to get a score of 0.78468, which corresponded to the top 15% of the rankings. With further refinements, my most recent score is 0.78947, which is in the top 8%. After tuning an XGBoost model with grid search cross-validation, I ensembled all the tuned models together with a voting classifier. You can view my notebook here. I also tidied up the notebook as I want my work to be understandable for myself and future collaborators.
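A minimal sketch of that tune-then-ensemble pattern (not my actual notebook: synthetic data stands in for the engineered Titanic features, and scikit-learn’s GradientBoostingClassifier stands in for XGBoost so the example runs with scikit-learn alone):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the Titanic features/target
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Tune the boosted-tree model with grid search cross-validation
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    cv=3,
)
grid.fit(X_train, y_train)

# Ensemble the tuned model with another classifier via a voting classifier
ensemble = VotingClassifier(
    estimators=[
        ("gb", grid.best_estimator_),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # average predicted probabilities
)
ensemble.fit(X_train, y_train)
acc = ensemble.score(X_test, y_test)
```

Soft voting averages each model’s predicted probabilities, which usually edges out hard (majority) voting when the base models are reasonably well calibrated.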
Interestingly, there are perfect scores for this competition, but I found out these came from people submitting the actual answers to the test set. Given this and the abundance of other competition opportunities, it made no sense to invest more time here when I’d already reached an acceptable performance level, so I left the Titanic to rest.
Things are getting sweeter in the Phrase to Phrase Matching competition. I finished exploring Jeremy’s starter notebook, which helped me get a private score of 0.8076 and a public score of 0.7972. I started working through the iteration notebook but paused so I could focus on more immediate concerns. Whilst the competition finished in June, it’s still possible to make late submissions. Although no points or ranks will be awarded, the value of the learning experience is immeasurable.
For example, just from attempting to reproduce Jeremy’s insights I learned that code competitions on Kaggle don’t allow internet access. This means that training & inference must occur in separate notebooks. Finding out how to do this was non-trivial as there weren’t super clear examples I could adapt (or perhaps I need to get better at searching the platform 🤔). Nonetheless, I remained tenacious and found a notebook which showed me how to save a model. I also had to find out how to reload and use my pretrained model. This search could have been more efficient if I’d stepped back, rested, progressed on another task, then come back with fresh eyes.
With this knowledge, the pipeline started taking shape with separate training and inference notebooks. Whilst I’m not sure the code I’ve got is the best way to do it, I was happy that I got something working and I know more iteration will yield dividends. I still need to figure out how to manage model updating and how to clear and organise the Kaggle folder. I will share the notebooks when I’ve refined them further.
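The skeleton of such a split looks roughly like this (illustrated with a scikit-learn model and joblib rather than my actual competition code — on Kaggle, the file saved by the training notebook gets attached to the inference notebook as a dataset):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# --- "training" notebook: fit and save the model ---
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# On Kaggle you'd save into /kaggle/working; a temp dir is used here so the
# sketch runs anywhere
path = os.path.join(tempfile.gettempdir(), "model.joblib")
joblib.dump(model, path)

# --- "inference" notebook: reload and predict, no internet needed ---
reloaded = joblib.load(path)
preds = reloaded.predict(X)
```

For deep learning models the same shape applies, just with the framework’s own save/load calls in place of joblib.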
I also came up with a framework for approaching competitions. This will help me track my development time, experiments, etc. much better. The stages I identified are:
- Exploration – exploring the data, making my first submission with base model
- Experimentation – trying different models, hyperparameter tuning, ensembling
- Resolution – identifying and refining candidate models further, choosing top model
- Presentation – tidying up the notebook, adding comments, references, crediting sources
My enthusiasm for Hugging Face continues to grow! I attended their superb How to teach open-source ML tools workshop. I learned more about making helpful model cards, how impactful demos can be and how the Hugging Face for classrooms initiative can be used to help others get into machine learning. This really helped me clarify the direction of a future project I wish to start for the community. I’ll keep you posted 🤗
Two Minute Papers
My favourite recent research development is from the Google AI research team. This incredible merger of GPT-3-style language models, computer vision & reinforcement learning-based robotics is the first compelling example of a robot butler I’ve ever seen. Google’s butler can:
- Be given a problem in natural language e.g. I’ve spilled my drink. What should I do?
- Work out the high level solution e.g. The spill should be cleaned
- Identify the individual steps for this to happen e.g. Find cleaning products, go to the spill, wipe the spill
- Check to see if the steps can be followed e.g. look at itself to see if it has usable arms, check the surroundings for required products
- Execute the required steps e.g. picking the products, cleaning the spill
Given the rapid progress from paper to paper, it’s only a matter of time until more complicated tasks will be possible e.g. making my favourite omelette given all the required ingredients 🙏🏾
Lex Fridman Podcast
When I’ve wanted to chill out but still get some AI intel, I’ve been listening to the excellent Lex Fridman podcast. I highly recommend the following:
- Episode 299 ~ Demis Hassabis: DeepMind – AI, super intelligence & the future of humanity
- Episode 215 ~ Wojciech Zaremba: OpenAI Codex, GPT-3, robotics, and the future of AI
Below are resources I have found useful for refreshing and extending my knowledge in various ways:
- Do you want to self-study data science? Learn from my mistakes
- Is competing on Kaggle worth it? Ponderings of a Kaggle grandmaster
- How to use massive AI models in your startup
What am I exploring now?
- DataCamp’s Data Scientist with Python career track
- Radek’s Meta Learning book
- Chip’s Introduction to Machine Learning Interviews book
What did I learn from the challenges I’ve conquered?
- Consistency beats intensity so optimise for consistency (it’s a marathon not a sprint)
- Speaking out loud when learning helps massively with comprehension (there’s real research behind this)
- Rely on community to keep myself engaged (recently got my first accountability buddy and my first mentee)
What are my next steps?
- Finish the fine-tuned GPT-3 endeavour
- Progress further with the phrase matching competition
- Start exploring the Amex competition
- Complete working through the Numerai starter pack
- Make more open-source contributions
Thanks for reading my latest update 🤖 Want to connect? Or see more frequent updates of my journey? Then feel free to reach out and follow me on the platform of your choice 😄
In other news, the trolley problem has been solved. Huzzah 😂