Git and GitHub for Data Scientists - What I learnt in Open Source Contribution

By Cheuk Ting Ho

Elevator Pitch

Knowing how to use Git and GitHub is an advantage for data scientists. Working with an engineering team to deploy your work requires basic software development practices. In this talk I will share what I learnt from contributing to open source on GitHub. (Bonus: using Git on ipynb files.)

Description

Do you know what Git and GitHub are? Do you have a GitHub account? How do you use it? Do you just find cool libraries there and copy the code, or have you actually made a Pull Request before? Inspired by a workshop at PyData Berlin, I would like to introduce more data scientists to Git and GitHub: why we should use them and how, and also why we should contribute to Open Source projects and how to do it.

In this talk, we will hold your hand and guide you on a journey from zero to open source contributor. First we will introduce the tool, Git: we explain why we should use version control and guide you step by step from initiating a repo to merging changes into the master branch. After that, we will step up and explain how to use GitHub, from forking (copying) an online repo to making a pull request (PR) to it.

As this is a talk for data scientists, we will also talk about how to use Git with the IPython Jupyter Notebook, a popular IDE tool in data science. An ipynb file is more than just text: it is a custom JSON data structure, so every run of the Notebook generates unnecessary changes behind the scenes and makes tracking difficult. We will introduce a way to apply a Git filter to avoid this problem.
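
To give a flavour of what such a filter can look like, here is a minimal sketch of a "clean" filter written in Python. It is not necessarily the approach shown in the talk. It assumes the nbformat 4 layout, where code cells carry `outputs` and `execution_count` fields, and the names `clean_notebook_text`, `strip_output.py` and `stripoutput` are hypothetical, chosen only for this illustration.

```python
import json
import sys


def clean_notebook_text(text):
    """Return notebook JSON text with run-dependent fields removed.

    Assumes the nbformat 4 layout: a top-level "cells" list in which
    code cells have "outputs" and "execution_count" fields.
    """
    notebook = json.loads(text)
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop rendered results
            cell["execution_count"] = None  # drop the In[n] counter
    # Fixed indentation and key order, so the same content always
    # serialises to the same text and diffs stay small.
    return json.dumps(notebook, indent=1, sort_keys=True, ensure_ascii=False) + "\n"


def main():
    # A Git clean filter receives the file on stdin and must write the
    # version to be stored in the repository to stdout.
    raw = sys.stdin.read()
    if not raw.strip():
        # Empty input (for example an empty file): pass it through unchanged.
        sys.stdout.write(raw)
        return
    sys.stdout.write(clean_notebook_text(raw))


if __name__ == "__main__":
    main()
```

Assuming the script is saved as `strip_output.py`, it could be wired in by adding the line `*.ipynb filter=stripoutput` to `.gitattributes` and running `git config filter.stripoutput.clean "python strip_output.py"`. Git then passes each notebook through the script when it is staged, so the committed version contains only the code and markdown while the working copy keeps its outputs.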

After learning the basics, we will go through some mistakes (that I made so you don’t have to) and challenges that you may expect when making a PR, until it is successfully merged by the core developers. We will also go through different ways that you can contribute to a project, ordered from easiest to hardest in my opinion. We will wrap up the talk by suggesting some reasons why you should contribute to open source, so hopefully you can find some motivation from there.

This talk is for data scientists who use open source libraries (including pandas, numpy, scikit-learn etc.) every day and have little or no experience using Git and GitHub. After the talk, you will know the basics of Git and GitHub, understand why we use version control, know how to use Git with Jupyter Notebook, and know how to start practicing by contributing to Open Source projects.

Notes

Outline:

  • What is Git? What is GitHub? Why version control?

  • Git: workflow of making a branch, a commit and a merge

  • GitHub: workflow of making a PR from a fork

  • Bonus challenge: using Git with our favorite IPython Jupyter Notebook

  • Common mistakes/challenges in contributing to an (open source) project:
    1. working on the master branch, a big no-no
    2. PEP 8 (what is it and how to follow it)
    3. passing the tests: Travis and/or CircleCI
    4. reviews, changes, commit again and review…
  • the joy of PR merges

  • start contributing to your favorite project: where to start?
    1. reporting bugs (provide a reproducible case)
    2. improving documentation (Pandas Sprint)
    3. bug fixing
    4. reviewing and commenting on others’ PRs (in a constructive way)
    5. suggesting new features (and the discussion follows)
    6. implementing new features
  • Why contribute to Open Source:
    1. practice using Git and making PRs to projects
    2. learn the tool that you use
    3. make the tool better
    4. (with the community) own the tool
    5. make the (python/open source) world better :-)