class: center, middle

# Scikit-learn

## Open source Python machine learning community

Joris Van den Bossche, Guillaume Lemaître

.affiliations[   ]

???

Notes for the _first_ slide!

---
class: center, middle

# Python's scientific ecosystem

####

## Thanks to Jake VanderPlas for the figure

---
class: center, middle, bgheader
background-image: url(img/JakeVdP-ecosystem1.svg)
background-size: cover

---
count: false
class: center, middle, bgheader
background-image: url(img/JakeVdP-ecosystem2.svg)
background-size: cover

---
count: false
class: center, middle, bgheader
background-image: url(img/JakeVdP-ecosystem3.svg)
background-size: cover

---
count: false
class: center, middle, bgheader
background-image: url(img/JakeVdP-ecosystem4.svg)
background-size: cover

---
count: false
class: center, middle, bgheader
background-image: url(img/JakeVdP-ecosystem5.svg)
background-size: cover

---
class: center, middle

# Scikit-learn

---

.center[  ]

### Machine learning for everyone
--- from experts to beginners

* Users everywhere: academia, startups, large companies
* Development focused on collaboration & quality
* Always getting better: faster and easier
* Sustainability is challenging

---

.center[  ]

### Library of Machine Learning algorithms

### Open Source project

### Python / NumPy / SciPy / Cython

### Simple **fit** / **predict** / **transform** API

### Model Assessment, Selection, Ensembles

---
background-image: url(img/sklearn-flow-1.png)
background-size: contain

---
background-image: url(img/sklearn-flow-2.png)
background-size: contain

---
background-image: url(img/sklearn-flow-3.png)
background-size: contain

---
background-image: url(img/scikit-learn.org.png)
background-size: contain

---

# Scikit-learn in numbers

* Almost 1,100 contributors
* Around 12 core contributors
* Estimated cost of development: $2 million
* Estimated effort: 37 person-years

---

# Scikit-learn in numbers

---
class: center, middle

# Contributions to scikit-learn at Paris-Saclay

---

# Loky

### Context:

* Doctoral mission (Thomas Moreau) funded by DigiCosme.

---

# Loky

- `Thread` can run multicore programs efficiently if your code releases the **GIL**.
- `Process` gives efficient multicore execution of pure Python code.

--

- Use `Process` with `spawn` and reuse the pool of processes as much as possible.
- `loky` improves the management of a pool of workers in projects such as `joblib`.

---

# Loky
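
A minimal sketch of the thread / process trade-off, using only the standard library as a stand-in (it does not use `loky` itself). The helper names `sha256_hex`, `hash_with_threads` and `sum_with_processes` are hypothetical, chosen for illustration.

```python
import hashlib
import multiprocessing
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def sha256_hex(data: bytes) -> str:
    # hashlib releases the GIL while hashing large buffers,
    # so several threads can hash on several cores at once.
    return hashlib.sha256(data).hexdigest()


def hash_with_threads(chunks, max_workers=4):
    # Threads are enough here because the heavy work releases the GIL.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(sha256_hex, chunks))


def sum_with_processes(ranges, max_workers=2):
    # The builtin sum() holds the GIL, so use processes instead.
    # "spawn" starts fresh interpreters rather than forking the parent.
    ctx = multiprocessing.get_context("spawn")
    with ProcessPoolExecutor(max_workers=max_workers, mp_context=ctx) as pool:
        # One pool serves every task: workers are reused, not restarted.
        return list(pool.map(sum, ranges))


if __name__ == "__main__":
    chunks = [bytes([i]) * 1_000_000 for i in range(4)]
    print(hash_with_threads(chunks)[0][:16])
    print(sum_with_processes([range(10), range(100), range(1000)]))
```

The process pool is created once and shared by all tasks, which is the reuse pattern the previous slide recommends; the `__main__` guard is needed because `spawn` re-imports the main module in each worker.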

---

# Loky

### Context:

* Doctoral mission (Thomas Moreau) funded by DigiCosme.

### Outcome:

* Important improvements in `joblib` and `scikit-learn`: e.g. nested parallelism, support for OpenMP
* Contributions upstream to Python 3.7
* Experience for people involved in the development

---

# Other contributions

* `ColumnTransformer`
* `CategoricalEncoder`
* `TransformedTargetRegressor`
* `StackingClassifier` and `StackingRegressor`
* ...

Aim: features that enable baseline machine learning algorithms and pipelines

---
class: center, middle

# Open source development

## Collaboration and quality

---
class: center, middle, bgheader
background-image: url(img/workflow-issue.png)
background-size: contain

---
class: center, middle, bgheader
background-image: url(img/workflow-pr.png)
background-size: contain

---
class: center, middle, bgheader
background-image: url(img/workflow-pr-changed.png)
background-size: contain

---
class: center, middle, bgheader
background-image: url(img/workflow-pr-review.png)
background-size: contain

---
class: center, middle, bgheader
background-image: url(img/workflow-pr-ci.png)
background-size: contain

---
class: center, middle, bgheader
background-image: url(img/workflow-pr-ci-travis.png)
background-size: contain

---

# Software engineering best practices

Used in major open source projects:

- Version control (git)
- Bug tracker (GitHub)
- Code review (GitHub)
- Unit testing and Continuous Integration (Travis CI, Appveyor, ...)
- Code quality (flake8, ...)
- Documentation (sphinx)

--

-> valuable skills learned during doctoral missions

---
class: center, middle

# Open source community

---

# OSS: an underestimated resource

---

# Scikit-learn* development

The main bottleneck is the maintainers:

* Reviewing pull requests
* Deciding on design questions
* Tackling bigger issues
* Infrastructure work (CI, packaging, ...)

High-quality open source software is not built at hackathons; it needs highly engaged developers

.grey[.right[.small[*The same is true for other open source packages]]]

---

# How can you help?

### Engage with the community

### Contribute feedback

### Contribute code

### Employ developers / allow employees to contribute

### Fund

---

.center[  ]

### Machine learning for everyone
--- from experts to beginners

* Users everywhere: academia, startups, large companies
* Development focused on collaboration & quality
* Always getting better: faster and easier
* Sustainability is challenging

---

---

# An increasing user base ...

.center[  ]

???

How many pandas users?

Website: 400,000 active users who come back every month.

The scope is also getting bigger.

---

## ... gives increasing maintenance cost

.center[  ]

---

# Pandas* development

The main bottleneck is the maintainers:

* Reviewing pull requests
* Deciding on design questions
* Tackling bigger issues
* Infrastructure work (CI, packaging, ...)

High-quality open source software is not built at hackathons; it needs highly engaged developers

.grey[.right[.small[*The same is true for other open source packages]]]

???

To be clear: hackathons and sprints are great for introducing people to contributing to open source, but on their own they are not a sustainable way to maintain big open source projects.

Note: this is not to dismiss the many great packages that were (initially) built as an evening hack.

---
count: false

# Pandas* development

The main bottleneck is the maintainers:

* Reviewing pull requests
* Deciding on design questions
* Tackling bigger issues
* Infrastructure work (CI, packaging, ...)

High-quality open source software is not built at hackathons; it needs highly engaged developers

---

## How can you help as an individual?

---

# Engage with the community

Community building

- StackOverflow and mailing lists
- Blogging and talks
- Meet-ups and sprints

Contribute feedback

- Open bug reports and enhancement requests
- Join discussions on the mailing list / issues

[https://github.com/scikit-learn/scikit-learn/issues/](https://github.com/scikit-learn/scikit-learn/issues/)

[https://mail.python.org/pipermail/scikit-learn/](https://mail.python.org/pipermail/scikit-learn/)

---

# How can you help?

## Engage with the community

## Contribute code

---

## My first contribution to pandas

--
count: false

--
count: false

### ... and now I am a pandas core dev

---

## How can you help as a company or institution?

--
count: false

### Allow and encourage employees to engage with the community

### Contribute financially

### Employ open source developers (or let employees become one)

???

This doesn't need to be full time: give employees who want it time to contribute to the packages they are using.

---

# Scikit-learn developers

---

# Why engage in open source?

### Quality open source software does not come for free

--
count: false

### Growth path for engineers

???

- This gives your engineers a new dimension to grow in: a career perspective other than becoming a manager.
- It helps you recruit people and show off the quality of your tech teams.

--
count: false

### Better knowledge of the technical stack you are using

---

# How can you help?

### Engage with the community

### Contribute feedback

### Contribute code

### Employ developers / allow employees to contribute

### Donate