Data Scientist Interview: Benjamin Root


At Dataquest, we strive to help our users get a better sense of how data science works in industry as part of the data science educational process. We’ve started a series where we interview experienced data scientists. We highlight their stories, advice they have for budding data scientists, and the kinds of problems they’ve worked on. This is our second post in this series and is an interview with data scientist and engineer Benjamin Root.

Benjamin Root is a contributor to the Matplotlib data visualization library and focuses on improving documentation as well as the mplot3d toolkit within Matplotlib. Ben was previously a graduate student in meteorology, where his experiences trying to work with data using MATLAB pushed him to learn Python and become further involved in the SciPy community.

How long have you been involved with the Matplotlib project and what pushed you to get involved?

Benjamin: I started using Python back in early 2009 when a couple of colleagues took note of some problems I was having with MATLAB and suggested that I give Python a try. With some example pylab scripts to learn from, I started to convert some of my .m files into .py files. I had some pretty sophisticated MATLAB code, so my colleagues pointed me to the various mailing lists for NumPy, SciPy, and Matplotlib. With the feedback from those lists, I managed to convert all of my MATLAB code into Python, declaring that I would “never again” use MATLAB.

Well, it came time to do new development, and new visualizations, and I kept running into all sorts of nagging bugs and edge cases in NumPy and Matplotlib. I would report the bugs with example code and the developers would usually fix the bug. Eventually, the developers would start nudging me to where the problem existed and I started writing patches on my own. After about a year of submitting patches, John Hunter gave me commit rights. He later remarked that he gives out commit rights to anybody that “annoys” him enough. So, getting involved in Matplotlib came about due to annoying my colleagues, and then annoying the developers…

What does your involvement with Matplotlib entail? What kind of role do you have within the project?

Benjamin: Originally, my focus in Matplotlib was with documentation because I had “fresh eyes”, so it was easier for me to notice mistakes and rough edges. Now that I have read the documentation so many times, I don’t even see the mistakes anymore. By the way, this is why developers encourage newcomers to submit patches to the documentation! I was also trying to sand down the rough edges I kept encountering, particularly with the mplot3d toolkit packaged with Matplotlib. I submitted enough patches to mplot3d that I became its de facto maintainer.

Currently, my focus has been on doing reviews of pull requests on GitHub, along with diagnosing/verifying bug reports. I do still submit new features from time to time (in particular, look for the new “property cycling” and the “classic style” features in the upcoming v1.5 release). I participate in design discussions, trying to make sure that we make good design decisions early on for new features. I will also soon be taking on the role of maintainer of the Basemap toolkit. On the mailing list, I tend to be the one who responds to most of the questions from the newcomers, although the mailing list traffic has decreased substantially since the arrival of Stack Overflow.
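For readers unfamiliar with the two features Benjamin mentions, here is a minimal sketch of how they are commonly used in current Matplotlib releases. It is not taken from the interview: the data, the colors, and the line styles are made up for illustration, and the sketch assumes a Matplotlib version that ships `Axes.set_prop_cycle` and the "classic" style sheet.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from cycler import cycler

x = np.linspace(0, 2 * np.pi, 50)

# Temporarily apply the "classic" style sheet (the pre-2.0 default look).
with plt.style.context("classic"):
    fig, ax = plt.subplots()
    # Adding two cyclers pairs their entries: the first line drawn is
    # red and solid, the second green and dashed, the third blue and dotted.
    ax.set_prop_cycle(
        cycler(color=["red", "green", "blue"])
        + cycler(linestyle=["-", "--", ":"])
    )
    lines = [ax.plot(x, np.sin(x + phase))[0] for phase in (0.0, 0.5, 1.0)]

# Collect the properties each line actually received from the cycle.
styles = [(line.get_color(), line.get_linestyle()) for line in lines]
print(styles)
```

The property cycle generalizes the older color-only cycle: any line property can take part, so a plot can stay readable in grayscale by cycling line styles alongside colors.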

Many data scientists and data engineers rely on open-source libraries but are not sure how to contribute back. What’s your best advice for people who want to get more involved with open source?

Benjamin: Those sorts of users are the best for libraries like Matplotlib. They are often the ones that are working at the boundaries of the library’s capabilities. So, it is very natural for those users to encounter bugs and limitations. File those bug reports! Ask questions on the mailing list, and give feedback and criticisms! Developers do want to hear back from their users because it means their work is being used!

Involvement in open source cannot be counted by the number of patches. Open source development is so much more than just code and documentation. Software does not spontaneously “come about” from the ether. Software fulfills a need and it is the users who define that need. I personally believe that the most valuable contributors to an open source project are the ones who give feedback and criticisms to the developers. Without them, the project stagnates and dies from apathy.

So, my challenge to your readers is to figure out which three libraries and tools they directly use the most and subscribe to their user mailing lists. Then, the next time “something weird” or “unexpected” happens when using one of those tools, don’t brush it off as your own fault, or a limitation that you have to put up with. Send an email to that list asking if anybody else thinks it is weird or unexpected. Press for answers. Encourage discussions. Finally, if a developer asks for feedback on something, take them up on their request.

And please, for the love of all things good, please try out your software on the beta and release candidates of those libraries and report problems!

You published a book recently called Interactive Applications using Matplotlib. What was the purpose of writing this book and who is your target audience?

Benjamin: Mostly, I wanted to improve my own knowledge in this part of Matplotlib, while also producing better documentation for it. Interactivity is such an important feature of Matplotlib, but it is documented so poorly. All examples and tutorials that I could find online only demonstrated various aspects, and they were so disjointed from each other. There was no narrative that would help a user build their understanding from a strong foundation. In the book, rather than making all of the example code completely independent of each other, we build a single application piece-by-piece.

One thing I really like about this narrative approach was a particular moment of satisfaction I had in the widgets chapter where we add a slider and then I point out to the reader that the new slider was automatically tied into the keyboard shortcuts and some other buttons that we did in an earlier chapter. That just simply would not have been possible in your typical example-driven documentation where the demonstration for the keyboard shortcuts would have been completely unrelated to the demonstration for a widget.
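To make the idea concrete, here is a minimal sketch of one way a slider and keyboard shortcuts can end up sharing behavior. It is not the code from the book: the `FrameBrowser` class and its method names are hypothetical, the data is synthetic, and the sketch assumes a Matplotlib version whose `Slider` accepts the `valstep` argument. The key handler only moves the slider, and the slider's own callback does the redraw, so both inputs stay in sync.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from types import SimpleNamespace
from matplotlib.widgets import Slider


class FrameBrowser:
    """Hypothetical helper that steps through the rows of a 2D array."""

    def __init__(self, frames):
        self.frames = frames
        self.fig, self.ax = plt.subplots()
        self.fig.subplots_adjust(bottom=0.2)  # leave room for the slider
        (self.line,) = self.ax.plot(frames[0])

        slider_ax = self.fig.add_axes([0.2, 0.05, 0.6, 0.05])
        self.slider = Slider(slider_ax, "frame", 0, len(frames) - 1,
                             valinit=0, valstep=1)
        # The slider callback is the single place where the plot is updated.
        self.slider.on_changed(self.show_frame)
        # Keyboard shortcuts are routed through the slider.
        self.fig.canvas.mpl_connect("key_press_event", self.on_key)

    def show_frame(self, value):
        self.line.set_ydata(self.frames[int(value)])
        self.fig.canvas.draw_idle()

    def on_key(self, event):
        # Left/right arrows step one frame; other keys leave the value as is.
        step = {"right": 1, "left": -1}.get(event.key, 0)
        new = min(max(self.slider.val + step, self.slider.valmin),
                  self.slider.valmax)
        self.slider.set_val(new)  # triggers show_frame via on_changed


browser = FrameBrowser(np.arange(12.0).reshape(3, 4))
# Stand-in for a real key press, which would normally come from the GUI.
browser.on_key(SimpleNamespace(key="right"))
print(browser.slider.val, browser.line.get_ydata())
```

Routing every input through one piece of state (here the slider value) is the design choice that lets a later widget plug into earlier shortcuts without extra wiring.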

My target audience for the book is those who have some experience with Python and object-oriented programming and have a need to produce interactive visualizations in Python. You don’t need any prior experience with Matplotlib or any other Scientific Python tools, but I also do not spend much time explaining the possible visualizations and how to customize them. There are plenty of tutorials about that. I assume that you already can display a plot the way you want, but need to build tools that interact with that plot.

How did your experience as a graduate student in meteorology shape the work you’re doing now?

Benjamin: Meteorology, like many other sciences, is very much a visual experience. Viewing data as text is a terrible way to understand your data. So, a graduate student must be able to create their own visualizations of their work. There are several tools for that: some are very specific to the atmospheric sciences, like NCL and GEMPAK, while others are more general, such as the MATLAB language.

As a graduate student, I was constantly trying new things that weren’t established, and I was using tools in ways that weren’t thought of before. Therefore, I kept running into bugs in those tools. What really pushed me over to Python from the MATLAB world was the ability to fix those bugs. In addition, there are some features that are now in Matplotlib that were fostered from a meteorology perspective. For example, I often need to visualize my data with very accurate maps, so I have taken a particular interest in projects such as Basemap and Cartopy, making sure they satisfy my needs. Streamplots and wind barbs are a couple of other features that we have cultivated in Matplotlib for meteorologists.
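For readers who have not seen the two meteorology-oriented plot types mentioned here, the following is a small sketch using `Axes.streamplot` and `Axes.barbs`. The wind field is synthetic (a simple rotation invented for illustration), and the grid size, scaling factor, and subsampling step are arbitrary assumptions rather than anything from the interview.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Evenly spaced grid, which streamplot requires.
x = np.linspace(-2, 2, 21)
y = np.linspace(-2, 2, 21)
X, Y = np.meshgrid(x, y)

# Synthetic counter-clockwise rotation standing in for real wind components.
U, V = -Y, X

fig, (ax_stream, ax_barbs) = plt.subplots(1, 2, figsize=(9, 4))

# Streamlines trace the flow direction across the whole field.
stream = ax_stream.streamplot(X, Y, U, V, density=1.0)
ax_stream.set_title("streamplot")

# Wind barbs encode speed and direction at each point; subsample the grid
# so individual barbs stay legible, and scale up the made-up speeds.
step = 4
barbs = ax_barbs.barbs(X[::step, ::step], Y[::step, ::step],
                       10 * U[::step, ::step], 10 * V[::step, ::step])
ax_barbs.set_title("barbs")

n_barbs = len(barbs.get_offsets())
print(n_barbs)
```

Streamlines suit a continuous view of the flow, while barbs follow the plotting convention meteorologists use for station and model wind data.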

What are your thoughts on the emerging data visualization toolkit, Vispy?

Benjamin: Vispy is an amazing project that shows great promise. One of its original components, glumpy, was actually created at my encouragement due to the limitations of the mplot3d toolkit shipped with Matplotlib. Several of the original components of Vispy were created at the behest of Matplotlib developers because we saw the value of performing visualizations on the GPU, but we did not have enough expertise, nor did we have the bandwidth to pursue such a project while also maintaining Matplotlib. We also did not feel that we could add such experimental features into Matplotlib without causing significant disruption to our users. So, we encouraged some developers who were coming to the mailing list with these fantastic ideas to start from a clean slate and make their ideas a reality, even if it was just a proof of concept. Those projects and a few more then merged into Vispy.

At the 2015 SciPy conference, the Matplotlib developer team had the excellent opportunity to talk in-depth with some of the Vispy developers about where our projects stand in the Python ecosystem, what our respective roadmaps are, and what the future holds for our two projects. It was a very exciting discussion. While we don’t have any long-term plans yet, there are definitely some more immediate goals, such as Vispy using Matplotlib’s text-rendering code to improve the visual quality of the plots. We will also be collaborating with the Vispy developers to create some proofs of concept for hooking Vispy into Matplotlib as a backend. The future of data visualization in Python is looking very bright, indeed!