Installing Python for beginners

Students really struggle with setup. By the time they’ve finished setting up Python, Jupyter Notebooks, etc., they’re ready to quit the course and not even learn Python at all — especially students using Windows.

I think with Miniconda I’ve finally tamed that beast. Here are my instructions for students, in one Google doc. Feel free to copy and edit it for your own use.

http://bit.ly/mm-conda

Python, data work, and O’Reilly books

I own many O’Reilly books about code. I’m kind of mad that they quit selling PDFs, because I loved those PDFs for searchability, and the Kindle editions are nowhere near as good (they have layout issues that don’t occur in PDFs).

Recently, though, I bought a hard copy of Python Data Science Handbook, and this inspired me to examine my O’Reilly Python library.

First, a bit about Python Data Science Handbook: It’s a large book, 530 pages, but it has only five chapters:

  1. “IPython: Beyond Normal Python” (all the stuff you can do with the IPython shell, which is different from Jupyter Notebooks)
  2. Intro to NumPy
  3. Pandas
  4. Matplotlib
  5. Machine learning

That list is exactly why I bought this book, even though I already owned others. (See the whole book online.) I especially want to learn more about using Matplotlib in a Jupyter Notebook.

After reading chapters 1 and 2, I went into my older O’Reilly PDFs to see what other Python books I have in that collection. I opened Data Wrangling with Python and ended up spending more time in it than I’d expected. Surprise: not only is it completely different from Python Data Science Handbook, it is also all about the things journalists use Python for the most: web scraping, document management, data cleaning. I don’t know why I’ve never spent more time with that book! (See the table of contents.) The first two chapters explain the Python language for beginners, and then it goes on to the data formats (CSV, JSON, XML) that you need to know about when dealing with data provided by government agencies and the like. There’s a whole chapter on working with PDFs.
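To give a feel for why those formats matter, here is a minimal sketch using only Python 3's standard library. It is not taken from the book; the sample records and the helper names `rows_from_csv` and `rows_from_json` are made up for illustration.

```python
import csv
import io
import json

# Hypothetical sample data, invented for this illustration only.
CSV_TEXT = "agency,year,budget\nParks,2018,1200000\nLibraries,2018,850000\n"
JSON_TEXT = '[{"agency": "Parks", "year": 2018, "budget": 1200000}]'


def rows_from_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_from_json(text):
    """Parse JSON text (a list of objects) into a list of dicts."""
    return json.loads(text)


csv_rows = rows_from_csv(CSV_TEXT)
json_rows = rows_from_json(JSON_TEXT)

# CSV values always arrive as strings; JSON keeps numbers as numbers.
print(type(csv_rows[0]["budget"]).__name__)
print(type(json_rows[0]["budget"]).__name__)
```

The practical point: every CSV value comes in as text and has to be converted before you can do arithmetic on it, while JSON already distinguishes numbers from strings.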

The big downside to Data Wrangling with Python is that the examples and code are Python 2.7. I understand why the authors made that choice in 2015, but now it’s a detriment, as those old 2.7 libraries are no longer being maintained. You can still learn from this book, and if you’re a bit experienced with Python and the differences between 2.x and 3.x, it should be easy to work around any issues caused by the 2.7 code.
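For readers less sure of those differences, here is a short sketch of three that commonly trip up 2.7 examples when run under Python 3. It is my own illustration, not code from the book, and the variable names are arbitrary.

```python
# Python 3 versions of three idioms that often break 2.7 examples.

# 1. print is a function, not a statement (2.7 allowed: print "hello").
print("hello")

# 2. The / operator does true division on integers.
#    Use // to get the floor division that 2.7's / performed on ints.
half = 7 / 2
floor_half = 7 // 2

# 3. dict.items() returns a view rather than a list, and the
#    2.7-only methods iteritems() and has_key() no longer exist.
counts = {"b": 2, "a": 1}
pairs = sorted(counts.items())
has_a = "a" in counts  # replaces counts.has_key("a")
```

Text handling (str versus bytes) also changed between the two versions, which tends to matter in scraping and file-reading code.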

Another criticism I’d offer about Wrangling is that the chapter “Data Exploration and Analysis” uses agate, a Python library designed for journalists, but in 2019 Pandas (another Python library) would be a much better choice.

I’ve been teaching web scraping with Python to journalism students for four years now, and I’ve used a different O’Reilly book, Web Scraping with Python, by Ryan Mitchell, since the beginning. A second edition of Mitchell’s book came out last year, moving from 2.x to 3.x, which is good. (See the table of contents.)

I have several other Python books (including some not from O’Reilly), but as I’m focused here on dealing with data issues (analysis and charts as well as scraping and documents), there’s only one other book I’d like to include in this post. It’s actually not a Python book, but it is from O’Reilly: Doing Data Science, by Schutt and O’Neil. (See the table of contents.) It’s older (published in 2013), but I think it holds up as an introduction to data analysis, algorithms, etc. It even has a chapter titled “Social Networks and Data Journalism.” Charts are in color, which I like very much. There’s not a lot of code in the book — it’s not about showing us how to write the code — and examples are in several languages, including Python, R, and Go.

All four books referenced here are distinctly different from one another. Although there is some overlap, it’s minimal.

(This post was edited in November 2019. After a recent closer reading of several chapters in the first edition of Data Wrangling with Python, I have concluded that it really needs an update, and much of it cannot be comfortably used with today’s libraries.)

Fixing Jupyter Notebook startup error

I previously wrote about Jupyter Notebooks here.

Before the macOS Sierra 10.12.5 update, to launch Jupyter in your browser, you only needed to do this:

  1. Open Terminal.
  2. Enter the desired directory.
  3. Activate your virtualenv (if using one).
  4. At the command line, type:
jupyter notebook

A new window would open in your default web browser, and there were all your Jupyter files.

But after the macOS update, we instead got an execution error: localhost doesn’t understand the “open location” message. And the browser window did not open.

The way to fix it (do it once) is here.

Note: In the config file, I had to search for 

#c.NotebookApp.browser =

instead of the string the author provided.
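If you would rather not scroll through the config file by hand, a small sketch like the following can locate the line. It assumes the config file is in the default location, `~/.jupyter/jupyter_notebook_config.py`; the helper name `find_setting` is hypothetical. If the file does not exist yet, `jupyter notebook --generate-config` creates it.

```python
from pathlib import Path

# The setting name from the note above, without the leading "#",
# so both commented and uncommented lines are found.
MARKER = "c.NotebookApp.browser"


def find_setting(text, marker=MARKER):
    """Return (line_number, line) pairs for lines that mention the marker."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if marker in line
    ]


# Assumed default location of the Jupyter Notebook config file.
config_path = Path.home() / ".jupyter" / "jupyter_notebook_config.py"

if config_path.exists():
    for number, line in find_setting(config_path.read_text()):
        print(number, line)
else:
    print("No config file found at", config_path)
```

This only reports where the line is; editing it is still done by hand, following the linked fix.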

Jupyter Notebooks

To install: Follow the instructions at jupyter.org. Note that installing with pip or pip3 works. That option is on the same page, after the Anaconda part.

You do not have to use Anaconda, which installs a lot of extra things.

You can install into a virtualenv.

After installing (with the virtualenv activated, if you installed it that way), in Terminal, at the bash prompt, type:

jupyter notebook

On macOS with Chrome, a new browser tab opens automatically, showing the folder you were in when you ran the command in Terminal. If you have someone else’s notebook or a folder full of notebooks, you can drop them into that folder using Finder, just like any other files.

Screenshot: Jupyter Notebook

Above: Two folders and one notebook file. Below: A notebook, open for work.

Screenshot: Jupyter Notebook

Create a new notebook file: The New button is on the far right side.

The thing I find hard to remember: You have to press Shift-Return to run the code in one of the boxes (cells), or to render Markdown you wrote. On the Cell menu, there’s an option to Run All.

Menus and icons: Very self-explanatory. Explore them.

File menu: “Revert to checkpoint” lets you roll back to the previous save.

File menu: “Close and Halt.” Saves the current notebook file and closes it.

To quit Jupyter, go to the Terminal and press Control-C (not Command).