SciPy Tips & Snippets

This is a live version of my personal SciPy Tips & Snippets file. I wrote this for myself as a reference when learning how to use SciPy, and I thought that it might be helpful for others.

Initial setup for IPython notebooks

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

This is the minimum of what is needed. For more, see ipython-setup.

You also might want some notebook extensions:

pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user

Pandas DataFrames

Assume there is a DataFrame called df.

Importing data

To get this DataFrame, you can import from…

  • CSV: df = pd.read_csv("path/to/csv.csv")
  • Stata: df = pd.read_stata('path/to/stata_data.dta')
  • Excel: df = pd.read_excel('/path/to/excel.xls')
    • You will need to pip install xlrd first for this to work (Pandas uses it internally to read Excel files).

Reading from MySQL

from sqlalchemy import create_engine

engine = create_engine('mysql://root:password@localhost/database')
sql = "select * from tablename"
df = pd.read_sql(sql, engine)

Empty dataframes

If you want to make an empty dataframe and then fill it in, do this:

df = pd.DataFrame()
df = df.append(pd.DataFrame({
    "column1": [1],
    "column2": [2]
}), ignore_index=True)

That will add one row to df where column1 = 1 and column2 = 2.
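
If you are filling in many rows this way, repeated append calls can get slow. Here is a minimal sketch of an alternative that collects plain dicts and builds the DataFrame once at the end (the column names and values are just placeholders):

rows = []
for i in range(3):
    # Collect each row as a dict
    rows.append({"column1": i, "column2": i * 2})

# Build the DataFrame once instead of appending inside the loop
df = pd.DataFrame(rows)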

Exporting data

  • To CSV: df.to_csv(file_name, encoding='utf-8')
  • To Stata: df.to_stata("/path/to/stata/file.dta")

Tips for exporting to Stata

Pandas is fussy about df.to_stata(...). Here are some workarounds for common problems:

  • Exporting datetime columns: df.to_stata('f.dta', convert_dates={'datetime_col_name': 'tc'})
    • See the docs for a full list of alternatives to tc.
  • “Writing general object arrays is not supported” error:
    • Figure out which of your columns are of object type: list(df.select_dtypes(include=['object']).columns)
    • Convert them to something else: df['col'] = df['col'].astype(str)
    • This code may help:

      # Attempt to auto-convert columns from `object` to a useful type.
      # Note that string columns may still have `dtype=object`, but will still
      # be able to export to Stata format.
      type_pref = [int, float, str]
      for colname in list(df.select_dtypes(include=['object']).columns):
          for t in type_pref:
              try:
                  df[colname] = df[colname].astype(t)
              except (ValueError, TypeError):
                  pass
              else:
                  break  # stop at the first type that works

Feather

feather (GitHub) is a “fast on-disk format for data frames.” It is similar to pickle but for data frames.

After pip install feather-format, you can:

# Writing to a Feather file
import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)

# Reading from a Feather file
df = feather.read_dataframe(path)

For long-running reads from external sources, this pattern might be helpful:

import hashlib  # for md5 checksum
import os.path  # for checking if file exists

data_source = "data/source.xlsx"
md5 = hashlib.md5(data_source).hexdigest()
data_source_feather = "data/source-%s.feather" % md5

if os.path.isfile(data_source_feather):
    print("Loading %s..." % data_source_feather)
    df = feather.read_dataframe(data_source_feather)
else:
    df = pd.read_excel(data_source)
    feather.write_dataframe(df, data_source_feather)

The first time this code runs, it will be slow as Pandas reads from the original data source. Subsequent runs should be very fast.

DataFrame information

  • Get dimensions: df.shape
  • Get list of columns: df.columns

Selecting columns and rows

  • Select columns: df[['col1', 'col2', ..]]
  • Selecting rows:
    • df.loc[...] is used to select based on labels in the index
    • df.iloc[...] is used to select based on integer positions in the index
    • df.ix[...] tries to use labels, and falls back to positions if the label is not in the index. Only use this when you need to mix labels and positions (like selecting rows based on label and columns based on position). Note that .ix is deprecated in newer versions of Pandas, so prefer loc or iloc where possible. (A short sketch of label vs. position selection follows this list.)
  • Selecting columns and filtering rows at the same time:
    • df[['col1', 'col2', ..]][df['col1'] == "criteria"]
    • If you want to modify cells, you have to use loc, iloc, or ix to avoid this warning: “A value is trying to be set on a copy of a slice from a DataFrame”.
      • Syntax: df.ix[df['column_name'] < 1, ['column_to_change']] = 2
  • Selecting a specific cell by row and column name: df.loc['row name', 'col name']
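
Here is a minimal sketch of the label vs. position distinction described above (the index labels and values are made up):

df = pd.DataFrame({'col1': [10, 20, 30]}, index=['a', 'b', 'c'])

df.loc['b', 'col1']    # label-based: the row labeled 'b' -> 20
df.iloc[1]['col1']     # position-based: the second row -> 20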

Selecting rows with a MultiIndex

Here are some examples of how to select specific rows with a two-level index (also known as a hierarchical index or a MultiIndex):

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({("bar", "a"): {"col1": 1}, ("bar", "b"): {"col1": 2}, ("baz", "a"): {"col1": 3}, ("baz", "b"): {"col1": 4}}).T

In [3]: df
Out[3]:
       col1
bar a     1
    b     2
baz a     3
    b     4

# Select a specific row
In [4]: df.loc[("bar", "a"), :]
Out[4]:
col1    1
Name: (bar, a), dtype: int64

# Select all rows with a specific top-level index
In [5]: df.loc["bar", :]
Out[5]:
   col1
a     1
b     2

# Select all rows with a specific second-level index
In [6]: df.loc[(slice(None), "a"), :]
Out[6]:
       col1
bar a     1
baz a     3

Renaming columns and rows

  • Rename columns: df.columns = ["col 1 name", ...]
  • Renaming a single column: df.rename(columns={'oldname': 'newname'}, inplace=True)
  • Renaming rows: df = df.set_index('column name'), which will cause the rows to be labeled with the values of the specified column. You may have to create a new column first to get what you want.
  • Removing the label for an index (this is the label that appears above row labels, which I think is confusing): df.index.names = [None]
  • Round everything to 1 decimal place: np.round(df, decimals=1)
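
Here is a short sketch tying a few of these together (the column names and values are placeholders):

df = pd.DataFrame({'oldname': [1.23, 4.56], 'label': ['x', 'y']})
df = df.rename(columns={'oldname': 'newname'})
df = df.set_index('label')      # rows are now labeled 'x' and 'y'
df = np.round(df, decimals=1)   # newname is now 1.2 and 4.6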

Deleting columns and rows

  • Delete a column: df = df.drop('column_name', axis=1)

Recoding data

  • Based on value: df['var'] = df['var'].replace('oldval', 'newval')

Subsetting data

  • Unique values from a column: np.unique(df['colname']) or df['colname'].unique()
  • Filtering rows using a list of values: df[df['col'].isin(whitelist)] will yield all the rows in df where the value for col appears in whitelist, which can be a pd.Series(...) or a plain list
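
For example, a minimal sketch of the whitelist filter (the values are made up):

whitelist = pd.Series(['a', 'b'])
df_subset = df[df['col'].isin(whitelist)]  # keeps only rows where col is 'a' or 'b'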

Aggregating data

  • I often find myself wanting to sum a variable by group. For example, I might want the number of records for each day in the dataset.

    This can be accomplished easily with the following Stata code:

    gen one = 1
    collapse (sum) one, by(datevar)

    Translated to Python, this is:

    df['one'] = 1
    collapsed = df.groupby(['date'])['one'].agg(np.sum)

    If you aggregate by a variable with a datetime type, and you want to access specific values, you can use collapsed[pd.datetime(2016,1,1).date()].

  • You can aggregate with a custom method like this:

    df.groupby('group_by_this_column_name')['aggregate_this_column_name'].agg({'output_column_name': custom_method})

    The custom method should have one argument, which is a Pandas Series object containing the values of aggregate_this_column_name for a given value of group_by_this_column_name.

  • You can aggregate multiple columns with different methods like this:

    aggregate = ['aggregate1', 'aggregate2']
    df.groupby('group_by_this_column_name')[aggregate].agg({
        'aggregate1': np.sum,
        'aggregate2': (lambda x: x.iloc[0])  # Gets first value
    })
  • If you want to run multiple aggregation methods for the same column or just show more descriptive names, you can do this:

    aggregate = ['aggregate1', 'aggregate2']
    df.groupby('group_by_this_column_name')[aggregate].agg({
        'aggregate1': {'sum': np.sum},
        'aggregate2': {'sum': np.sum, 'first_value': (lambda x: x.iloc[0])}  # Gets first value
    })
  • Adding a running count column by group (first row for a given groupvar will be 0, second will be 1, etc.):

    df['running_count'] = df.groupby(['groupvar']).cumcount()
  • Adding a total count column by group (if there are 3 rows with a given groupvar, all 3 rows will have total_counts set to 3):

    total_counts = df.groupby(['groupvar']).size()
    df = df.set_index('groupvar')
    df['total_counts'] = total_counts
    df = df.reset_index()
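
    If your Pandas version supports it, groupby().transform() can do the same thing in one step without touching the index; a minimal sketch:

    df['total_counts'] = df.groupby('groupvar')['groupvar'].transform(len)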

Other data manipulation

  • Concatenate two columns (a string and integer): df['bar'] = df['bar'].str.cat(df['foo'].values.astype(str), sep=' is ') (source)
  • Change type of column: df['col'] = df['col'].astype(str)
    • For changing to numeric, this works on recent versions of Pandas: df['col'] = pd.to_numeric(df['col']) (see the note after this list for handling values that can't be parsed)
  • Changing the case of a string column: df['col'] = df['col'].str.upper(), or df['col'] = df['col'].apply(lambda x: x.upper()) if you need to deal with unicode strings. Use lower() instead of upper() to lowercase.
  • Iterating over rows in a DataFrame: for index, row in df.iterrows():
  • Sorting: df.sort_values(['col1', 'col2'], ascending=[True, False], inplace=True)
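
As a note on the pd.to_numeric conversion mentioned above: if some values can't be parsed as numbers, passing errors='coerce' turns them into NaN instead of raising an error. A minimal sketch:

df['col'] = pd.to_numeric(df['col'], errors='coerce')  # unparseable values become NaN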

Getting the previous row value for long longitudinal data

Sometimes it’s helpful to get a value from row i-1 in row i with long data. This would fill in the prev_value column in the table below.

id   obs_num   value   prev_value
1    1         7       NaN
1    2         10      7
1    3         18      10
2    1         3       NaN

To do this:

df = pd.DataFrame([
    {'id': 1, 'obs_num': 1, 'value': 7},
    {'id': 1, 'obs_num': 2, 'value': 10},
    {'id': 1, 'obs_num': 3, 'value': 18},
    {'id': 2, 'obs_num': 1, 'value': 3},
])
df['prev_value'] = df.groupby('id')['value'].shift()

# In : df
# Out:
#    id  obs_num  value  prev_value
# 0   1        1      7         NaN
# 1   1        2     10         7.0
# 2   1        3     18        10.0
# 3   2        1      3         NaN

Missing data

  • Get rows where a numeric column is missing: df[pd.isnull(df['colname'])]
  • pd.notnull can be used in the same way for the opposite effect
  • The fillna() method can be used to fill missing data.
    • Fill with 0s: df['col1'] = df['col1'].fillna(0)
    • Fill with previous value: df['col1'] = df['col1'].fillna(method='ffill')

Duplicate data

  • Dropping duplicates
    • df = df.drop_duplicates(['col_with_duplicates'], keep='first')
    • Options for keep= are first, last, and False (which drops all duplicates)
  • Counting duplicates of the combination of col1 and col2

    counts = df.groupby(['col1', 'col2']).size().reset_index().rename(columns={0: 'count'})
    print(counts[counts['count'] > 1])

Merging data

Appending data frames

  • df_both = df.append(df2)

Recoding variables

  • df['recoded'] = df['original'].map({"original_value_1": "new1", "original_value_2": "new2"})
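
    Note that map() turns any value not listed in the dictionary into NaN. A minimal sketch (the values are made up):

    df = pd.DataFrame({'original': ['original_value_1', 'something_else']})
    df['recoded'] = df['original'].map({'original_value_1': 'new1'})
    # 'original_value_1' -> 'new1', 'something_else' -> NaN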

Miscellaneous

  • df.T will transpose the DataFrame

SciPy

  • Truncated normal distribution:

    from scipy import stats

    # Set up a truncated normal distribution
    lower = 0.2
    upper = 0.8
    mu = 0.5
    sigma = 1
    distribution = stats.truncnorm(
        (lower - mu) / sigma, (upper - mu) / sigma, loc=mu, scale=sigma)

    # Draw once from the distribution
    draw = distribution.rvs(1)[0]

Statistics

  • Frequencies table for a column: df['column_name'].value_counts()
  • Descriptive statistics for a column: df['column_name'].describe()
  • Two-way tables:

    grouped = df.groupby(['var1', 'var2'])
    grouped.size()

    # Output:
    # var1    var2
    # value1  value1    3
    # value1  value2    1
    # value3            1
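
    pd.crosstab builds a similar cross-tabulation laid out as a table (one variable per axis) rather than as a stacked Series; a minimal sketch:

    pd.crosstab(df['var1'], df['var2'])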

Graphing

  • Must-read explanation of matplotlib – read this before you start making graphs.
  • Basic setup:

    import matplotlib.pyplot as plt

    # http://matplotlib.org/faq/usage_faq.html
    fig = plt.figure(figsize=(6*1.5, 4*1.5))
    ax = fig.add_subplot(1, 1, 1)
    ax.hist(df['col'], color="k", alpha=0.4)
  • Axis titles: plt.ylabel("Y axis title here")

  • Plot title: ax.set_title('Plot title here')

  • Axis range: ax.set_xlim([min, max])

  • Setting figure size: fig = plt.figure(figsize=(6*1.5, 4*1.5))

  • Histograms with evenly spaced bins for integers: ax.hist(df['col'], color="k", alpha=0.4, bins=range(0, max(df['col']) + 10, 10))

    • 10 is the bin width
    • More info is available on StackOverflow, including how to do this with floats
  • Use GridSpec for easy multi-graph figures, where graphs have different sizes.

    For example:

    import matplotlib.gridspec as gridspec

    # http://bit.ly/1U6YKGm
    fig = plt.figure(figsize=(10, 8))
    gs = gridspec.GridSpec(2, 2, height_ratios=[3, 1])
    ax = plt.subplot(gs[0])
    ax.hist(...)
    ...

    This is for a 2x2 grid of graphs. The top two graphs will be taller than the bottom two.

    See this blog post for more information and full example code.

  • Common axis labels for plots with subplots:

    fig.text(0.5, 0.04, 'common xlabel', ha='center', va='center')
    fig.text(0.06, 0.5, 'common ylabel', ha='center', va='center', rotation='vertical')
  • Add more vertical padding between rows of subplots: fig.subplots_adjust(hspace=0.35)

  • Saving a figure to a PDF

    from matplotlib.backends.backend_pdf import PdfPages

    ppp = PdfPages('/path/to/file.pdf')
    ppp.savefig(fig)
    ppp.close()
  • Keep a plot from automatically showing in a notebook: plt.close()

IPython notebooks

  • "Line magics" (like %load and %run) – I definitely recommend reading through the official docs or this blog post
    • %pdb will open an interactive prompt when there’s an exception
  • Have None be the last line in the cell to avoid junk output like Out[1]: <matplotlib.text.Text at 0x115933850>
  • Want to automatically open Chrome when you run jupyter notebook? Add this to /Users/you/.jupyter/jupyter_notebook_config.py: c.NotebookApp.browser = u'/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome %s'. If you don’t have that config file, jupyter notebook --generate-config will generate it.

General Python stuff that might be useful

  • Explanation of import, etc.
  • Static, class, and abstract methods
  • Merging dicts
  • Magic methods
  • Iterating over a dict: for k, v in {'key': 'value'}.iteritems()
  • Print rounded float as a string with 3 decimals: "%.3f" % float_var
  • Write a string to a file:

    with open("Output.txt", "w") as text_file:
        text_file.write("Purchase Amount: %s" % TotalAmount)
  • Intermediate Python book

  • Regular expressions:

    • Check for match: if re.compile("^.*$").match(str):
    • Quick way to get match out of string: re.compile("regexp(.*)").search(str).groups()[0]
  • Interactive breakpoint (works when running python myprogram.py):

    import pdb       # at top of file
    pdb.set_trace()  # where you want the breakpoint
    • If you want syntax highlighting in the debugger and tab completion, use pdb++, which is a drop-in replacement for pdb. You literally pip install pdbpp and change nothing else.
  • Colors in stack traces when you run python somefile.py in the Terminal:

    • First, pip install colored-traceback
    • Then, add this code to your Python file:

      try:
          import colored_traceback.auto
      except ImportError:
          pass
  • For debugging, awesome_print can be very helpful.

    # First, `pip install awesome_print`
    from awesome_print import ap

    ap({"test": [1, 2, 3]})
  • Raising a generic exception: raise Exception("message")

  • Getting a datetime with the correct local timezone (Python 2):

    import datetime as dt
    from tzlocal import get_localzone
    from pytz import timezone

    dt.datetime.now(timezone("UTC")).astimezone(get_localzone()).strftime("%Y-%m-%dT%H:%M:%S %Z")

Logging

The basic logging tutorial in the official Python docs is a good place to start. Here’s an example from there:

import logging

logging.basicConfig(filename='example.log', level=logging.DEBUG)
logging.debug('This message should go to the log file')
logging.info('So should this')
logging.warning('And this, too')
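
For anything larger than a one-off script, the usual pattern is one logger per module rather than calling logging.debug() and friends directly; a minimal sketch:

import logging

logger = logging.getLogger(__name__)   # a logger named after the current module
logger.warning("something looks off")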

Testing

Python comes with built-in testing capabilities.

import unittest

class TestSomething(unittest.TestCase):
    def setUp(self):
        # Runs before each test method
        pass

    def tearDown(self):
        # Post-test clean-up, run after each test method
        pass

    def test_a_thing(self):
        self.assertEqual(1, 1)

There are a bunch of built-in assertSomething methods:

assertDictEqual, assertEqual, assertTrue, assertFalse, assertGreater, assertGreaterEqual, assertIn, assertIs, assertIsInstance, assertIsNone, assertIsNot, assertIsNotNone, assertItemsEqual, assertLess, assertLessEqual, assertListEqual, assertMultiLineEqual, assertNotAlmostEqual, assertNotEqual, assertTupleEqual, assertRaises, assertRaisesRegexp, assertRegexpMatches

Tests can be skipped with the @unittest.skip("message here") decorator.
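
For example, a minimal sketch of a skipped test:

import unittest

class TestSomething(unittest.TestCase):
    @unittest.skip("not implemented yet")
    def test_a_thing(self):
        self.assertEqual(1, 1)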

Tests can be run with nosetests -v (put your tests in tests/, start each filename with test_, and each test method with test_).

To get some color in the test output, pip install pinocchio and then run nosetests -v --spec-color --with-spec.

I’m also trying out nose-progressive for colorizing nosetests tracebacks: pip install nose-progressive and then nosetests -v --with-progressive --logging-clear-handlers. This is not compatible with pinocchio, and it seems to crash sniffer.

To automatically run tests when a file changes, pip install sniffer and then run sniffer -x--spec-color -x--with-spec or sniffer -x--with-progressive.

To get a debugger to run from within a test, pip install nose2 and then run nose2. If you pdb.set_trace() in a test, the debugger will open.

Tools and libraries of interest

  • Bokeh is a library for interactive visualizations.
  • Rodeo is a data science IDE for Python.
  • Seaborn is a high level Python visualization library based on matplotlib.
  • Plot.ly is a web application for making charts, with a Python API.
  • Records for accessing data in MySQL
  • ftfy for fixing text encoding

Installing Recent Python on OS X

I used to use these instructions to set up pyenv, but pyenv can cause some problems. Handling Python dependency requirements is difficult enough already that I don’t want any extra complexity on top of pip (the package manager) and virtualenv (separate environments with different installed dependencies).

So here is what I’m doing now:

  • Install Python 2.x.x from Homebrew to get an up-to-date version: brew install python --enable-framework (the --enable-framework is an attempt to get around this numpy issue, but it doesn’t appear to work)
  • Install virtualenv with brew install virtualenv.
  • Create a folder called ~/.virtualenvs to store all my virtualenvs (each one is a subfolder).

    It doesn’t really matter what I call this, but putting it in ~/ (my user’s home folder) is convenient and using a .something folder will hide it in Finder.

  • Add the following to my ~/.zshrc file (or ~/.bashrc if I didn’t use zsh):

    # pip should only run if there is a virtualenv currently activated
    export PIP_REQUIRE_VIRTUALENV=true

    gpip(){
        PIP_REQUIRE_VIRTUALENV="" pip "$@"
    }

    pyenv() {
        source ~/.virtualenvs/$@/bin/activate
    }

    newpyenv() {
        cd ~/.virtualenvs/ && virtualenv $@ && cd - && source ~/.virtualenvs/$@/bin/activate
    }
    • This will prevent pip install ... from working if I’m not in a virtualenv, which prevents me from accidentally installing anything globally (I do this all the time).
    • To get around this restriction, use gpip install ...
    • To activate a virtualenv named gorbypuff, run pyenv gorbypuff
    • To create this virtualenv and activate it, run newpyenv gorbypuff
    • Note that I use pyenv and newpyenv for these function names – and this is not the same as the pyenv that I mentioned ditching at the beginning of this section. I just find it easy to remember pyenv so that’s what I use.
  • Set up virtualenv using system Python so numpy will work: virtualenv -p /usr/bin/python2.7 ~/.virtualenvs/data-python2.7sys (this doesn’t seem to work either)

  • Update pip: pip install --upgrade pip

  • Install the scipy stack:

    pip install numpy scipy matplotlib ipython jupyter pandas sympy nose
  • Install some additional packages I often use:

    pip install nose-progressive pdbpp colored-traceback awesome_print

Resources

Thanks

The following people have helped with this:

Suggestions? Corrections?

Please contact me.