Python Handbook: Using Python

Python and statistical analysis

Python is a programming language that is free to install, and is available for Windows, Linux, Mac, and other operating systems. It is increasingly used for data analysis and statistical procedures.

Base Python is a general-purpose programming language, and so isn’t designed specifically for conducting statistical analysis. Luckily, there is a variety of packages and libraries designed for handling data and conducting statistical analyses.

The package pandas can be used to create and manipulate data frames. NumPy is used to perform mathematical operations on arrays and matrices. SciPy conducts common statistical analyses, and statsmodels fits common statistical models and analyses using pandas data frames. There are many other packages available for specific purposes.
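As a rough illustration of how these packages divide the work, the following sketch uses a few made-up height values. The numbers and variable names are assumptions for illustration only, not data from this book. NumPy does the array arithmetic, pandas holds the values in a data frame, and SciPy runs a simple test.

```python
import numpy as np
import pandas as pd
from scipy import stats

# NumPy: element-wise arithmetic on an array (hypothetical values)
Heights = np.array([150.0, 160.0, 170.0, 180.0])
HeightsM = Heights / 100   # convert centimeters to meters

# pandas: hold the values in a data frame with named columns
Data = pd.DataFrame({'Height_cm': Heights, 'Height_m': HeightsM})
print(Data)

# SciPy: one-sample t-test of the mean height against 165 cm
Result = stats.ttest_1samp(Data['Height_cm'], popmean=165)
print("p-value:", round(Result.pvalue, 4))
```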

Getting Started with Python

It can be a little cumbersome for a beginner to get Python installed, install an Integrated Development Environment (IDE), and install the libraries you want to use.

One approach is to install a software package that includes an IDE and the packages and libraries you are likely to need. Anaconda and WinPython are two software packages that are relatively easy to install and allow the user to get started quickly.

Portable installation

WinPython allows for portable installation on Windows machines. This means that the entirety of the software can be installed in a specified folder or on an external USB drive. At the time of writing, an installation of WinPython used about 5 gigabytes of space.

Using Python online

There are websites on which you can run Python in an online environment. Unfortunately, at the time of writing, many do not allow for the importation of packages that are used for statistical analysis.

One site I’ve found that is easy to use and allows for the easy importation of common packages is replit (replit.com/languages/python3). At the time of writing, it appears that you do need to create an account to create and run Python code. Common packages can be imported with a simple import call in the code.

Integrated Development Environments

Most users will want to run Python scripts using an integrated development environment (IDE). There are a variety of IDEs available for use with Python. IDEs supply the user with an interface where it is easy to write, debug, and run code, and to view the results.

There are several common IDEs used with Python, and there are strong supporters of each. Common IDEs include Spyder, PyCharm, VS Code, and RStudio.

Using the Spyder environment

In my opinion, Spyder is a good choice for beginners. Conveniently, it is included with both Anaconda and WinPython.

In Spyder, program code can be written and edited in the upper left Editor pane. Code in the Editor pane can be run in small chunks by selecting the code and using the Run selected button. Code can be saved as a .py file, and those files can subsequently be opened with Spyder.

Code can be pasted into the Editor pane from a text file or website, and .py files can be opened with a plain text editor.

The results are reported in the Console pane on the lower right.

The upper right pane has a small sub menu, at the bottom of that pane, that allows the user to choose whether the pane displays Help, Variable Explorer, Plots, or Files. You may need to select Plots to see the plots produced by the Python script. You may also need to enable the Plots pane with View > Panes > Plots.

Plots can be saved as .png files, but you may want to use code to output plots as a .pdf file for better resolution.

Packages and Libraries

The terms “package” and “library” in Python can be confusing. In Python lingo, a package is a collection of modules—and modules are a collection of functions and associated items. A library is just a collection of packages. For example, this book will use the package stats, which is contained in the SciPy library.
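The relationship between a library, a package, and a function can be inspected from Python itself. The following is a small sketch, assuming SciPy is installed. It only prints names and makes no claim about how SciPy is organized internally.

```python
import scipy
import scipy.stats as stats
from scipy.stats import kruskal

# scipy is the top-level library; it has a __path__ attribute
# because it contains other packages and modules
print(scipy.__name__, hasattr(scipy, '__path__'))

# scipy.stats is a package inside the scipy library
print(stats.__name__)

# kruskal is one of the functions provided by scipy.stats
print(callable(kruskal))
```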

Package installation and updating

If packages are already installed, they can be imported simply with calls in the Python code like:

import pandas as pd

import scipy.stats as stats

The installation of additional packages and libraries will depend on how Python and the IDE were installed.

WinPython

For WinPython, to install additional packages you can use WinPython Command Prompt.exe in the installed WinPython folder, and call:

pip install packagename

where packagename is the name of the package you wish to install.

You can also update included packages using WinPython Command Prompt.exe:

pip install --upgrade packagename

Anaconda

Anaconda has its own package installation command, conda, which is run from the terminal. On Windows, this terminal may be called Anaconda Prompt, and may be found in the Programs > Anaconda folder.

conda install packagename

You can also use pip install from the same terminal.

Other installations

For other installations in Windows, the Windows terminal is used.
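Before installing anything, it can help to check from within Python whether a package is already available. The following is a minimal sketch using the standard library. The helper name is_installed and the deliberately fake package name are assumptions for illustration.

```python
import importlib.util
import sys

def is_installed(name):
    """Return True if a top-level package called name can be imported."""
    return importlib.util.find_spec(name) is not None

# The last name is a made-up package that should not exist
for packagename in ['pandas', 'scipy', 'not_a_real_package_xyz']:
    print(packagename, is_installed(packagename))

# The Python interpreter currently running; packages must be
# installed for this interpreter to be importable here
print(sys.executable)
```

If a package is reported as missing, one common way to install it for a specific interpreter is to run python -m pip install packagename in the terminal, using the interpreter shown by sys.executable.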

Package versions

The Python version can be displayed with the sys module.

import sys

print(sys.version)

3.12.4

Many packages have the __version__ attribute, which contains the version of the package.

import pandas as pd

print(pd.__version__)

2.2.2

import scipy

print(scipy.__version__)

1.13.1
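Not every package defines the __version__ attribute. As a hedged alternative, the standard library module importlib.metadata (Python 3.8 and later) can look up the installed version by package name. The helper name installed_version and the fake package name are assumptions for illustration.

```python
from importlib.metadata import version, PackageNotFoundError

def installed_version(name):
    """Return the installed version of a package as a string, or None if not found."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None

print(installed_version('pandas'))

# A made-up package name, to show the result when nothing is installed
print(installed_version('not_a_real_package_xyz'))
```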

Inputting Data and Using Installed Packages

The following examples will show options for inputting data in Python and will use some common packages.

For these examples, your Python installation should have access to io, pandas, and SciPy.

Inputting data as lists and creating a pandas data frame

This example creates variables Score and Student as lists and then combines them into a pandas data frame called Data.

pandas .describe() is used to summarize the data frame by Student.

A Kruskal–Wallis test is then conducted with SciPy stats. At this point, you don’t need to worry about why you might conduct a Kruskal–Wallis test or what the results mean. The point is just to test that you can use the SciPy library, and see how data in a data frame can be manipulated to pass to the kruskal function as an example.

Remember to run only the blue (and green) code. The purple code is the (truncated) output Python should produce. Code in green are comments that can be run, but don’t produce output.

import pandas as pd

Score = [10, 9, 8, 7, 7, 8, 9, 10, 6, 5, 4, 9, 10, 9, 10]

Student = ["Bugs", "Bugs", "Bugs", "Bugs","Bugs",

"Daffy", "Daffy", "Daffy", "Daffy", "Daffy",

"Taz", "Taz", "Taz", "Taz", "Taz"]

Data = pd.DataFrame(

{'Student': Student,

'Score': Score})

print(Data)

Student Score

0 Bugs 10

1 Bugs 9

2 Bugs 8

3 Bugs 7

4 Bugs 7

5 Daffy 8

6 Daffy 9

7 Daffy 10

8 Daffy 6

9 Daffy 5

10 Taz 4

11 Taz 9

12 Taz 10

13 Taz 9

14 Taz 10

print(Data.info())

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Student 15 non-null object

1 Score 15 non-null int64

Summary = Data.groupby('Student').describe()

print(Summary)

Score

count mean std min 25% 50% 75% max

Student

Bugs 5.0 8.2 1.303840 7.0 7.0 8.0 9.0 10.0

Daffy 5.0 7.6 2.073644 5.0 6.0 8.0 9.0 10.0

Taz 5.0 8.4 2.509980 4.0 9.0 9.0 10.0 10.0

### Define BugsScore as the Score for Bugs, and so on

BugsScore = Data['Score'][Data['Student']=="Bugs"]

DaffyScore = Data['Score'][Data['Student']=="Daffy"]

TazScore = Data['Score'][Data['Student']=="Taz"]

print(BugsScore)

0 10

1 9

2 8

3 7

4 7

### Conduct Kruskal-Wallis test

from scipy.stats import kruskal

ChiSquare, pValue = kruskal(BugsScore, DaffyScore, TazScore)

print("Chi-square:", round(ChiSquare, 4))

print("p-value:", round(pValue, 4))

Chi-square: 0.8483

p-value: 0.6543
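As an alternative to creating one variable per student, the groups can be pulled out of the data frame with groupby and passed to kruskal all at once. This is a sketch of the same test on the same data. The names Groups and H are assumptions, and the statistic is labeled H here because that is the usual name for the Kruskal–Wallis statistic.

```python
import pandas as pd
from scipy.stats import kruskal

Score = [10, 9, 8, 7, 7, 8, 9, 10, 6, 5, 4, 9, 10, 9, 10]
Student = ["Bugs"] * 5 + ["Daffy"] * 5 + ["Taz"] * 5

Data = pd.DataFrame({'Student': Student, 'Score': Score})

# One Series of scores per student, in the order groupby returns them
Groups = [Scores for Name, Scores in Data.groupby('Student')['Score']]

# The * unpacks the list so each group is passed as a separate argument
H, pValue = kruskal(*Groups)

print("H statistic:", round(H, 4))
print("p-value:", round(pValue, 4))
```

This approach does not need to change if more students are added to the data frame.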

Inputting data as data lines and creating a pandas data frame

The following example reads a data frame from a text object using io. The result is a pandas data frame.

import io

import pandas as pd

TwoCats = pd.read_table(sep="\\s+", filepath_or_buffer=io.StringIO("""

Cat Score

"Leo Verdura" 8

"Katya" 10

"Katya" 9

"Katya" 8

"Katya" 7

"""))

print(TwoCats.info())

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Cat 12 non-null object

1 Score 12 non-null int64

Summary = TwoCats.groupby('Cat').describe()

print(Summary)

Score

count mean std min 25% 50% 75% max

Cat

Katya 6.0 8.333333 1.21106 7.0 7.25 8.5 9.0 10.0

Leo Verdura 6.0 8.000000 0.00000 8.0 8.00 8.0 8.0 8.0
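The same technique can be sketched with pd.read_csv, which accepts the same sep argument. The data below are hypothetical and the names Text and Pets are assumptions. The example shows that values in double quotes may contain spaces even though whitespace separates the columns.

```python
import io
import pandas as pd

# Hypothetical data lines; quoted names may contain spaces
Text = """
Pet            Weight
"Mr Whiskers"  4.2
"Rex"          9.5
"Rex"          9.1
"Mr Whiskers"  4.4
"""

# r"\s+" is a regular expression meaning "one or more whitespace characters"
Pets = pd.read_csv(io.StringIO(Text), sep=r"\s+")

print(Pets)
print(Pets.groupby('Pet')['Weight'].mean())
```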

Importing data as a .csv file

The following example reads a .csv file from the internet into a data frame, TwoTowns. This is hypothetical data.

You can also read .csv files from a local file on your computer.
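Reading a local file works the same way, with a file path in place of the URL. As a self-contained sketch, the following writes a small hypothetical data frame to a temporary folder and reads it back. The file name Example.csv and the values are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

# Small hypothetical data frame to write to disk
Example = pd.DataFrame({'Town': ['Town.A', 'Town.A', 'Town.B'],
                        'Income': [41000, 52000, 47000]})

with tempfile.TemporaryDirectory() as Folder:
    FilePath = os.path.join(Folder, 'Example.csv')

    # index=False keeps the row numbers out of the file
    Example.to_csv(FilePath, index=False)

    # Read the local file by passing its path
    FromFile = pd.read_csv(FilePath)

print(FromFile)
```

The TwoTowns example, read from the internet, follows.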

import pandas as pd

TwoTowns = pd.read_csv('http://rcompanion.org/documents/TwoTowns.csv')

print(TwoTowns)

Town Income

0 Town.A 46284

1 Town.A 54467

2 Town.A 44532

3 Town.A 41384

4 Town.A 39621

.. ... ...

197 Town.B 99372

198 Town.B 108216

199 Town.B 38107

200 Town.B 40726

201 Town.B 33557

print(TwoTowns.info())

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Town 202 non-null object

1 Income 202 non-null int64

Summary = TwoTowns.groupby('Town').describe()

print(Summary)

Income ...

count mean std ... 50% 75% max

Town ...

Town.A 101.0 48146.425743 10851.668480 ... 48007.0 56421.0 77774.0

Town.B 101.0 115275.217822 163878.164535 ... 47223.0 108216.0 880027.0

Because the output is a little difficult to read, and some columns are omitted, we can clean up the output.

One option is to increase the number of columns that will be displayed, and round the values.

pd.set_option('display.max_columns', None)

print(round(Summary))

pd.reset_option('display.max_columns')

Income

count mean std min 25% 50% 75% max

Town

Town.A 101.0 48146.0 10852.0 23557.0 40971.0 48007.0 56421.0 77774.0

Town.B 101.0 115275.0 163878.0 29050.0 34142.0 47223.0 108216.0 880027.0

For this data, another option is to convert the data frame to integer values.

SummaryInt = Summary.astype('int')

print(SummaryInt)

Income

count mean std min 25% 50% 75% max

Town

Town.A 101 48146 10851 23557 40971 48007 56421 77774

Town.B 101 115275 163878 29050 34142 47223 108216 880027
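Note that the standard deviation for Town.A is shown as 10852 in the rounded output and 10851 in the integer output. Converting to integer drops the decimal part rather than rounding. The following sketch shows the difference on two of the values from the summary above; the name Values is an assumption.

```python
import pandas as pd

Values = pd.Series([10851.668, 48146.426])

# round() rounds to the nearest whole number, but keeps the float type
Rounded = Values.round()
print(Rounded)

# astype('int') drops the decimal part without rounding
Truncated = Values.astype('int')
print(Truncated)

# Rounding first and then converting gives rounded integers
RoundedInt = Values.round().astype('int')
print(RoundedInt)
```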

The following example reads a .csv file from the internet into a data frame, Greenwich. This file contains river stage measurements for Greenwich, NJ.

import pandas as pd

Greenwich = pd.read_csv('http://rcompanion.org/documents/Greenwich.csv')

print(Greenwich)

Agency Station Parameter TS_ID Year Month Stage

0 USGS 1413038 65 97255 2000 4 0.297

1 USGS 1413038 65 97255 2000 5 0.219

2 USGS 1413038 65 97255 2000 6 0.055

3 USGS 1413038 65 97255 2000 7 0.260

4 USGS 1413038 65 97255 2000 8 0.293

.. ... ... ... ... ... ... ...

166 USGS 1413038 65 97255 2014 5 0.300

167 USGS 1413038 65 97255 2014 6 0.401

168 USGS 1413038 65 97255 2014 7 0.172

169 USGS 1413038 65 97255 2014 8 0.456

170 USGS 1413038 65 97255 2014 9 0.492

print(Greenwich.info())

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Agency 171 non-null object

1 Station 171 non-null int64

2 Parameter 171 non-null int64

3 TS_ID 171 non-null int64

4 Year 171 non-null int64

5 Month 171 non-null int64

6 Stage 165 non-null float64

Note that there are only 165 non-null observations for Stage. In the .csv file, six observations for Stage were NA values. pandas reads these values in as NaN, which indicates a missing value.
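Missing values can also be counted directly. The following sketch uses a few hypothetical data lines rather than the Greenwich file; the names Text and River are assumptions. By default, pandas treats the text NA in a .csv file as a missing value, and summary functions such as count and mean skip missing values.

```python
import io
import pandas as pd

# Hypothetical data with two NA values in the Stage column
Text = """Month,Stage
1,0.25
2,NA
3,-0.10
4,NA
"""

River = pd.read_csv(io.StringIO(Text))

# Number of missing values
print(River['Stage'].isna().sum())

# Number of non-missing values
print(River['Stage'].count())

# The mean is computed from the non-missing values only
print(River['Stage'].mean())
```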

We can get some summary statistics for the Stage variable.

SummaryStage = Greenwich['Stage'].describe()

print(SummaryStage)

count 165.000000

mean -0.000139

std 0.325459

min -0.806000

25% -0.265000

50% 0.037000

75% 0.263000

max 0.738000

Or we can get summary statistics by Year for Stage.

SummaryYear = Greenwich[['Year', 'Stage']].groupby('Year').describe()

print(SummaryYear)

Stage

count mean std min 25% 50% 75% max

Year

2000 9.0 0.102556 0.261075 -0.523 0.05500 0.2190 0.26100 0.297

2001 12.0 0.005417 0.245176 -0.465 -0.12350 0.0145 0.18675 0.412

2002 12.0 -0.064750 0.329312 -0.522 -0.27600 -0.1625 0.19225 0.448

2003 10.0 0.074100 0.313308 -0.534 -0.09625 0.1660 0.32750 0.391

2004 11.0 -0.093000 0.300690 -0.419 -0.35450 -0.0990 0.06850 0.432

2005 12.0 0.007917 0.305198 -0.603 -0.17175 0.1320 0.24450 0.364

2006 12.0 -0.125167 0.338153 -0.623 -0.45325 -0.0865 0.12625 0.446

2007 12.0 -0.222083 0.291447 -0.731 -0.34725 -0.2110 0.00700 0.113

2008 12.0 -0.183167 0.338389 -0.806 -0.42700 -0.0725 0.01325 0.277

2009 12.0 -0.015917 0.432449 -0.802 -0.25675 -0.0110 0.36550 0.510

2010 12.0 0.056583 0.216794 -0.323 0.00350 0.0410 0.17150 0.393

2011 6.0 0.200333 0.404347 -0.350 -0.06750 0.2170 0.46100 0.738

2012 12.0 0.194417 0.296182 -0.355 0.05975 0.2585 0.42000 0.581

2013 12.0 0.082500 0.315971 -0.330 -0.27125 0.1150 0.35025 0.536

2014 9.0 0.145000 0.309832 -0.408 -0.13200 0.1720 0.40100 0.492
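If only a few statistics are wanted, the agg method can be used in place of describe to produce a narrower table. This is a sketch on hypothetical data, not the Greenwich file; the names River and SummaryYear and the choice of statistics are assumptions.

```python
import pandas as pd

# Hypothetical data; None becomes a missing value (NaN)
River = pd.DataFrame({'Year':  [2000, 2000, 2001, 2001, 2001],
                      'Stage': [0.2, 0.4, -0.1, None, 0.3]})

# Request only the count, mean, and median for each year
SummaryYear = River.groupby('Year')['Stage'].agg(['count', 'mean', 'median'])

print(SummaryYear)
```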

Exporting Plots as .png or .pdf Files

Setting the working directory

import os

print(os.getcwd())

C:\Users\Sal Mangiafico

os.chdir("C:/Users/Sal Mangiafico/Desktop")

print(os.getcwd())

C:\Users\Sal Mangiafico\Desktop
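Files written with a bare file name are saved in the working directory. The following sketch shows this with a temporary folder, so that it does not depend on any particular user folder. The file name Example.txt is an assumption for illustration.

```python
import os
import tempfile

# Remember the starting directory so it can be restored
Original = os.getcwd()

with tempfile.TemporaryDirectory() as Folder:
    os.chdir(Folder)

    # A bare file name is written into the current working directory
    with open('Example.txt', 'w') as f:
        f.write('hello')

    # Full path of the file that was just written
    Saved = os.path.abspath('Example.txt')
    print(Saved)

    # Change back before the temporary folder is deleted
    os.chdir(Original)

print(os.getcwd())
```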

Exporting plot files

import pandas as pd

Rating = pd.array([7,10,9,5,4,8,6,5,10,8,7,8,8,9,5,9,6,10,10,8,7,7,8,10,7,7])

Steps = pd.array([8000,9000,10000,7000,6000,8000,7000,5000,9000,7000,8000,8000,8000,

7000,6000,8000,7000,10000,9000,8000,8000,6000,6000,8000,7000,7000])

Data = pd.DataFrame(

{'Rating': Rating,

'Steps': Steps})

print(Data.info())

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Rating 26 non-null Int64

1 Steps 26 non-null Int64

dtypes: Int64(2)

import seaborn as sns

import matplotlib.pyplot as plt

Set figure dimensions, plot a scatterplot with seaborn, and save figure as .png.

plt.figure(figsize=(6,4))

sns.scatterplot(data=Data, x='Steps', y='Rating')

plt.savefig('StepsRatingPlot.png', format='png', dpi=300)

plt.show()

Plot a scatterplot with seaborn, and save figure as .pdf.

sns.scatterplot(data=Data, x='Steps', y='Rating')

plt.savefig('StepsRatingPlot.pdf', format='pdf')

plt.show()

Plot a scatterplot with matplotlib, and save figure as .png.

plt.figure(figsize=(6,4))

plt.scatter(data=Data, x='Steps', y='Rating')

plt.savefig('StepsRatingPlot2.png', format='png', dpi=300)

plt.show()

Plot a scatterplot with matplotlib, and save figure as .pdf.

plt.scatter(data=Data, x='Steps', y='Rating')

plt.savefig('StepsRatingPlot2.pdf', format='pdf')

plt.show()
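Another way to save plots, sketched here as an alternative, is to keep a reference to the figure object and call its savefig method. This makes it explicit which figure is being saved. The example uses the first five values of the data above, a temporary folder, and the non-interactive Agg backend so that it runs without opening a window; the file names and these choices are assumptions.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')   # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

# First five observations from the example data
Steps = [8000, 9000, 10000, 7000, 6000]
Rating = [7, 10, 9, 5, 4]

# Create the figure and axes, then draw on the axes
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(Steps, Rating)
ax.set_xlabel('Steps')
ax.set_ylabel('Rating')

# Save the same figure in both formats
Folder = tempfile.mkdtemp()
PngFile = os.path.join(Folder, 'StepsRatingSketch.png')
PdfFile = os.path.join(Folder, 'StepsRatingSketch.pdf')

fig.savefig(PngFile, dpi=300, bbox_inches='tight')
fig.savefig(PdfFile, bbox_inches='tight')

plt.close(fig)

print(os.path.exists(PngFile), os.path.exists(PdfFile))
```

In the examples above, savefig is called before plt.show(). Keeping that order is a safe habit, since with some setups the figure is cleared once it has been shown.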

References

io: Core tools for working with streams. docs.python.org/3/library/io.html.

Matplotlib: Visualization with Python. matplotlib.org/.

NumPy: The fundamental package for scientific computing with Python. numpy.org/.

pandas. pandas.pydata.org/.

Python Software Foundation. www.python.org/.

Python – tutorial. docs.python.org/3/tutorial/index.html.

replit. replit.com/languages/python3.

SciPy. scipy.org/.

statsmodels. www.statsmodels.org/stable/index.html.

W3 Schools – pandas. www.w3schools.com/python/pandas/default.asp.

W3 Schools – Python. www.w3schools.com/python/default.asp.

W3 Schools – SciPy. www.w3schools.com/python/scipy/index.php.

A Python Companion to Extension Program Evaluation
