
A Python Companion to Extension Program Evaluation

Salvatore S. Mangiafico

Using Python

Python and statistical analysis

 

Python is a programming language that is free to install, and is available for Windows, Linux, Mac, and other operating systems.  It is increasingly used for data analysis and statistical procedures.

 

Base Python is a general-purpose programming language, and so isn’t designed specifically for conducting statistical analysis.  Luckily, there are a variety of packages and libraries designed for handling data and conducting statistical analyses.

 

The package pandas can be used to create and manipulate data frames.  NumPy is used to perform mathematical operations on arrays and matrices.  SciPy conducts common statistical analyses, and statsmodels fits common statistical models and analyses using pandas data frames.  There are many other packages available for specific purposes.

 

Getting Started with Python

 

It can be a little cumbersome for a beginner to get Python installed, install an Integrated Development Environment (IDE), and install the libraries you want to use.

 

One approach is to install a software distribution that includes an IDE and the packages and libraries you are likely to need.  Anaconda and WinPython are two such distributions that are relatively easy to install and allow the user to get started quickly.

 

Portable installation

 

WinPython allows for portable installation on Windows machines.  This means that the entirety of the software can be installed in a specified folder or on an external USB drive.  At the time of writing, an installation of WinPython used about 5 gigabytes of space.

 

Using Python online

 

There are websites on which you can run Python in an online environment.  Unfortunately, at the time of writing, many do not allow you to import the packages used for statistical analysis.

 

One site I’ve found that is easy to use and allows common packages to be imported easily is replit (replit.com/languages/python3).  At the time of writing, it appears that you do need to create an account to create and run Python code.  Common packages can be imported with a simple import call in the code.
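If you want to check whether the packages you need are available in an online environment, one simple approach (just a suggestion, not a requirement of any particular site) is to import them and print their versions.

import pandas as pd

import scipy

print(pd.__version__)

print(scipy.__version__)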

 

Integrated Development Environments

 

Most users will want to run Python scripts using an integrated development environment (IDE).  There are a variety of IDEs available for use with Python.  IDEs supply the user with an interface that makes it easy to write, debug, and run code, and to view the results.

 

Several IDEs are commonly used with Python, and each has its strong supporters.  These include Spyder, PyCharm, VS Code, and RStudio.

 

Using the Spyder environment

 

In my opinion, Spyder is a good choice for beginners.  Conveniently, it is included with both Anaconda and WinPython.

 

In Spyder, program code can be written and edited in the Editor pane at the upper left.  Code in the Editor pane can be run in small chunks by selecting the code and using the Run selection button.  Code can be saved as a .py file, and those files can subsequently be opened with Spyder.

 

Code can be pasted into the Editor pane from a text file or website, and .py files can be opened with a plain text editor.

 

The results are reported in the Console pane on the lower right. 

 

The upper right pane has a small submenu, at the bottom of that pane, that allows the user to choose whether the pane displays Help, Variable Explorer, Plots, or Files.  You may need to select Plots to see the plots produced by the Python script, and you may need to enable the display of plots with View > Panes > Plots.

 

Plots can be saved as .png files, but you may want to use code to output plots as a .pdf file for better resolution.

 

Packages and Libraries

 

The terms “package” and “library” in Python can be confusing.  In Python lingo, a package is a collection of modules, and a module is a collection of functions and associated items.  A library is simply a collection of packages.  For example, this book will use the package stats, which is contained in the SciPy library.
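For example, both of the following bind the stats package from the SciPy library to the name stats in your code.  The two forms are equivalent, and which you use is a matter of preference.

import scipy.stats as stats

from scipy import stats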

 

Package installation and updating

 

If packages are already installed, they can be imported simply with calls in the Python code like:

 

import pandas as pd

 

import scipy.stats as stats

 

 

The installation of additional packages and libraries will depend on how Python and the IDE were installed.

 

WinPython

 

For WinPython, to install additional packages you can use WinPython Command Prompt.exe in the installed WinPython folder, and call:

 

pip install packagename

 

where packagename is the name of the package you wish to install.

 

You can also update included packages using WinPython Command Prompt.exe:

 

pip install --upgrade packagename
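As a concrete illustration, with seaborn (a plotting package used later in this chapter) standing in for packagename:

pip install seaborn

pip install --upgrade seaborn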

 

Anaconda

 

Anaconda has its own package installation command, run from a terminal.  On Windows, this terminal may be called Anaconda Prompt, and may be found in the Programs > Anaconda folder.

 

conda install packagename

 

You can also use pip install, as described above.

 

Other installations

 

For other installations on Windows, packages are installed from the Windows terminal (Command Prompt or PowerShell).
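For example, if Python was installed on its own and is on the system path, a typical call from the Command Prompt is shown below.  Here packagename is again a placeholder for the package you want to install.

python -m pip install packagename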

 

Package versions

 

The Python version can be displayed with the sys module.

 

import sys

 

print(sys.version)

 

3.12.4

 

 

Many packages have the __version__ attribute, which contains the version of the package.

 

import pandas as pd

 

print(pd.__version__)

 

2.2.2

 

 

import scipy

 

print(scipy.__version__)

 

1.13.1

 

 

Inputting Data and Using Installed Packages

 

The following examples will show options for inputting data in Python and will use some common packages.

 

For these examples, your Python installation should have access to io, pandas, and SciPy.

 

Inputting data as lists and creating a pandas data frame

 

This example creates variables Score and Student as lists and then combines them into a pandas data frame called Data.

 

The pandas .describe() method is used to summarize the data frame by Student.

 

A Kruskal–Wallis test is then conducted with SciPy stats.  At this point, you don’t need to worry about why you might conduct a Kruskal–Wallis test or what the results mean.  The point is just to confirm that you can use the SciPy library, and to see, as an example, how data in a data frame can be manipulated and passed to the kruskal function.

 

Remember to run only the blue (and green) code.  The purple code is the (truncated) output that Python should produce.  The code in green consists of comments, which can be run but don’t produce output.

 

import pandas as pd

 

Score   = [10, 9, 8, 7, 7, 8, 9, 10, 6, 5, 4, 9, 10, 9, 10]

Student = ["Bugs", "Bugs", "Bugs", "Bugs","Bugs",

           "Daffy", "Daffy", "Daffy", "Daffy", "Daffy",

           "Taz", "Taz", "Taz", "Taz", "Taz"]

 

Data = pd.DataFrame(

    {'Student': Student,

     'Score': Score})

 

print(Data)

 

   Student  Score

0     Bugs     10

1     Bugs      9

2     Bugs      8

3     Bugs      7

4     Bugs      7

5    Daffy      8

6    Daffy      9

7    Daffy     10

8    Daffy      6

9    Daffy      5

10     Taz      4

11     Taz      9

12     Taz     10

13     Taz      9

14     Taz     10

 

 

print(Data.info())

 

#   Column   Non-Null Count  Dtype

---  ------   --------------  -----

 0   Student  15 non-null     object

 1   Score    15 non-null     int64

 

 

Summary = Data.groupby('Student').describe()

 

print(Summary)

 

        Score                                         

        count mean       std  min  25%  50%   75%   max

Student                                               

Bugs      5.0  8.2  1.303840  7.0  7.0  8.0   9.0  10.0

Daffy     5.0  7.6  2.073644  5.0  6.0  8.0   9.0  10.0

Taz       5.0  8.4  2.509980  4.0  9.0  9.0  10.0  10.0

 

 

### Define BugsScore as the Score for Bugs, and so on

 

BugsScore  = Data['Score'][Data['Student']=="Bugs"]

DaffyScore = Data['Score'][Data['Student']=="Daffy"]

TazScore   = Data['Score'][Data['Student']=="Taz"]

 

print(BugsScore)

 

0    10

1     9

2     8

3     7

4     7

 

 

### Conduct Kruskal-Wallis test

 

from scipy.stats import kruskal

 

ChiSquare, pValue = kruskal(BugsScore, DaffyScore, TazScore)

 

print("Chi-square:", round(ChiSquare, 4))

print("p-value:", round(pValue, 4))

 

Chi-square: 0.8483

p-value: 0.6543
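As an aside, if there were many groups, defining each group by hand would be tedious.  A more compact sketch, using groupby on the Data frame created above to build the list of score groups automatically, is shown below.  The results should match those above.

from scipy.stats import kruskal

### Build a list containing the Score values for each Student

Groups = [Scores for Name, Scores in Data.groupby('Student')['Score']]

ChiSquare, pValue = kruskal(*Groups)

print("Chi-square:", round(ChiSquare, 4))
print("p-value:", round(pValue, 4))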


Inputting data as data lines and creating a pandas data frame

 

The following example reads a data frame from a text object using io.  The result is a pandas data frame.

 

import io

 

import pandas as pd

 

TwoCats = pd.read_table(sep="\\s+", filepath_or_buffer=io.StringIO("""

 

Cat            Score

"Leo Verdura"   8

"Leo Verdura"   8

"Leo Verdura"   8

"Leo Verdura"   8

"Leo Verdura"   8

"Leo Verdura"   8

"Katya"        10

"Katya"         9

"Katya"         9

"Katya"         8

"Katya"         7

"Katya"         7

"""))

 

print(TwoCats.info())

 

#   Column  Non-Null Count  Dtype

---  ------  --------------  -----

 0   Cat     12 non-null     object

 1   Score   12 non-null     int64

 

 

Summary = TwoCats.groupby('Cat').describe()

 

print(Summary)

 

            Score                                             

            count      mean      std  min   25%  50%  75%   max

Cat                                                           

Katya         6.0  8.333333  1.21106  7.0  7.25  8.5  9.0  10.0

Leo Verdura   6.0  8.000000  0.00000  8.0  8.00  8.0  8.0   8.0

 

 

Importing data as a .csv file

 

The following example reads a data frame, TwoTowns, from a .csv file on the internet.  The data are hypothetical.

 

You can also read .csv files stored locally on your own computer.
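For example, a local file could be read as follows.  The path here is hypothetical; substitute the location of the file on your own computer.

import pandas as pd

### Hypothetical local path

TwoTowns = pd.read_csv('C:/Users/YourName/Desktop/TwoTowns.csv')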

 

import pandas as pd

 

TwoTowns = pd.read_csv('http://rcompanion.org/documents/TwoTowns.csv')

 

print(TwoTowns)

 

       Town  Income

0    Town.A   46284

1    Town.A   54467

2    Town.A   44532

3    Town.A   41384

4    Town.A   39621

..      ...     ...

197  Town.B   99372

198  Town.B  108216

199  Town.B   38107

200  Town.B   40726

201  Town.B   33557

 

 

print(TwoTowns.info())

 

 #   Column  Non-Null Count  Dtype

---  ------  --------------  -----

 0   Town    202 non-null    object

 1   Income  202 non-null    int64

 

 

Summary = TwoTowns.groupby('Town').describe()

 

print(Summary)

 

       Income                                ...                            

        count           mean            std  ...      50%       75%       max

Town                                         ...                            

Town.A  101.0   48146.425743   10851.668480  ...  48007.0   56421.0   77774.0

Town.B  101.0  115275.217822  163878.164535  ...  47223.0  108216.0  880027.0

 

 

Because the output is a little difficult to read, and some columns are omitted, we can clean it up.

 

One option is to increase the number of columns that will be displayed, and round the values.

 

 

pd.set_option('display.max_columns', None)

 

print(round(Summary))

 

pd.reset_option('display.max_columns')

 

       Income                                                          

        count      mean       std      min      25%      50%       75%       max  

Town                                                                    

Town.A  101.0   48146.0   10852.0  23557.0  40971.0  48007.0   56421.0   77774.0   

Town.B  101.0  115275.0  163878.0  29050.0  34142.0  47223.0  108216.0  880027.0

 

 

For these data, another option is to convert the data frame to integer values.

 

SummaryInt = Summary.astype('int')

 

print(SummaryInt)

 

       Income                                                    

        count    mean     std    min    25%    50%     75%     max

Town                                                             

Town.A    101   48146   10851  23557  40971  48007   56421   77774

Town.B    101  115275  163878  29050  34142  47223  108216  880027
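A third option, if you would rather keep the values as floating point and only change how they are printed, is pandas’ display.float_format option.  This is just another sketch of the same idea, not something this example requires.

### Print floating point values with no decimal places

pd.set_option('display.float_format', '{:.0f}'.format)

print(Summary)

pd.reset_option('display.float_format')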

 

 

The following example reads a data frame, Greenwich, from a .csv file on the internet.  This file contains river stage measurements for Greenwich, NJ.

 

import pandas as pd

 

Greenwich = pd.read_csv('http://rcompanion.org/documents/Greenwich.csv')

 

print(Greenwich)

 

    Agency  Station  Parameter  TS_ID  Year  Month  Stage

0     USGS  1413038         65  97255  2000      4  0.297

1     USGS  1413038         65  97255  2000      5  0.219

2     USGS  1413038         65  97255  2000      6  0.055

3     USGS  1413038         65  97255  2000      7  0.260

4     USGS  1413038         65  97255  2000      8  0.293

..     ...      ...        ...    ...   ...    ...    ...

166   USGS  1413038         65  97255  2014      5  0.300

167   USGS  1413038         65  97255  2014      6  0.401

168   USGS  1413038         65  97255  2014      7  0.172

169   USGS  1413038         65  97255  2014      8  0.456

170   USGS  1413038         65  97255  2014      9  0.492

 

 

print(Greenwich.info())

 

# #   Column     Non-Null Count  Dtype 

#---  ------     --------------  ----- 

# 0   Agency     171 non-null    object

# 1   Station    171 non-null    int64 

# 2   Parameter  171 non-null    int64 

# 3   TS_ID      171 non-null    int64 

# 4   Year       171 non-null    int64 

# 5   Month      171 non-null    int64 

# 6   Stage      165 non-null    float64

 

Note that there are only 165 non-null observations for Stage.  In the .csv file, six observations for Stage were NA values.  pandas reads these values in as NaN (“not a number”), which indicates a missing value.
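If you want to count the missing values, or drop the rows that contain them, pandas provides isna and dropna.  A minimal sketch, assuming the Greenwich data frame from above, follows.  It should report 6 missing values and 165 complete rows, consistent with the info() output.

### Count missing Stage values

print(Greenwich['Stage'].isna().sum())

### Keep only the rows that have a Stage value

Complete = Greenwich.dropna(subset=['Stage'])

print(len(Complete))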

 

We can get some summary statistics for the Stage variable.

 

SummaryStage = Greenwich['Stage'].describe()

 

print(SummaryStage)

 

count    165.000000

mean      -0.000139

std        0.325459

min       -0.806000

25%       -0.265000

50%        0.037000

75%        0.263000

max        0.738000

 

 

Or we can get summary statistics for Stage by Year.

 

SummaryYear = Greenwich[['Year', 'Stage']].groupby('Year').describe()

 

print(SummaryYear)

 

     Stage                                                           

     count      mean       std    min      25%     50%      75%    max

Year                                                                 

2000   9.0  0.102556  0.261075 -0.523  0.05500  0.2190  0.26100  0.297

2001  12.0  0.005417  0.245176 -0.465 -0.12350  0.0145  0.18675  0.412

2002  12.0 -0.064750  0.329312 -0.522 -0.27600 -0.1625  0.19225  0.448

2003  10.0  0.074100  0.313308 -0.534 -0.09625  0.1660  0.32750  0.391

2004  11.0 -0.093000  0.300690 -0.419 -0.35450 -0.0990  0.06850  0.432

2005  12.0  0.007917  0.305198 -0.603 -0.17175  0.1320  0.24450  0.364

2006  12.0 -0.125167  0.338153 -0.623 -0.45325 -0.0865  0.12625  0.446

2007  12.0 -0.222083  0.291447 -0.731 -0.34725 -0.2110  0.00700  0.113

2008  12.0 -0.183167  0.338389 -0.806 -0.42700 -0.0725  0.01325  0.277

2009  12.0 -0.015917  0.432449 -0.802 -0.25675 -0.0110  0.36550  0.510

2010  12.0  0.056583  0.216794 -0.323  0.00350  0.0410  0.17150  0.393

2011   6.0  0.200333  0.404347 -0.350 -0.06750  0.2170  0.46100  0.738

2012  12.0  0.194417  0.296182 -0.355  0.05975  0.2585  0.42000  0.581

2013  12.0  0.082500  0.315971 -0.330 -0.27125  0.1150  0.35025  0.536

2014   9.0  0.145000  0.309832 -0.408 -0.13200  0.1720  0.40100  0.492



Exporting Plots as .png or .pdf Files

 

Setting the working directory

 

import os

 

print(os.getcwd())

 

C:\Users\Sal Mangiafico

 

os.chdir("C:/Users/Sal Mangiafico/Desktop")

 

print(os.getcwd())

 

 

C:\Users\Sal Mangiafico\Desktop

 

Exporting plot files

 

import pandas as pd

 

Rating = pd.array([7,10,9,5,4,8,6,5,10,8,7,8,8,9,5,9,6,10,10,8,7,7,8,10,7,7])

 

Steps  = pd.array([8000,9000,10000,7000,6000,8000,7000,5000,9000,7000,8000,8000,8000,

                   7000,6000,8000,7000,10000,9000,8000,8000,6000,6000,8000,7000,7000])

 

Data = pd.DataFrame(

{'Rating': Rating,

 'Steps': Steps})

 

print(Data.info())

 

#   Column  Non-Null Count  Dtype

---  ------  --------------  -----

 0   Rating  26 non-null     Int64

 1   Steps   26 non-null     Int64

dtypes: Int64(2)

 

 

import seaborn as sns

 

import matplotlib.pyplot as plt

 

 

Set the figure dimensions, plot a scatterplot with seaborn, and save the figure as a .png file.

 

plt.figure(figsize=(6,4))

 

sns.scatterplot(data=Data, x='Steps', y='Rating')

 

plt.savefig('StepsRatingPlot.png', format='png', dpi=300)

 

plt.show()

 

 

Plot a scatterplot with seaborn, and save the figure as a .pdf file.

 

sns.scatterplot(data=Data, x='Steps', y='Rating')

 

plt.savefig('StepsRatingPlot.pdf', format='pdf')

 

plt.show()

 

 

Plot a scatterplot with matplotlib, and save the figure as a .png file.

 

plt.figure(figsize=(6,4))

 

plt.scatter(data=Data, x='Steps', y='Rating')

 

plt.savefig('StepsRatingPlot2.png', format='png', dpi=300)

 

plt.show()

 

 

Plot a scatterplot with matplotlib, and save the figure as a .pdf file.

 

plt.scatter(data=Data, x='Steps', y='Rating')

 

plt.savefig('StepsRatingPlot2.pdf', format='pdf')

 

plt.show()
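If you want the same figure in both formats, a figure can be saved more than once before it is shown.  A minimal sketch with matplotlib follows; the file names here are just placeholders.

plt.figure(figsize=(6,4))

plt.scatter(data=Data, x='Steps', y='Rating')

plt.savefig('StepsRatingPlot3.png', format='png', dpi=300)

plt.savefig('StepsRatingPlot3.pdf', format='pdf')

plt.show()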

 

 

References

 

io: Core tools for working with streams. docs.python.org/3/library/io.html.

 

Matplotlib: Visualization with Python. matplotlib.org/.

 

NumPy: The fundamental package for scientific computing with Python. numpy.org/.

 

pandas. pandas.pydata.org/.

 

Python Software Foundation. www.python.org/.

 

Python – tutorial. docs.python.org/3/tutorial/index.html.

 

replit. replit.com/languages/python3.

 

SciPy. scipy.org/.

 

statsmodels. www.statsmodels.org/stable/index.html.

 

W3 Schools – pandas. www.w3schools.com/python/pandas/default.asp.

 

W3 Schools – Python. www.w3schools.com/python/default.asp.

 

W3 Schools – SciPy. www.w3schools.com/python/scipy/index.php.