Python and statistical analysis
Python is a programming language that is free to install, and is available for Windows, Linux, Mac, and other operating systems. It is increasingly used for data analysis and statistical procedures.
Base Python is a general purpose programming language, and so isn’t designed specifically for conducting statistical analysis. Luckily there are a variety of packages and libraries designed for handling data and conducting statistical analyses.
The package pandas can be used to create and manipulate data frames. NumPy is used to perform mathematical operations on arrays and matrices. SciPy conducts common statistical analyses, and statsmodels fits common statistical models using pandas data frames. There are many other packages available for specific purposes.
Getting Started with Python
It can be a little cumbersome for a beginner to get Python installed, install an Integrated Development Environment (IDE), and install the libraries you want to use.
One approach is to install a software package that includes an IDE and the packages and libraries you are likely to need. Anaconda and WinPython are two software packages that are relatively easy to install and allow the user to get started quickly.
Portable installation
WinPython allows for portable installation on Windows machines. This means that the entirety of the software can be installed in a specified folder or on an external USB drive. At the time of writing, an installation of WinPython used about 5 gigabytes of space.
Using Python online
There are websites on which you can run Python in an online environment. Unfortunately, at the time of writing, many do not allow for the importation of packages that are used for statistical analysis.
One site I’ve found that is easy to use and allows for the easy importation of common packages is replit (replit.com/languages/python3). At the time of writing, it appears that you do need to create an account to create and run Python code. Common packages can be imported with a simple import call in the code.
Integrated Development Environments
Most users will want to run Python scripts using an integrated development environment (IDE). There are a variety of IDEs available for use with Python. IDEs supply the user with an interface where it is easy to write, debug, and run code, and to view the results.
There are several common IDEs used with Python, and there are strong supporters of each. Common IDEs include Spyder, PyCharm, VS Code, and RStudio.
Using the Spyder environment
In my opinion, Spyder is a good choice for beginners. Conveniently, it is included with both Anaconda and WinPython.
In Spyder, program code can be written and edited in the upper left Editor pane. Code in the Editor pane can be run in small chunks by selecting the code and using the Run selected button. Code can be saved as a .py file, and those files can subsequently be opened with Spyder.
Code can be pasted into the Editor pane from a text file or website, and .py files can be opened with a plain text editor.
The results are reported in the Console pane on the lower right.
The upper right pane has a small sub menu, at the bottom of that pane, that allows the user to choose if the pane displays Help, Variable Explorer, Plots, or Files. You may need to select Plots to see the plots produced by the Python script. And you may need to enable displaying plots with View > Panes > Plots.
Plots can be saved as .png files, but you may want to use code to output plots as a .pdf file for better resolution.
Packages and Libraries
The terms “package” and “library” in Python can be confusing. In Python lingo, a package is a collection of modules—and modules are a collection of functions and associated items. A library is just a collection of packages. For example, this book will use the package stats, which is contained in the SciPy library.
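As a small illustration of this terminology, the following sketch imports the stats package from the SciPy library and calls one function from it. The values are made up for the example, and describe is used only because it is a simple function to call.

```python
import scipy.stats as stats   # the stats package, within the SciPy library

# Made-up values, for illustration only
Values = [2, 4, 4, 5, 7]

# Functions in a package are reached with dot notation: package.function()
Result = stats.describe(Values)

print(Result.nobs)   # number of observations
print(Result.mean)   # mean of the values
```

The name stats here is just the alias chosen in the import call. Any alias could be used, as long as the same name is used in the later calls.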
Package installation and updating
If packages are already installed, they can be imported simply with calls in the Python code like:
import pandas as pd
import scipy.stats as stats
The installation of additional packages and libraries will depend on how Python and the IDE were installed.
WinPython
For WinPython, to install additional packages you can use WinPython Command Prompt.exe in the installed WinPython folder, and call:
pip install packagename
where packagename is the name of the package you wish to install.
You can also update included packages using WinPython Command Prompt.exe:
pip install --upgrade packagename
Anaconda
Anaconda has its own package installation call, which is run from the terminal. On Windows, the terminal may be called Anaconda Prompt, and may be found in the Programs > Anaconda folder.
conda install packagename
And you can also use pip install.
Other installations
For other installations in Windows, the Windows terminal is used.
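Before installing anything, it can be handy to check which packages are already available. The following is a minimal sketch using importlib from the Python standard library. The function name is_installed and the list of package names are just choices for this example.

```python
import importlib.util

def is_installed(name):
    """Return True if a top-level package with this name can be found."""
    # find_spec returns None when the package cannot be located
    return importlib.util.find_spec(name) is not None

# Package names to check; edit this list as needed
for name in ["pandas", "scipy", "statsmodels"]:
    if is_installed(name):
        print(name, "is installed")
    else:
        print(name, "is NOT installed")
```

Note that this checks the name used in the import call, which for some packages differs from the name used with pip install.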
Package versions
The Python version can be displayed with the sys module. The output shown here is truncated to the version number.
import sys
print(sys.version)
3.12.4
Many packages have the __version__ attribute, which contains the version of the package.
import pandas as pd
print(pd.__version__)
2.2.2
import scipy
print(scipy.__version__)
1.13.1
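Not every package has the __version__ attribute. As an alternative sketch, the standard library can report the version of an installed package, and the platform module returns the Python version number by itself. The function name get_version is just a choice for this example.

```python
import platform
from importlib.metadata import version, PackageNotFoundError

# Python version number only, as a string
print(platform.python_version())

def get_version(name):
    """Return the installed version of a package, or None if not found."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None

for name in ["pandas", "scipy", "statsmodels"]:
    print(name, get_version(name))
```

Here the name is the one used when installing the package, which is usually, but not always, the same as the name used in the import call.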
Inputting Data and Using Installed Packages
The following examples will show options for inputting data in Python and will use some common packages.
For these examples, your Python installation should have access to io, pandas, and SciPy.
Inputting data as lists and creating a pandas data frame
This example creates variables Score and Student as lists and then combines them into a pandas data frame called Data.
pandas .describe() is used to summarize the data frame by Student.
A Kruskal–Wallis test is then conducted with SciPy stats. At this point, you don’t need to worry about why you might conduct a Kruskal–Wallis test or what the results mean. The point is just to test that you can use the SciPy library, and see how data in a data frame can be manipulated to pass to the kruskal function as an example.
Remember to run only the code (and comments). The lines that follow each print call are the (truncated) output Python should produce. Lines beginning with ### are comments that can be run, but don’t produce output.
import pandas as pd
Score = [10, 9, 8, 7, 7, 8, 9, 10, 6, 5, 4, 9, 10, 9, 10]
Student = ["Bugs", "Bugs", "Bugs", "Bugs","Bugs",
"Daffy", "Daffy", "Daffy", "Daffy", "Daffy",
"Taz", "Taz", "Taz", "Taz", "Taz"]
Data = pd.DataFrame(
{'Student': Student,
'Score': Score})
print(Data)
Student Score
0 Bugs 10
1 Bugs 9
2 Bugs 8
3 Bugs 7
4 Bugs 7
5 Daffy 8
6 Daffy 9
7 Daffy 10
8 Daffy 6
9 Daffy 5
10 Taz 4
11 Taz 9
12 Taz 10
13 Taz 9
14 Taz 10
print(Data.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Student 15 non-null object
1 Score 15 non-null int64
Summary = Data.groupby('Student').describe()
print(Summary)
Score
count mean std min 25% 50% 75% max
Student
Bugs 5.0 8.2 1.303840 7.0 7.0 8.0 9.0 10.0
Daffy 5.0 7.6 2.073644 5.0 6.0 8.0 9.0 10.0
Taz 5.0 8.4 2.509980 4.0 9.0 9.0 10.0 10.0
### Define BugsScore as the Score for Bugs, and so on
BugsScore = Data['Score'][Data['Student']=="Bugs"]
DaffyScore = Data['Score'][Data['Student']=="Daffy"]
TazScore = Data['Score'][Data['Student']=="Taz"]
print(BugsScore)
0 10
1 9
2 8
3 7
4 7
### Conduct Kruskal-Wallis test
from scipy.stats import kruskal
ChiSquare, pValue = kruskal(BugsScore, DaffyScore, TazScore)
print("Chi-square:", round(ChiSquare, 4))
print("p-value:", round(pValue, 4))
Chi-square: 0.8483
p-value: 0.6543
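With more than a few groups, defining a separate variable for each group gets tedious. As an alternative sketch, the groups can be collected with groupby and passed to kruskal all at once. This uses the same data as above, and should reproduce the same test statistic and p-value. The name Groups is just a choice for this example.

```python
import pandas as pd
from scipy.stats import kruskal

Score = [10, 9, 8, 7, 7, 8, 9, 10, 6, 5, 4, 9, 10, 9, 10]
Student = ["Bugs"] * 5 + ["Daffy"] * 5 + ["Taz"] * 5

Data = pd.DataFrame({'Student': Student, 'Score': Score})

### Collect one set of scores per student
Groups = [group for name, group in Data.groupby('Student')['Score']]

### The * passes each item in the list as a separate argument
ChiSquare, pValue = kruskal(*Groups)

print("Chi-square:", round(ChiSquare, 4))
print("p-value:", round(pValue, 4))
```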
Inputting data as data lines and creating a pandas data frame
The following example reads a data frame from a text object using io. The result is a pandas data frame.
import io
import pandas as pd
TwoCats = pd.read_table(sep="\\s+", filepath_or_buffer=io.StringIO("""
Cat Score
"Leo Verdura" 8
"Leo Verdura" 8
"Leo Verdura" 8
"Leo Verdura" 8
"Leo Verdura" 8
"Leo Verdura" 8
"Katya" 10
"Katya" 9
"Katya" 9
"Katya" 8
"Katya" 7
"Katya" 7
"""))
print(TwoCats.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Cat 12 non-null object
1 Score 12 non-null int64
Summary = TwoCats.groupby('Cat').describe()
print(Summary)
Score
count mean std min 25% 50% 75% max
Cat
Katya 6.0 8.333333 1.21106 7.0 7.25 8.5 9.0 10.0
Leo Verdura 6.0 8.000000 0.00000 8.0 8.00 8.0 8.0 8.0
Importing data as a .csv file
The following example reads a .csv file from the internet into a data frame, TwoTowns. These are hypothetical data.
You can also read .csv files from a local file on your computer.
import pandas as pd
TwoTowns = pd.read_csv('http://rcompanion.org/documents/TwoTowns.csv')
print(TwoTowns)
Town Income
0 Town.A 46284
1 Town.A 54467
2 Town.A 44532
3 Town.A 41384
4 Town.A 39621
.. ... ...
197 Town.B 99372
198 Town.B 108216
199 Town.B 38107
200 Town.B 40726
201 Town.B 33557
print(TwoTowns.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Town 202 non-null object
1 Income 202 non-null int64
Summary = TwoTowns.groupby('Town').describe()
print(Summary)
Income ...
count mean std ... 50% 75% max
Town ...
Town.A 101.0 48146.425743 10851.668480 ... 48007.0 56421.0 77774.0
Town.B 101.0 115275.217822 163878.164535 ... 47223.0 108216.0 880027.0
Because the output is a little difficult to read, and some columns are omitted, we can clean up the output.
One option is to increase the number of columns that will be displayed, and round the values.
pd.set_option('display.max_columns', None)
print(round(Summary))
pd.reset_option('display.max_columns')
Income
count mean std min 25% 50% 75% max
Town
Town.A 101.0 48146.0 10852.0 23557.0 40971.0 48007.0 56421.0 77774.0
Town.B 101.0 115275.0 163878.0 29050.0 34142.0 47223.0 108216.0 880027.0
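The set_option and reset_option calls can also be replaced with a with block, so that the display option applies only to the code inside the block. The following is a minimal sketch using a small stand-in data frame, not the full TwoTowns data. The display.width value of 200 is an arbitrary choice.

```python
import pandas as pd

### Small stand-in data frame, for illustration only
Example = pd.DataFrame(
    {'Town': ['Town.A', 'Town.A', 'Town.A', 'Town.B', 'Town.B', 'Town.B'],
     'Income': [46284, 54467, 44532, 99372, 108216, 38107]})

Summary = Example.groupby('Town').describe()

Before = pd.get_option('display.max_columns')

### Options set here apply only inside the with block
with pd.option_context('display.max_columns', None,
                       'display.width', 200):
    print(Summary.round(0))

### Outside the block, the option is back to its previous value
After = pd.get_option('display.max_columns')
print(Before == After)
```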
For these data, another option is to convert the data frame to integer values.
SummaryInt = Summary.astype('int')
print(SummaryInt)
Income
count mean std min 25% 50% 75% max
Town
Town.A 101 48146 10851 23557 40971 48007 56421 77774
Town.B 101 115275 163878 29050 34142 47223 108216 880027
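As mentioned above, .csv files can also be read from a local file. The following sketch writes a small stand-in data frame to a .csv file in a temporary folder, and then reads it back in. The file name TwoTownsExample.csv and the use of a temporary folder are just choices for this example. In practice you would pass the path to your own file.

```python
import os
import tempfile
import pandas as pd

### Small stand-in data frame, for illustration only
Example = pd.DataFrame(
    {'Town': ['Town.A', 'Town.A', 'Town.B', 'Town.B'],
     'Income': [46284, 54467, 99372, 108216]})

FilePath = os.path.join(tempfile.gettempdir(), 'TwoTownsExample.csv')

### index=False keeps the row numbers out of the file
Example.to_csv(FilePath, index=False)

### Read the local file, in the same way as reading from a URL
Local = pd.read_csv(FilePath)

print(Local)
```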
The following example reads a .csv file from the internet into a data frame, Greenwich. This file contains river stage measurements for Greenwich, NJ.
import pandas as pd
Greenwich = pd.read_csv('http://rcompanion.org/documents/Greenwich.csv')
print(Greenwich)
Agency Station Parameter TS_ID Year Month Stage
0 USGS 1413038 65 97255 2000 4 0.297
1 USGS 1413038 65 97255 2000 5 0.219
2 USGS 1413038 65 97255 2000 6 0.055
3 USGS 1413038 65 97255 2000 7 0.260
4 USGS 1413038 65 97255 2000 8 0.293
.. ... ... ... ... ... ... ...
166 USGS 1413038 65 97255 2014 5 0.300
167 USGS 1413038 65 97255 2014 6 0.401
168 USGS 1413038 65 97255 2014 7 0.172
169 USGS 1413038 65 97255 2014 8 0.456
170 USGS 1413038 65 97255 2014 9 0.492
print(Greenwich.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Agency 171 non-null object
1 Station 171 non-null int64
2 Parameter 171 non-null int64
3 TS_ID 171 non-null int64
4 Year 171 non-null int64
5 Month 171 non-null int64
6 Stage 165 non-null float64
Note that there are only 165 non-null observations for Stage. In the .csv file, six observations for Stage were NA values. pandas reads in these values as nan, which indicates a missing value.
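To see how missing values can be counted or removed, here is a minimal sketch with a small made-up data set containing two NA values. These are not the Greenwich data, and the name Complete is just a choice for this example.

```python
import io
import pandas as pd

### Made-up data with two NA values, for illustration only
Example = pd.read_csv(io.StringIO("""Month,Stage
4,0.297
5,NA
6,0.055
7,NA
8,0.293
"""))

### Count the missing values in Stage
Missing = Example['Stage'].isna().sum()
print(Missing)

### By default, missing values are skipped when calculating the mean
print(Example['Stage'].mean())

### Keep only the rows where Stage is not missing
Complete = Example.dropna(subset=['Stage'])
print(len(Complete))
```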
We can get some summary statistics for the Stage variable.
SummaryStage = Greenwich['Stage'].describe()
print(SummaryStage)
count 165.000000
mean -0.000139
std 0.325459
min -0.806000
25% -0.265000
50% 0.037000
75% 0.263000
max 0.738000
Or we can get summary statistics by Year for Stage.
SummaryYear = Greenwich[['Year', 'Stage']].groupby('Year').describe()
print(SummaryYear)
Stage
count mean std min 25% 50% 75% max
Year
2000 9.0 0.102556 0.261075 -0.523 0.05500 0.2190 0.26100 0.297
2001 12.0 0.005417 0.245176 -0.465 -0.12350 0.0145 0.18675 0.412
2002 12.0 -0.064750 0.329312 -0.522 -0.27600 -0.1625 0.19225 0.448
2003 10.0 0.074100 0.313308 -0.534 -0.09625 0.1660 0.32750 0.391
2004 11.0 -0.093000 0.300690 -0.419 -0.35450 -0.0990 0.06850 0.432
2005 12.0 0.007917 0.305198 -0.603 -0.17175 0.1320 0.24450 0.364
2006 12.0 -0.125167 0.338153 -0.623 -0.45325 -0.0865 0.12625 0.446
2007 12.0 -0.222083 0.291447 -0.731 -0.34725 -0.2110 0.00700 0.113
2008 12.0 -0.183167 0.338389 -0.806 -0.42700 -0.0725 0.01325 0.277
2009 12.0 -0.015917 0.432449 -0.802 -0.25675 -0.0110 0.36550 0.510
2010 12.0 0.056583 0.216794 -0.323 0.00350 0.0410 0.17150 0.393
2011 6.0 0.200333 0.404347 -0.350 -0.06750 0.2170 0.46100 0.738
2012 12.0 0.194417 0.296182 -0.355 0.05975 0.2585 0.42000 0.581
2013 12.0 0.082500 0.315971 -0.330 -0.27125 0.1150 0.35025 0.536
2014 9.0 0.145000 0.309832 -0.408 -0.13200 0.1720 0.40100 0.492
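If only a few statistics are wanted, agg can be used in place of describe. The following sketch uses a small made-up data frame rather than the Greenwich data, and the choice of count, mean, and median is arbitrary.

```python
import pandas as pd

### Made-up data with one missing value, for illustration only
Example = pd.DataFrame(
    {'Year': [2000, 2000, 2000, 2001, 2001, 2001],
     'Stage': [0.297, 0.219, 0.055, 0.260, None, 0.293]})

### Request only the statistics of interest for each Year
SummaryYear = Example.groupby('Year')['Stage'].agg(['count', 'mean', 'median'])

print(SummaryYear.round(3))
```

As with describe, the count here includes only the non-missing observations.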
Exporting Plots as .png or .pdf Files
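Setting the working directory
The working directory is the folder where Python reads and writes files by default. Plots saved with just a file name will be written to this folder.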
import os
print(os.getcwd())
C:\Users\Sal Mangiafico
os.chdir("C:/Users/Sal Mangiafico/Desktop")
print(os.getcwd())
C:\Users\Sal Mangiafico\Desktop
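Instead of changing the working directory, another approach is to build the full path to the output file and pass that path when saving. The following is a minimal sketch using pathlib from the standard library. The folder name PythonOutput is hypothetical, and nothing is written to disk by this code.

```python
from pathlib import Path

### Build a path to a folder in the user's home directory
### "PythonOutput" is a hypothetical folder name
OutputFolder = Path.home() / "PythonOutput"

### Add the file name to the folder path
OutputFile = OutputFolder / "StepsRatingPlot.png"

print(OutputFile)

### To create the folder if it doesn't exist, you could run:
### OutputFolder.mkdir(exist_ok=True)
```

The resulting path can be used in place of a plain file name in calls that write files.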
Exporting plot files
import pandas as pd
Rating = pd.array([7,10,9,5,4,8,6,5,10,8,7,8,8,9,5,9,6,10,10,8,7,7,8,10,7,7])
Steps = pd.array([8000,9000,10000,7000,6000,8000,7000,5000,9000,7000,8000,8000,8000,
7000,6000,8000,7000,10000,9000,8000,8000,6000,6000,8000,7000,7000])
Data = pd.DataFrame(
{'Rating': Rating,
'Steps': Steps})
print(Data.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Rating 26 non-null Int64
1 Steps 26 non-null Int64
dtypes: Int64(2)
import seaborn as sns
import matplotlib.pyplot as plt
Set figure dimensions, plot a scatterplot with seaborn, and save figure as .png.
plt.figure(figsize=(6,4))
sns.scatterplot(data=Data, x='Steps', y='Rating')
plt.savefig('StepsRatingPlot.png', format='png', dpi=300)
plt.show()
Plot a scatterplot with seaborn, and save figure as .pdf.
sns.scatterplot(data=Data, x='Steps', y='Rating')
plt.savefig('StepsRatingPlot.pdf', format='pdf')
plt.show()
Plot a scatterplot with matplotlib, and save figure as .png.
plt.figure(figsize=(6,4))
plt.scatter(data=Data, x='Steps', y='Rating')
plt.savefig('StepsRatingPlot2.png', format='png', dpi=300)
plt.show()
Plot a scatterplot with matplotlib, and save figure as .pdf.
plt.scatter(data=Data, x='Steps', y='Rating')
plt.savefig('StepsRatingPlot2.pdf', format='pdf')
plt.show()
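The matplotlib examples can also be written by creating the figure and axes explicitly, which makes it clearer which figure is being saved. The following sketch uses the first five observations from the data above, and saves the file to a temporary folder only for illustration. The file name StepsRatingExample.pdf is a choice for this example.

```python
import os
import tempfile
import matplotlib.pyplot as plt

### First five observations from the data above
Steps = [8000, 9000, 10000, 7000, 6000]
Rating = [7, 10, 9, 5, 4]

### Create the figure and axes, then draw the scatterplot on the axes
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(Steps, Rating)
ax.set_xlabel('Steps')
ax.set_ylabel('Rating')

### The file format is taken from the file extension
PlotFile = os.path.join(tempfile.gettempdir(), 'StepsRatingExample.pdf')
fig.savefig(PlotFile, bbox_inches='tight')

### Close the figure to free memory once it has been saved
plt.close(fig)

print(os.path.exists(PlotFile))
```

The bbox_inches='tight' option trims extra white space around the plot, which may or may not be desired.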
References
io: Core tools for working with streams. docs.python.org/3/library/io.html.
Matplotlib: Visualization with Python. matplotlib.org/.
NumPy: The fundamental package for scientific computing with Python. numpy.org/.
pandas. pandas.pydata.org/.
Python Software Foundation. www.python.org/.
Python – tutorial. docs.python.org/3/tutorial/index.html.
replit. replit.com/languages/python3.
SciPy. scipy.org/.
statsmodels. www.statsmodels.org/stable/index.html.
W3 Schools – pandas. www.w3schools.com/python/pandas/default.asp.
W3 Schools – Python. www.w3schools.com/python/default.asp.
W3 Schools – SciPy. www.w3schools.com/python/scipy/index.php.