
A Python Companion to Extension Program Evaluation

Salvatore S. Mangiafico

Descriptive Statistics

Descriptive statistics in SAEPER

 

For a discussion of various descriptive statistics, see the corresponding chapter in Summary and Analysis of Extension Program Evaluation in R (rcompanion.org/handbook/C_02.html). 

 

Importing packages in this chapter

 

The following commands will import the required packages used in this chapter and assign them their common aliases.  You may need to install these libraries first.

 

import io

 

import numpy as np

 

import scipy.stats as stats

 

import pandas as pd

 

import scipy.stats.mstats as mstats

 

from pandas.api.types import CategoricalDtype

 

 

Descriptive statistics for interval/ratio data

 

For this example, imagine that Ren and Stimpy have each held eight workshops educating the public about water conservation at home.  They are interested in how many people showed up to the workshops.


Data = pd.read_table(sep="\\s+", filepath_or_buffer=io.StringIO("""

 

Instructor  Location  Attendees
Ren         North      7
Ren         North     22
Ren         North      6
Ren         North     15
Ren         South     12
Ren         South     13
Ren         South     14
Ren         South     16
Stimpy      North     18
Stimpy      North     17
Stimpy      North     15
Stimpy      North      9
Stimpy      South     15
Stimpy      South     11
Stimpy      South     19
Stimpy      South     23

"""))

 

### Convert Instructor and Location to category type

 

Data['Instructor'] = Data['Instructor'].astype('category')

 

Data['Location']   = Data['Location'].astype('category')

 

 

### Display some summary statistics for the data frame

 

print(Data.info())

 

 #   Column      Non-Null Count  Dtype  

---  ------      --------------  -----  

 0   Instructor  16 non-null     category

 1   Location    16 non-null     category

 2   Attendees   16 non-null     int64

 

  

print(Data.describe())

 

       Attendees

count  16.000000

mean   14.500000

std     4.830459

min     6.000000

25%    11.750000

50%    15.000000

75%    17.250000

max    23.000000

 

 

print(Data.describe(include='category'))

 

       Instructor Location

count          16       16

unique          2        2

top           Ren    North

freq            8        8



Functions for sum and count

 

For a pandas data frame, the sum of a variable can be found with the sum function, and the number of observations can be found with the count function.

 

Because the data are housed in a data frame, we can use the convention Data['Attendees'] to access the variable Attendees within the data frame Data.

 


Data['Attendees'].sum()

 

232

 

 

Data['Attendees'].count()

 

16


Statistics of location for interval/ratio data

 

Mean

 

The mean is simply the sum of the values divided by the number of values.


Data['Attendees'].sum() / Data['Attendees'].count()

 

14.5

 

The mean function in pandas will return the mean.

 

Data['Attendees'].mean()

 

14.5

 

 

Here, with skewed income data for ten families in a town, the mean may not be the best representation of the center of the data.

 

IncomeList = [49000, 44000, 25000, 18000, 32000, 47000, 37000, 45000, 36000, 2000000]

 

Income = pd.Series(IncomeList)

 

Income.mean()

 

233300.0

 

 

Median

 

The pandas package can also calculate the median for a variable in a data frame or for an array.

 

Data['Attendees'].median()

 

15.0

 

 

For individual arrays, we can use pd.Series to define the array.

 

Attendees2 = pd.Series([7, 22, 6, 15, 12, 13, 14, 16, 18, 17, 15, 9, 15, 11, 19, 23])

 

Attendees2.median()

 

15.0

 

 

In the following example, we’ll change Stimpy’s last Attendees value from 23 to 1000 to show that this doesn’t affect the median.

 

Attendees2 = pd.Series([7, 22, 6, 15, 12, 13, 14, 16, 18, 17, 15, 9, 15, 11, 19, 1000])

 

Attendees2.median()

15.0

 

 

The median income for the town discussed above is $40,500.  Half the families in the town have an income above this amount, and half have an income below this amount.  Note that this differs considerably from the mean calculated above.

 

Income = pd.Series(

           [49000, 44000, 25000, 18000, 32000, 47000, 37000, 45000, 36000, 2000000])

 

Income.median()

 

40500.0

 

 

Mode

 

For our Ren and Stimpy example, the value 15 occurs three times in Attendees and so is the mode.


Data['Attendees'].mode()

 

0    15
dtype: int64
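
Note that mode returns a pandas Series, because a set of values can have more than one mode.  A minimal sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical data in which both 4 and 7 occur twice
Values = pd.Series([4, 4, 7, 7, 9])

print(Values.mode())

# 0    4
# 1    7
# dtype: int64
```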

 

 

Statistics of variation for interval/ratio data

 

Standard deviation

 

Data['Attendees'].std()

 

4.83045891539648

 

 

Income = pd.Series([49000, 44000, 25000, 18000, 32000, 47000, 37000, 45000, 36000, 2000000])

 

Income.std()

 

620834.5727056981

 

 

Median absolute deviation

 

The median absolute deviation for a variable in a data frame or for an array can be calculated with the stats package from the SciPy module.  The scale option is used to adjust the result.  For example, to return a result comparable to the standard deviation for normally distributed samples, use scale='normal', which is equivalent to scale=0.67449.

 

stats.median_abs_deviation(Data['Attendees'], scale='normal')

 

4.447805008228439

 

 

stats.median_abs_deviation(Data['Attendees'])

 

3.0

 

 

Income = pd.Series([49000, 44000, 25000, 18000, 32000, 47000, 37000, 45000, 36000, 2000000])

 

stats.median_abs_deviation(Income, scale='normal')

 

11119.512520571097

 

 

stats.median_abs_deviation(Income)

 

7500.0

 

 

Standard error of the mean

 

The standard error of the mean can be calculated as the standard deviation divided by the square root of the number of observations.  It can also be calculated with SciPy stats.

 

SD = Data['Attendees'].std()

 

Length = Data['Attendees'].count()

 

print(SD / np.sqrt(Length))

 

1.20761472884912

 

 

stats.sem(Data['Attendees'])

 

1.20761472884912

 

 

Five-number summary, quartiles, percentiles

 

The five-number summary, consisting of the minimum, 25th percentile, median, 75th percentile, and maximum, can be calculated with the describe function in pandas.

 

Summary = Data['Attendees'].describe()

 

print(Summary)

 

count    16.000000

mean     14.500000

std       4.830459

min       6.000000

25%      11.750000

50%      15.000000

75%      17.250000

max      23.000000

 

 

Quantiles can also be calculated with the quantile function in pandas and with the quantile function in NumPy.  The function in NumPy has more options for estimation methods for quantiles.

 

Data['Attendees'].quantile(0.25)

 

11.75

 

### Note that the default method matches the result from the default in R.

 

 

Data['Attendees'].quantile(0.25, interpolation='midpoint')

 

11.5

 

### In this case the 'midpoint' option reports the midpoint between 11 and 12,

###   the 4th and 5th observations when Attendees is sorted.

 

 

np.quantile(Data['Attendees'], 0.25)

 

11.75

 

 

np.quantile(Data['Attendees'], 0.25, method='midpoint')

 

11.5
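
To illustrate the additional estimation methods mentioned above, this sketch loops over a few of NumPy's method options for the 25th percentile of Attendees.  (The list of names here follows the NumPy documentation; 'linear' is the default.)

```python
import numpy as np

Attendees = [7, 22, 6, 15, 12, 13, 14, 16, 18, 17, 15, 9, 15, 11, 19, 23]

# Compare several estimation methods for the same quantile
for method in ['linear', 'lower', 'higher', 'nearest', 'midpoint']:
    print(method, np.quantile(Attendees, 0.25, method=method))
```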

 

 

Calculate the 95th percentile.

 

Data['Attendees'].quantile(0.95)

 

22.25

 

 

Statistics for grouped interval/ratio data

 

Continuing to examine the Ren and Stimpy data, summary statistics by group and by combination of groups can be obtained with the describe and groupby functions in pandas.

 

Summary = Data.groupby('Instructor')['Attendees'].describe()

 

print(Summary)

 

               count    mean       std  min    25%   50%    75%   max

Instructor                                                          

Ren              8.0  13.125  5.083236  6.0  10.75  13.5  15.25  22.0

Stimpy           8.0  15.875  4.454131  9.0  14.00  16.0  18.25  23.0

 

 

Summary2 = Data.groupby(['Instructor', 'Location'])['Attendees'].describe()

 

print(Summary2)

 

                        count   mean       std   min    25%   50%    75%   max

Instructor Location                                                          

Ren        North          4.0  12.50  7.505553   6.0   6.75  11.0  16.75  22.0

           South          4.0  13.75  1.707825  12.0  12.75  13.5  14.50  16.0

Stimpy     North          4.0  14.75  4.031129   9.0  13.50  16.0  17.25  18.0

           South          4.0  17.00  5.163978  11.0  14.00  17.0  20.00  23.0

 

 

Adding variables to a data frame

 

For this example, we can add the standard error of the mean to summary statistics of the grouped data.

 

Summary2['sem'] = Summary2['std'] / np.sqrt(Summary2['count'])

 

print(Summary2)

                     

                        count   mean       std  ...    75%   max       sem

Instructor Location                             ...                      

Ren        North          4.0  12.50  7.505553  ...  16.75  22.0  3.752777

           South          4.0  13.75  1.707825  ...  14.50  16.0  0.853913

Stimpy     North          4.0  14.75  4.031129  ...  17.25  18.0  2.015564

           South          4.0  17.00  5.163978  ...  20.00  23.0  2.581989

 

 

Other statistics for grouped variables

 

The groupby function in pandas can be used to calculate summary statistics not included in describe.  Here, the 95th percentile is calculated for each combination of Instructor and Location.

 

Summary3 = Data.groupby(['Instructor', 'Location'])['Attendees'].quantile(0.95)

 

print(Summary3)

 

Instructor Location          

Ren        North         20.95

           South         15.70

Stimpy     North         17.85

           South         22.40

 

 

Summary3 in this case is a pandas Series.  We can change the name of the variable for convenience.

 

print(Summary3.name)

 

Attendees

 

 

Summary3.name = '95thPercentile'

 

print(Summary3)

 

Instructor Location               

Ren        North              20.95

           South              15.70

Stimpy     North              17.85

           South              22.40

 

Name: 95thPercentile, dtype: float64

 

 

We can also convert Summary3 to a data frame.

 

Summary4 = pd.DataFrame(Summary3)

 

print(Summary4)

 

                     95thPercentile

Instructor Location               

Ren        North              20.95

           South              15.70

Stimpy     North              17.85

           South              22.40
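
The agg function offers another route to the same kind of grouped summary.  As a sketch (rebuilding the data inline so the example stands alone), several statistics can be computed at once, with the 95th percentile supplied as a lambda:

```python
import pandas as pd

# Rebuild the Ren and Stimpy data
Data = pd.DataFrame({
    'Instructor': ['Ren'] * 8 + ['Stimpy'] * 8,
    'Location':   (['North'] * 4 + ['South'] * 4) * 2,
    'Attendees':  [7, 22, 6, 15, 12, 13, 14, 16,
                   18, 17, 15, 9, 15, 11, 19, 23]
})

# Named aggregation: each keyword becomes a column in the result
Summary = Data.groupby(['Instructor', 'Location'])['Attendees'].agg(
    Median='median',
    P95=lambda x: x.quantile(0.95))

print(Summary)
```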

 

 

Summaries for data frames

 

Summary statistics for variables in a data frame can be obtained with the pandas functions info and describe.

 

print(Data.info())

 

 #   Column      Non-Null Count  Dtype  

---  ------      --------------  -----  

 0   Instructor  16 non-null     category

 1   Location    16 non-null     category

 2   Attendees   16 non-null     int64

 

  

print(Data.describe())

 

       Attendees

count  16.000000

mean   14.500000

std     4.830459

min     6.000000

25%    11.750000

50%    15.000000

75%    17.250000

max    23.000000

 

 

Dealing with missing values

 

For this example, we’ll change two values of Attendees to missing values.  Here, the missing values are coded with NA, which is the convention in R.  read_table will convert them to nan values, which is the usual convention in Python for missing values or other values that are “not a number”.

 

Note in the summary from the info call that there are 14 non-null values for Attendees.

 

Data2 = pd.read_table(sep="\\s+", filepath_or_buffer=io.StringIO("""


Instructor  Location  Attendees
Ren         North      7
Ren         North     22
Ren         North      6
Ren         North     15
Ren         South     12
Ren         South     13
Ren         South     NA
Ren         South     16
Stimpy      North     18
Stimpy      North     17
Stimpy      North     NA
Stimpy      North      9
Stimpy      South     15
Stimpy      South     11
Stimpy      South     19
Stimpy      South     23
"""))

print(Data2.info())

 

#   Column      Non-Null Count  Dtype 

---  ------      --------------  ----- 

 0   Instructor  16 non-null     object

 1   Location    16 non-null     object

 2   Attendees   14 non-null     float64

 

 

By default, many pandas functions will remove nan values.  If nan values are included in a calculation, the returned result will often be nan.

 

Data2['Attendees'].mean()

 

14.5

 

 

Data2['Attendees'].mean(skipna=False)

 

nan

 

 

Removing missing values

 

The pandas function dropna can remove rows with missing values.  Observations with no missing values for any variable are sometimes called “complete cases”.  The process is sometimes called “listwise deletion”.

 

Note here that the data frame DataComplete has 14 observations.  The two rows with nan observations in Attendees have been removed from the data frame.

 

DataComplete = Data2.dropna()

 

print(DataComplete.info())

 

 #   Column      Non-Null Count  Dtype 

---  ------      --------------  ----- 

 0   Instructor  14 non-null     object

 1   Location    14 non-null     object

 2   Attendees   14 non-null     float64
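
If we wanted listwise deletion based on only certain columns, dropna also accepts a subset option.  A small sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in two different columns
Data3 = pd.DataFrame({
    'Instructor': ['Ren', 'Ren', 'Stimpy', 'Stimpy'],
    'Attendees':  [7, np.nan, 18, 23],
    'Notes':      ['ok', 'ok', None, 'ok']
})

# Drop rows only when Attendees is missing; the missing Notes value is kept
Subset = Data3.dropna(subset=['Attendees'])

print(Subset)
```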

 

 

The dropna function can also remove any columns, or variables, that have missing values.  Note here that the variable Attendees has been removed entirely, but Instructor and Location retain 16 observations.

 

DataCompleteColumns = Data2.dropna(axis=1)

 

print(DataCompleteColumns.info())

 

Data columns (total 2 columns):

 #   Column      Non-Null Count  Dtype

---  ------      --------------  -----

 0   Instructor  16 non-null     object

 1   Location    16 non-null     object

 

 

Counting missing values

 

The isnull function, combined with values.any(), will return True if there are any missing values in a data frame or in a column of a data frame.

 

Data2.isnull().values.any()

 

True

 

 

Data2['Attendees'].isnull().values.any()

 

True

 

 

Data2['Location'].isnull().values.any()

 

False

 

 

Likewise, the isnull and sum functions can be combined to count the missing values in a data frame or variable in a data frame.

 

Data2.isnull().sum()

 

Instructor    0

Location      0

Attendees     2

 

### Number of nan values in each variable

 

 

Data2['Attendees'].isnull().sum()

 

2

 

### Number of nan values in Attendees in Data2

 

 

Data2['Location'].isnull().sum()

 

0

 

### Number of nan values in Location in Data2

 

 

Missing values in arrays

 

For a pd.Series, missing values can be designated with None.

 

Attendees2 = pd.Series([7, 22, 6, 15, 12, 13, None, 16, 18, 17, None, 9, 15, 11, 19, 23])

 

print(Attendees2.info())

 

RangeIndex: 16 entries, 0 to 15

Series name: None

Non-Null Count  Dtype 

--------------  ----- 

14 non-null     float64

 

### 16 values, 14 of which are non-null

 

 

print(Attendees2.isnull().sum())

 

2

 

### 2 null values

 

 

print(Attendees2.mean())

 

14.5

 

 

print(Attendees2.mean(skipna=False))

 

nan

 

 

Statistics of shape for interval/ratio data

 

Returning to the original Ren and Stimpy data set, we can obtain the skewness and kurtosis for Attendees with the skew and kurtosis functions in pandas.

 

There are different methods to calculate skew and kurtosis, so different software may report different results by default.  See the documentation for details.  It appears that pandas calculates these statistics in the manner of type=2 in R with psych::describe.

 

print(Data['Attendees'].skew())

 

-0.0506986941557648

 

 

print(Data['Attendees'].kurtosis())

 

-0.32657277416461117
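
As a cross-check on the note about calculation methods, the skew and kurtosis functions in scipy.stats compute the uncorrected (biased) versions by default; passing bias=False applies the usual small-sample correction, which should match the pandas results above.

```python
import pandas as pd
import scipy.stats as stats

Attendees = pd.Series([7, 22, 6, 15, 12, 13, 14, 16,
                       18, 17, 15, 9, 15, 11, 19, 23])

# bias=False applies the small-sample correction used by pandas
print(stats.skew(Attendees, bias=False))
print(stats.kurtosis(Attendees, bias=False))
```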

 

 

Descriptive statistics for ordinal data

 

To obtain descriptive statistics for ordinal data, the data can be examined as either numeric data or nominal data.  In Python, the data could also be coded as ordered category data.

 

Example of descriptive statistics for ordinal data

 

For this example, Arthur and Buster have each held a workshop educating the public about preventing water pollution at home.  They are interested in the education level of people who attended the workshops.  We can consider education level an ordinal variable.  They also collected information on the gender and county of participants.

 

For this data, we have manually coded Education with numbers in a separate variable EdCode.  These numbers list the education levels in order:  High school  <  Associate’s  <  Bachelor’s  <  Master’s  <  Ph.D.

 

Data = pd.read_table(sep="\\s+", filepath_or_buffer=io.StringIO("""

 

Date          Instructor     Student  Gender  County       Education  EdCode

2015-11-01  "Arthur Read"    a        female  "Elwood"     BA         3

2015-11-01  "Arthur Read"    b        female  "Bear Lake"  PHD        5

2015-11-01  "Arthur Read"    c        male    "Elwood"     BA         3

2015-11-01  "Arthur Read"    d        female  "Elwood"     MA         4

2015-11-01  "Arthur Read"    e        male    "Elwood"     HS         1

2015-11-01  "Arthur Read"    f        female  "Bear Lake"  MA         4

2015-11-01  "Arthur Read"    g        male    "Elwood"     HS         1

2015-11-01  "Arthur Read"    h        female  "Elwood"     BA         3

2015-11-01  "Arthur Read"    i        female  "Elwood"     BA         3

2015-11-01  "Arthur Read"    j        female  "Elwood"     BA         3

2015-12-01  "Buster Baxter"  k        male    "Elwood"     MA         4

2015-12-01  "Buster Baxter"  l        male    "Bear Lake"  MA         4

2015-12-01  "Buster Baxter"  m        female  "Elwood"     AA         2

2015-12-01  "Buster Baxter"  n        male    "Elwood"     AA         2

2015-12-01  "Buster Baxter"  o        other   "Elwood"     BA         3

2015-12-01  "Buster Baxter"  p        female  "Elwood"     BA         3

2015-12-01  "Buster Baxter"  q        female  "Bear Lake"  PHD        5

"""))

 

 

### Treat Date, Instructor, Gender, County, and Education as category type

 

Data['Date']       = Data['Date'].astype('category')

Data['Instructor'] = Data['Instructor'].astype('category')

Data['Gender']     = Data['Gender'].astype('category')

Data['County']     = Data['County'].astype('category')

Data['Education']  = Data['Education'].astype('category')

 

print(Data.info())

 

 #   Column      Non-Null Count  Dtype  

---  ------      --------------  -----  

 0   Date        17 non-null     category

 1   Instructor  17 non-null     category

 2   Student     17 non-null     object 

 3   Gender      17 non-null     category

 4   County      17 non-null     category

 5   Education   17 non-null     category

 6   EdCode      17 non-null     int64 

 

 

By default, the order of categories will be alphabetized.

 

print(Data['Education'].cat.categories)

 

Index(['AA', 'BA', 'HS', 'MA', 'PHD'], dtype='object')

 

 

The levels of categories can be re-ordered.  This is useful for examining output and plots.

 

EducationLevels =["HS", "AA", "BA", "MA", "PHD" ]

 

Data['Education'] = Data['Education'].cat.reorder_categories(EducationLevels)

 

print(Data['Education'].cat.categories)

 

Index(['HS', 'AA', 'BA', 'MA', 'PHD'], dtype='object')

 

 

Optional code to assign numeric values to a variable based on a category variable

 

Here, EdCode was already included as a numeric equivalent to Education.  But we could create a numeric variable to reflect the numeric equivalent of Education if it weren’t already included in the data frame.

 

Here, we'll create a new variable EdCode2 with the numeric assignment for Education.

 

This can be accomplished by creating a dictionary, here called Mapper.

 

Mapper = {"HS":1, "AA":2, "BA":3, "MA":4, "PHD":5}

 

Data["EdCode2"] = Data["Education"].map(Mapper)

 

Data['EdCode2'] = Data['EdCode2'].astype('int64')

 

 

print(Data.info())

 

#   Column      Non-Null Count  Dtype  

---  ------      --------------  -----  

 0   Date        17 non-null     category

 1   Instructor  17 non-null     category

 2   Student     17 non-null     object 

 3   Gender      17 non-null     category

 4   County      17 non-null     category

 5   Education   17 non-null     category

 6   EdCode      17 non-null     int64  

 7   EdCode2     17 non-null     int64  

 

 

This could also be accomplished by manually defining a new variable, EdCode3, based on the value of the variable Education.

 

Data['EdCode3'] = 9999

 

Data.loc[Data['Education'] == 'HS',  'EdCode3'] = 1

Data.loc[Data['Education'] == 'AA',  'EdCode3'] = 2

Data.loc[Data['Education'] == 'BA',  'EdCode3'] = 3

Data.loc[Data['Education'] == 'MA',  'EdCode3'] = 4

Data.loc[Data['Education'] == 'PHD', 'EdCode3'] = 5

 

print(Data.info())

 

 #   Column      Non-Null Count  Dtype  

---  ------      --------------  -----  

 0   Date        17 non-null     category

 1   Instructor  17 non-null     category

 2   Student     17 non-null     object 

 3   Gender      17 non-null     category

 4   County      17 non-null     category

 5   Education   17 non-null     category

 6   EdCode      17 non-null     int64  

 7   EdCode2     17 non-null     int64  

 8   EdCode3     17 non-null     int64 

 

 

Optional code to change a category variable to an ordered category variable

 

Ordinal variables can be explicitly designated as ordered category variables.

 

Note that when the array for EducationOrdered is printed, the order is indicated with less than symbols, e.g. 'HS' < 'AA'.

 

EducationLevels =["HS", "AA", "BA", "MA", "PHD" ]

 

Data['EducationOrdered'] = Data['Education'].astype(CategoricalDtype(

                             categories=EducationLevels,

                             ordered=True))

 

print(Data['EducationOrdered'])

 

Name: EducationOrdered, dtype: category

Categories (5, object): ['HS' < 'AA' < 'BA' < 'MA' < 'PHD']
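
One benefit of an ordered category variable is that comparison operators respect the declared order rather than alphabetical order.  A minimal sketch:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

EducationLevels = ["HS", "AA", "BA", "MA", "PHD"]
EdType = CategoricalDtype(categories=EducationLevels, ordered=True)

Education = pd.Series(['BA', 'PHD', 'HS', 'MA', 'AA'], dtype=EdType)

# True where the level is at least a Bachelor's degree
print(Education >= 'BA')
```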

 

 

Dropping unused variables

 

### Drop EdCode2, EdCode3, and EducationOrdered

 

Data = Data.drop(columns=['EdCode2', 'EdCode3', 'EducationOrdered'])

 

print(Data.info())

 

#   Column            Non-Null Count  Dtype  

---  ------            --------------  -----  

 0   Date              17 non-null     category

 1   Instructor        17 non-null     category

 2   Student           17 non-null     object 

 3   Gender            17 non-null     category

 4   County            17 non-null     category

 5   Education         17 non-null     category

 6   EdCode            17 non-null     int64  

 

 

Statistics of location for ordinal data

 

The median can be used as a statistic for location for ordinal data.  Other quantiles (25th percentile, and so on) also make sense for ordinal values.  Usually, the mean and standard deviation shouldn’t be used.

 

Data['EdCode'].median()

 

3.0

 

### Remember that 3 meant Bachelor’s degree in our coding

 

 

Summary = Data['EdCode'].describe()

 

print(Summary)

 

count    17.000000

mean      3.117647

std       1.166316

min       1.000000

25%       3.000000

50%       3.000000

75%       4.000000

max       5.000000

Name: EdCode, dtype: float64

 

### Remember that 1 meant High school, 3 meant Bachelor’s,
###   4 meant Master’s, and 5 meant Ph.D.

 

### Remember to ignore the mean and standard deviation values for ordinal data.

 

 

Statistics of variation for ordinal data

 

The interquartile range is the difference between the 25th percentile and the 75th percentile.  For ordinal data, the interquartile range itself may not make sense, since it requires taking a difference of ordinal values.  However, identifying the 25th and 75th percentiles is useful, as 50% of observations fall within these values.

 

In this example, the 25th percentile corresponds to Bachelor’s, and the 75th percentile corresponds to Master’s.
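
Those percentiles can be confirmed with quantile.  This sketch rebuilds the coded values so that it stands alone:

```python
import pandas as pd

# Education codes from the Arthur and Buster example:
# 1 = High school, 2 = Associate's, 3 = Bachelor's, 4 = Master's, 5 = Ph.D.
EdCode = pd.Series([3, 5, 3, 4, 1, 4, 1, 3, 3, 3, 4, 4, 2, 2, 3, 3, 5])

print(EdCode.quantile(0.25))   # 25th percentile: 3.0 (Bachelor's)

print(EdCode.quantile(0.75))   # 75th percentile: 4.0 (Master's)
```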

 

Statistics for grouped ordinal data

 

Summary = Data[['Gender', 'EdCode']].groupby('Gender').describe()

 

print(Summary)

 

       EdCode

        count mean       std  min   25%  50%   75%  max

Gender                                                

female   10.0  3.5  0.971825  2.0  3.00  3.0  4.00  5.0

male      6.0  2.5  1.378405  1.0  1.25  2.5  3.75  4.0

other     1.0  3.0       NaN  3.0  3.00  3.0  3.00  3.0

 

### Remember that 1 meant High school, 3 meant Bachelor’s,
###   4 meant Master’s, and 5 meant Ph.D.

 

### Remember to ignore the mean and standard deviation values for ordinal data.

 

 

Summary = Data[['Gender', 'County', 'EdCode']].groupby(['Gender', 'County']).describe()

 

print(Summary)

 

                 EdCode                                           

                  count      mean      std  min  25%  50%  75%  max

Gender County                                                     

female Bear Lake    3.0  4.666667  0.57735  4.0  4.5  5.0  5.0  5.0

       Elwood       7.0  3.000000  0.57735  2.0  3.0  3.0  3.0  4.0

male   Bear Lake    1.0  4.000000      NaN  4.0  4.0  4.0  4.0  4.0

       Elwood       5.0  2.200000  1.30384  1.0  1.0  2.0  3.0  4.0

other  Elwood       1.0  3.000000      NaN  3.0  3.0  3.0  3.0  3.0

 

### Remember that 1 meant High school, 3 meant Bachelor’s,
###   4 meant Master’s, and 5 meant Ph.D.

 

### Remember to ignore the mean and standard deviation values for ordinal data.

 

 

Statistics for ordinal data treated as nominal data

 

Ordinal data can also be summarized with the methods used for nominal data.

 

If the levels of the categorical variables are not already in the correct order, it is useful to reorder the variables.

 

EducationLevels =["HS", "AA", "BA", "MA", "PHD" ]

 

Data['Education'] = Data['Education'].cat.reorder_categories(EducationLevels)

 

print(Data['Education'].cat.categories)

 

Index(['HS', 'AA', 'BA', 'MA', 'PHD'], dtype='object')

 

 

Some basic information for categorical variables can be obtained with describe in pandas.

 

Summary = Data.describe(include='category')

 

print(Summary)

 

              Date   Instructor  Gender  County Education

count           17           17      17      17        17

unique           2            2       3       2         5

top     2015-11-01  Arthur Read  female  Elwood        BA

freq            10           10      10      13         7

 

 

The counts for each level of a categorical variable can be found with the value_counts function.

 

Data['Education'].value_counts()

 

BA     7

MA     4

HS     2

AA     2

PHD    2

 

But note that the levels of Education are ordered by count in the output.  We can specify the order of the categories in the output with the reindex function, adding fill_value to supply a value for any empty cells.

 

Data['Education'].value_counts().reindex(['HS','AA','BA','MA','PHD'], fill_value=0)

 

HS     2

AA     2

BA     7

MA     4

PHD    2

 

 

The crosstab function in pandas can be used to tabulate counts among combinations of categorical variables.

 

For one variable, the 'Count' in index='Count' simply serves as a placeholder label for the row index.

 

Table = pd.crosstab(columns=Data['Education'], index='Count')

 

print(Table)

 

Education  HS  AA  BA  MA  PHD

row_0                        

Count       2   2   7   4    2

 

 

The normalize option can be used to report the proportions.  Here, normalize='index' is used to report the proportion within rows.

 

Table = pd.crosstab(columns=Data['Education'], index='Count', normalize='index')

 

print(Table)

 

Education        HS        AA        BA        MA       PHD

row_0                                                     

Count      0.117647  0.117647  0.411765  0.235294  0.117647

 

 

Grouped data

 

Table = pd.crosstab(columns=Data['Education'], index=Data['Gender'])

 

print(Table)

 

Education  HS  AA  BA  MA  PHD

Gender                       

female      0   1   5   2    2

male        2   1   1   2    0

other       0   0   1   0    0

 

 

Note, here, that normalize='index' is used to report the proportion within rows. Other options are 'columns' and 'all' to determine the proportion within columns or within the table.

 

Table = pd.crosstab(columns=Data['Education'], index=Data['Gender'], normalize='index')

 

print(Table)

 

Education        HS        AA        BA        MA  PHD

Gender                                               

female     0.000000  0.100000  0.500000  0.200000  0.2

male       0.333333  0.166667  0.166667  0.333333  0.0

other      0.000000  0.000000  1.000000  0.000000  0.0
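
For comparison, this sketch applies normalize='columns' to the same Gender by Education table, rebuilding the two variables inline so the example stands alone; each column of the result sums to 1.

```python
import pandas as pd

# Gender and Education for the 17 students, in the order given above
Gender    = ['female', 'female', 'male', 'female', 'male', 'female',
             'male', 'female', 'female', 'female', 'male', 'male',
             'female', 'male', 'other', 'female', 'female']
Education = ['BA', 'PHD', 'BA', 'MA', 'HS', 'MA', 'HS', 'BA', 'BA',
             'BA', 'MA', 'MA', 'AA', 'AA', 'BA', 'BA', 'PHD']

Table = pd.crosstab(index=pd.Series(Gender, name='Gender'),
                    columns=pd.Series(Education, name='Education'),
                    normalize='columns')

print(Table)
```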

 

 

Two-way grouped data

 

When the resulting table would have three or more dimensions, pandas reports it in long format.

 

Table = Data[['County', 'Gender', 'Education']].groupby(['County', 'Gender']).value_counts()

 

print(Table)

 

County     Gender  Education

Bear Lake  female  PHD          2

                   MA           1

                   HS           0

                   AA           0

                   BA           0

           male    MA           1

                   PHD          0

                   BA           0

                   AA           0

                   HS           0

           other   PHD          0

                   MA           0

                   BA           0

                   AA           0

                   HS           0

Elwood     female  BA           5

                   MA           1

                   AA           1

                   PHD          0

                   HS           0

           male    HS           2

                   AA           1

                   BA           1

                   MA           1

                   PHD          0

           other   BA           1

                   HS           0

                   AA           0

                   MA           0

                   PHD          0

 

 

Another approach would be to separate out the levels of one of the variables to produce multiple two-dimensional tables.  Here, the data are subsetted to include only Bear Lake County.  The dropna=False option retains the columns or rows with all 0 counts.

 

DataBear = Data.loc[Data['County'] == 'Bear Lake']

 

TableBear = pd.crosstab(columns=DataBear['Education'], index=DataBear['Gender'],

                         dropna=False)

 

print(TableBear)

 

Education  HS  AA  BA  MA  PHD

Gender                       

female      0   0   0   1    2

male        0   0   0   1    0

other       0   0   0   0    0

 

 

DataElwood = Data.loc[Data['County'] == 'Elwood']

 

TableElwood = pd.crosstab(columns=DataElwood['Education'], index=DataElwood['Gender'],

                         dropna=False)

 

print(TableElwood)

 

Education  HS  AA  BA  MA  PHD

Gender                       

female      0   1   5   1    0

male        2   1   1   1    0

other       0   0   1   0    0

 

 

Descriptive statistics for nominal data

 

Example of descriptive statistics for nominal data

 

This example will use the same Arthur and Buster data set used above.  The analysis here will be similar to the analysis for treating ordinal data as nominal above.  There are variants of some analyses in that section.

 

Data = pd.read_table(sep="\\s+", filepath_or_buffer=io.StringIO("""

 

Date          Instructor     Student  Gender  County       Education  EdCode

2015-11-01  "Arthur Read"    a        female  "Elwood"     BA         3

2015-11-01  "Arthur Read"    b        female  "Bear Lake"  PHD        5

2015-11-01  "Arthur Read"    c        male    "Elwood"     BA         3

2015-11-01  "Arthur Read"    d        female  "Elwood"     MA         4

2015-11-01  "Arthur Read"    e        male    "Elwood"     HS         1

2015-11-01  "Arthur Read"    f        female  "Bear Lake"  MA         4

2015-11-01  "Arthur Read"    g        male    "Elwood"     HS         1

2015-11-01  "Arthur Read"    h        female  "Elwood"     BA         3

2015-11-01  "Arthur Read"    i        female  "Elwood"     BA         3

2015-11-01  "Arthur Read"    j        female  "Elwood"     BA         3

2015-12-01  "Buster Baxter"  k        male    "Elwood"     MA         4

2015-12-01  "Buster Baxter"  l        male    "Bear Lake"  MA         4

2015-12-01  "Buster Baxter"  m        female  "Elwood"     AA         2

2015-12-01  "Buster Baxter"  n        male    "Elwood"     AA         2

2015-12-01  "Buster Baxter"  o        other   "Elwood"     BA         3

2015-12-01  "Buster Baxter"  p        female  "Elwood"     BA         3

2015-12-01  "Buster Baxter"  q        female  "Bear Lake"  PHD        5

"""))

 

 

### Treat Date, Instructor, Gender, County, and Education as category type

 

Data['Date']       = Data['Date'].astype('category')

Data['Instructor'] = Data['Instructor'].astype('category')

Data['Gender']     = Data['Gender'].astype('category')

Data['County']     = Data['County'].astype('category')

Data['Education']  = Data['Education'].astype('category')

 

print(Data.info())

 

#   Column      Non-Null Count  Dtype  

---  ------      --------------  -----  

 0   Date        17 non-null     category

 1   Instructor  17 non-null     category

 2   Student     17 non-null     object 

 3   Gender      17 non-null     category

 4   County      17 non-null     category

 5   Education   17 non-null     category

 6   EdCode      17 non-null     int64  

dtypes: category(5), int64(1), object(1)

 

 

Some basic information for categorical variables can be obtained with describe in pandas.

 

Summary = Data.describe(include='category')

 

print(Summary)

 

              Date   Instructor  Gender  County Education

count           17           17      17      17        17

unique           2            2       3       2         5

top     2015-11-01  Arthur Read  female  Elwood        BA

freq            10           10      10      13         7

 

 

One-sample data


Table = pd.crosstab(columns=Data['Gender'], index='Count')

 

print(Table)

 

Gender  female  male  other

row_0                     

Count       10     6      1


### Counts of each level of Gender

 

 

Table = pd.crosstab(columns=Data['Gender'], index='Count', normalize='index')

 

print(Table)

 

Gender    female      male     other

row_0                              

Count   0.588235  0.352941  0.058824

 

### Proportion of each level Gender for each row
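The same one-sample counts and proportions can also be obtained directly with the value_counts method in pandas.  This is a sketch, with the Gender column rebuilt directly for illustration rather than read from the data frame above.

```python
import pandas as pd

# Sketch (values rebuilt from the Arthur and Buster data above):
# 10 female, 6 male, 1 other
Gender = pd.Series(["female"]*10 + ["male"]*6 + ["other"], dtype="category")

print(Gender.value_counts())                 # counts of each level
print(Gender.value_counts(normalize=True))   # proportions of each level
```

By default, value_counts sorts the levels by decreasing frequency.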



One-way data

 

Table = pd.crosstab(columns=Data['Gender'], index=Data['Date'])

 

print(Table)

 

Gender      female  male  other

Date                          

2015-11-01       7     3      0

2015-12-01       3     3      1

 

### Counts for Gender within each level of Date



TableProp = pd.crosstab(columns=Data['Gender'], index=Data['Date'], normalize='index')

 

print(TableProp)

 

Gender        female      male     other

Date                                   

2015-11-01  0.700000  0.300000  0.000000

2015-12-01  0.428571  0.428571  0.142857

### Proportion of each level of Gender for each row
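In addition to normalize='index', crosstab accepts normalize='columns' and normalize='all' (standard pandas options).  The following sketch rebuilds the Date and Gender columns directly from the table above for illustration.

```python
import pandas as pd

# Sketch: Date and Gender rebuilt from the counts in the table above
Date   = pd.Series(["2015-11-01"]*10 + ["2015-12-01"]*7)
Gender = pd.Series(["female"]*7 + ["male"]*3
                   + ["female"]*3 + ["male"]*3 + ["other"])

# normalize='all' divides every cell by the grand total
TableAll = pd.crosstab(index=Date, columns=Gender, normalize='all')
print(TableAll)

# normalize='columns' gives the proportion of each Date within each Gender
TableCol = pd.crosstab(index=Date, columns=Gender, normalize='columns')
print(TableCol)
```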


sum(Table.sum())

 

17

 

### Sum of observations in the table

 

 

Table.sum(axis=1)

 

Date

2015-11-01    10

2015-12-01     7

 

### Sum of observations in each row of the table

 

 

Table.sum(axis=0)

 

Gender

female    10

male       6

other      1

 

### Sum of observations in each column of the table
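Rather than summing rows and columns separately, crosstab can append the totals itself with the margins option.  A sketch, with the variables rebuilt directly for illustration:

```python
import pandas as pd

# Sketch: the same Date-by-Gender table with totals appended
Date   = pd.Series(["2015-11-01"]*10 + ["2015-12-01"]*7)
Gender = pd.Series(["female"]*7 + ["male"]*3
                   + ["female"]*3 + ["male"]*3 + ["other"])

# margins=True adds an 'All' row and an 'All' column of totals
Table = pd.crosstab(index=Date, columns=Gender, margins=True)
print(Table)
```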

 

 

Two-way data


Table = Data[['Date', 'County', 'Gender']].groupby(['Date', 'County']).value_counts()

 

print(Table)

 

   ### Note the placement of Date and County in the groupby call,

   ###  and the resultant placement in the output.

 

Date        County     Gender

2015-11-01  Bear Lake  female    2

                       male      0

                       other     0

            Elwood     female    5

                       male      3

                       other     0

2015-12-01  Bear Lake  female    1

                       male      1

                       other     0

            Elwood     female    2

                       male      2

                       other     1
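The same two-way counts can also be produced with crosstab by passing a list of Series as the index.  This is a sketch, with the three variables rebuilt directly from the data above for illustration.

```python
import pandas as pd

# Sketch: Date, County, and Gender for students a through q,
# copied from the data set above
Date   = pd.Series(["2015-11-01"]*10 + ["2015-12-01"]*7, name="Date")
County = pd.Series(["Elwood", "Bear Lake", "Elwood", "Elwood", "Elwood",
                    "Bear Lake", "Elwood", "Elwood", "Elwood", "Elwood",
                    "Elwood", "Bear Lake", "Elwood", "Elwood", "Elwood",
                    "Elwood", "Bear Lake"], name="County")
Gender = pd.Series(["female", "female", "male", "female", "male", "female",
                    "male", "female", "female", "female", "male", "male",
                    "female", "male", "other", "female", "female"],
                   name="Gender")

# A list of Series for index produces a table with a two-level row index
Table = pd.crosstab(index=[Date, County], columns=Gender)
print(Table)
```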



Levels for category variables

 

By default, pandas will alphabetize the levels of a category variable.  The order of the levels can be changed, for example, to change the order in which groups are plotted, or which groups appear at the top of a table.

 

Revisiting the Arthur and Buster example, we’ll explicitly change the category variables from type object to type category.

 

Data = pd.read_table(sep="\\s+", filepath_or_buffer=io.StringIO("""

 

Date          Instructor     Student  Gender  County       Education  EdCode

2015-11-01  "Arthur Read"    a        female  "Elwood"     BA         3

2015-11-01  "Arthur Read"    b        female  "Bear Lake"  PHD        5

2015-11-01  "Arthur Read"    c        male    "Elwood"     BA         3

2015-11-01  "Arthur Read"    d        female  "Elwood"     MA         4

2015-11-01  "Arthur Read"    e        male    "Elwood"     HS         1

2015-11-01  "Arthur Read"    f        female  "Bear Lake"  MA         4

2015-11-01  "Arthur Read"    g        male    "Elwood"     HS         1

2015-11-01  "Arthur Read"    h        female  "Elwood"     BA         3

2015-11-01  "Arthur Read"    i        female  "Elwood"     BA         3

2015-11-01  "Arthur Read"    j        female  "Elwood"     BA         3

2015-12-01  "Buster Baxter"  k        male    "Elwood"     MA         4

2015-12-01  "Buster Baxter"  l        male    "Bear Lake"  MA         4

2015-12-01  "Buster Baxter"  m        female  "Elwood"     AA         2

2015-12-01  "Buster Baxter"  n        male    "Elwood"     AA         2

2015-12-01  "Buster Baxter"  o        other   "Elwood"     BA         3

2015-12-01  "Buster Baxter"  p        female  "Elwood"     BA         3

2015-12-01  "Buster Baxter"  q        female  "Bear Lake"  PHD        5

"""))

 

 

### Treat Date, Instructor, Gender, County, and Education as category type

 

Data['Date']       = Data['Date'].astype('category')

Data['Instructor'] = Data['Instructor'].astype('category')

Data['Gender']     = Data['Gender'].astype('category')

Data['County']     = Data['County'].astype('category')

Data['Education']  = Data['Education'].astype('category')

 

print(Data.info())

 

#   Column      Non-Null Count  Dtype  

---  ------      --------------  -----  

 0   Date        17 non-null     category

 1   Instructor  17 non-null     category

 2   Student     17 non-null     object 

 3   Gender      17 non-null     category

 4   County      17 non-null     category

 5   Education   17 non-null     category

 6   EdCode      17 non-null     int64  

 

 

Below, note that the levels of the category variables were alphabetized by default.  That is, even though Elwood was found in the data before Bear Lake, pandas treats Bear Lake as the first level in the variable County.

 

print(Data['County'].cat.categories)

 

Index(['Bear Lake', 'Elwood'], dtype='object')

 

 

We can reorder the category levels in an order of our choosing.

 

print(Data['Education'].cat.categories)

 

Index(['AA', 'BA', 'HS', 'MA', 'PHD'], dtype='object')

 

 

EducationLevels = ["HS", "AA", "BA", "MA", "PHD"]

 

Data['Education'] = Data['Education'].cat.reorder_categories(EducationLevels)

 

print(Data['Education'].cat.categories)

 

Index(['HS', 'AA', 'BA', 'MA', 'PHD'], dtype='object')
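The effect of reordering shows up wherever the category order is used, for example when sorting.  A minimal sketch with a small Education variable, one observation per level:

```python
import pandas as pd

# Sketch: a small Education variable; levels alphabetize by default
Education = pd.Series(["BA", "PHD", "HS", "MA", "AA"], dtype="category")

print(Education.cat.categories)    # alphabetical by default

Education = Education.cat.reorder_categories(["HS", "AA", "BA", "MA", "PHD"])

print(Education.sort_values())     # now sorts in the stated order
```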



We can also indicate a category variable should be treated as ordered categories.

 

EducationLevels =["HS", "AA", "BA", "MA", "PHD" ]

 

Data['EducationOrdered'] = Data['Education'].astype(CategoricalDtype(

                             categories=EducationLevels,

                             ordered=True))

 

print(Data['EducationOrdered'])

 

Name: EducationOrdered, dtype: category

Categories (5, object): ['HS' < 'AA' < 'BA' < 'MA' < 'PHD']



Note that in the actions above, we are not changing the order of the rows in the data frame, simply which level is treated internally as the first category, and so on.
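One practical consequence of ordered categories is that comparison operators, min, and max become meaningful.  A minimal sketch, again with one observation per level:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Sketch: an ordered Education variable permits comparisons
EdType = CategoricalDtype(categories=["HS", "AA", "BA", "MA", "PHD"],
                          ordered=True)
Education = pd.Series(["BA", "PHD", "HS", "MA", "AA"], dtype=EdType)

print(Education.min())               # lowest level present
print(Education.max())               # highest level present
print((Education >= "BA").sum())     # observations at BA or above
```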

 

Optional analyses

 

Robust estimators:  trimmed mean and Winsorized mean

 

In this example, the means and medians for Team A and Team B are identical (mean of 70 and median of 80).  However, the trimmed mean for Team A is less than for Team B (77 vs. 80), and the Winsorized mean for Team A is less than for Team B (75 vs. 78).  According to either the trimmed mean method or the Winsorized mean method, Team B had a higher score when extreme observations were removed or replaced.


TeamA = pd.Series([100, 90, 80, 60, 20])

 

TeamB = pd.Series([80, 80, 80, 80, 30])

 

 

print(np.mean(TeamA))

 

70.0

 

 

print(np.median(TeamA))

 

80.0

 

 

The trimmed mean can be calculated with the trim_mean function in SciPy stats. Here, both the largest 20% and smallest 20% of observations are trimmed from the array.

 

stats.trim_mean(TeamA, proportiontocut = 0.20)

 

76.66666666666667

 

 

The trimmed values can be displayed by subsetting the array.

 

TrimA = TeamA[(TeamA > np.percentile(TeamA, 20)) & (TeamA < np.percentile(TeamA, 80))]

 

print(TrimA)

 

1    90

2    80

3    60

 

 

print(np.mean(TrimA))

 

76.66666666666667

 

### The trimmed mean at 20%.

 

 

The clip function in NumPy Winsorizes the data. Here the lowest 20% of observations are replaced with the value of the 20th percentile, and the highest 20% of observations are replaced with the value of the 80th percentile.  Different methods can be used for the calculation of the percentiles.

 

WinsorA = np.clip(TeamA, np.percentile(TeamA, 20), np.percentile(TeamA, 80))

 

print(WinsorA)

 

0    92

1    90

2    80

3    60

4    52

 

### This uses the same method for percentiles as does R by default.

 

 

print(np.mean(WinsorA))

 

74.8

 

### The Winsorized mean at 20%.

 

 

scipy.stats.mstats.winsorize will winsorize the data with simpler syntax, but it doesn’t offer options for how the percentiles are calculated.  So, here the result is different.  For larger continuous samples, there won’t usually be large differences in the results of the two methods.

 

WinsorAAlt = mstats.winsorize(TeamA, limits=[0.2, 0.2])

 

print(WinsorAAlt)

 

[90 90 80 60 60]

 

 

print(np.mean(WinsorAAlt))

 

76.0

 

 

The following code replicates the analysis for Team B.

 

TeamB = pd.Series([80, 80, 80, 80, 30])

 

print(np.median(TeamB))

 

80.0

 

 

print(np.mean(TeamB))

 

70.0

 

 

stats.trim_mean(TeamB, proportiontocut = 0.20)

 

80.0

 

### Trimmed mean for Team B at 20%

 

 

WinsorB = np.clip(TeamB, np.percentile(TeamB, 20), np.percentile(TeamB, 80))

 

print(WinsorB)

 

0    80

1    80

2    80

3    80

4    70

 

 

print(np.mean(WinsorB))

 

78.0

 

### Winsorized mean for Team B at 20%

 

 

Geometric mean

 

The geometric mean can be calculated with the gmean function in SciPy stats.  It can also be calculated with nested log, mean, and exp functions in NumPy.

 

Bacteria example

 

Bacteria = pd.Series([20, 40, 50, 60, 100, 120, 150, 200, 1000])

 

print(stats.gmean(Bacteria))

 

98.38886970015734

 

 

np.exp(np.mean(np.log(Bacteria)))

 

98.38886970015734

 

 

Annual return example

 

Note that for the annual return example, 1 must be added to all the values, and then subtracted at the end to return the result to a proportion of return.

 

Return = pd.Series([-0.80, 0.20, 0.30, 0.30, 0.20, 0.30])

 

print(stats.gmean(Return+1)-1)

 

-0.07344666385976306
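As a check on this result, the geometric mean of 1 + return is the nth root of the cumulative growth factor over all periods.  A minimal sketch:

```python
import numpy as np
import pandas as pd

Return = pd.Series([-0.80, 0.20, 0.30, 0.30, 0.20, 0.30])

growth = float(np.prod(Return + 1))    # total growth over all periods
n = len(Return)

equivalent = growth ** (1/n) - 1       # per-period equivalent return
print(equivalent)
```

This reproduces the gmean result above to within floating-point precision.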

 

Harmonic mean

 

The harmonic mean can be calculated with the hmean function in SciPy stats.  It can also be calculated by taking the inverse of each value, averaging these values, and reporting the inverse of this result.

 

Speed = pd.Series([20, 40, 50, 60, 100, 120, 150, 200, 1000])

 

print(stats.hmean(Speed))

 

63.08411214953271

 

 

print(1/np.mean(1/Speed))

 

63.08411214953271
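The standard-library statistics module also provides harmonic_mean, and, in Python 3.8 and later, geometric_mean, so these statistics can be calculated without SciPy.  A minimal sketch with the same values:

```python
import statistics

Speed = [20, 40, 50, 60, 100, 120, 150, 200, 1000]

print(statistics.harmonic_mean(Speed))
print(statistics.geometric_mean(Speed))
```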