Descriptive statistics in SAEPER
For a discussion of various descriptive statistics, see the corresponding chapter in Summary and Analysis of Extension Program Evaluation in R (rcompanion.org/handbook/C_02.html).
Importing packages in this chapter
The following commands will import the required packages used in this chapter and assign them their common aliases. You may need to install these packages first.
import io
import numpy as np
import pandas as pd
import scipy.stats as stats
import scipy.stats.mstats as mstats
from pandas.api.types import CategoricalDtype
Descriptive statistics for interval/ratio data
For this example, imagine that Ren and Stimpy have each held eight workshops educating the public about water conservation at home. They are interested in how many people showed up to the workshops.
Data = pd.read_table(sep="\\s+", filepath_or_buffer=io.StringIO("""
Instructor Location Attendees
Ren North 7
Ren North 22
Ren North 6
Ren North 15
Ren South 12
Ren South 13
Ren South 14
Ren South 16
Stimpy North 18
Stimpy North 17
Stimpy North 15
Stimpy North 9
Stimpy South 15
Stimpy South 11
Stimpy South 19
Stimpy South 23
"""))
### Convert Instructor and Location to category type
Data['Instructor'] = Data['Instructor'].astype('category')
Data['Location'] = Data['Location'].astype('category')
### Display some summary statistics for the data frame
print(Data.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Instructor 16 non-null category
1 Location 16 non-null category
2 Attendees 16 non-null int64
print(Data.describe())
Attendees
count 16.000000
mean 14.500000
std 4.830459
min 6.000000
25% 11.750000
50% 15.000000
75% 17.250000
max 23.000000
print(Data.describe(include='category'))
Instructor Location
count 16 16
unique 2 2
top Ren North
freq 8 8
Functions for sum and count
For a pandas data frame, the sum of a variable can be found with the sum function, and the number of observations can be found with the count function.
Because the data are housed in a data frame, we can use the convention Data['Attendees'] to access the variable Attendees within the data frame Data.
Data['Attendees'].sum()
232
Data['Attendees'].count()
16
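Note that count excludes missing values, whereas Python's built-in len function counts all elements, missing or not. With no missing values here, the two agree.

len(Data['Attendees'])

16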
Statistics of location for interval/ratio data
Mean
The mean is simply the sum of the values divided by the number of values.
Data['Attendees'].sum() / Data['Attendees'].count()
14.5
The mean function in pandas will return the mean.
Data['Attendees'].mean()
14.5
With skewed data, the mean may not be the best representative of the center of the data. Here, a single extreme value pulls the mean of family incomes in a town far above the typical income.
IncomeList = [49000, 44000, 25000, 18000, 32000, 47000, 37000, 45000, 36000, 2000000]
Income = pd.Series(IncomeList)
Income.mean()
233300.0
Median
The pandas package can also calculate the median for a variable in a data frame or for an array.
Data['Attendees'].median()
15.0
For individual arrays, we can use pd.Series to define the array.

Attendees2 = pd.Series([7, 22, 6, 15, 12, 13, 14, 16, 18, 17, 15, 9, 15, 11, 19, 23])

Attendees2.median()

15.0

In the following example, we’ll change Stimpy’s last Attendees value from 23 to 1000 to show that this extreme value doesn’t affect the median.

Attendees2 = pd.Series([7, 22, 6, 15, 12, 13, 14, 16, 18, 17, 15, 9, 15, 11, 19, 1000])

Attendees2.median()

15.0
The median income for the town discussed above is $40,500. Half the families in the town have an income above this amount, and half have an income below this amount. Note that this differs considerably from the mean calculated above.
Income = pd.Series(
[49000, 44000, 25000, 18000, 32000, 47000, 37000, 45000, 36000, 2000000])
Income.median()
40500.0
Mode
For our Ren and Stimpy example, the value 15 occurs three times in Attendees and so is the mode.
Data['Attendees'].mode()
0    15
dtype: int64
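The most frequent value can also be found by tabulating the values with the value_counts function and selecting the index of the largest count with idxmax.

Data['Attendees'].value_counts().idxmax()

15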
Statistics of variation for interval/ratio data
Standard deviation
Data['Attendees'].std()
4.83045891539648
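The sample standard deviation is the square root of the sum of squared deviations from the mean divided by n – 1. As a check, we can compute it directly.

SS = ((Data['Attendees'] - Data['Attendees'].mean())**2).sum()

print(np.sqrt(SS / (Data['Attendees'].count() - 1)))

4.83045891539648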
Income = pd.Series([49000, 44000, 25000, 18000, 32000, 47000, 37000, 45000, 36000, 2000000])
Income.std()
620834.5727056981
Median absolute deviation
The median absolute deviation for a variable in a data frame or for an array can be calculated with the stats module in SciPy. The scale option adjusts the result; for example, scale='normal' returns a result comparable to the standard deviation for normally distributed samples. scale='normal' is equivalent to scale=0.67449.
stats.median_abs_deviation(Data['Attendees'], scale='normal')
4.447805008228439
stats.median_abs_deviation(Data['Attendees'])
3.0
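As a check, the unscaled result is simply the median of the absolute deviations from the median.

print((Data['Attendees'] - Data['Attendees'].median()).abs().median())

3.0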
Income = pd.Series([49000, 44000, 25000, 18000, 32000, 47000, 37000, 45000, 36000, 2000000])
stats.median_abs_deviation(Income, scale='normal')
11119.512520571097
stats.median_abs_deviation(Income)
7500.0
Standard error of the mean
The standard error of the mean can be calculated as the standard deviation divided by the square root of the number of observations. It can also be calculated with the sem function in SciPy stats.
SD = Data['Attendees'].std()
Length = Data['Attendees'].count()
print(SD / np.sqrt(Length))
1.20761472884912
stats.sem(Data['Attendees'])
1.20761472884912
Five-number summary, quartiles, percentiles
The five-number summary, consisting of the minimum, 25th percentile, median, 75th percentile, and maximum, can be calculated with the describe function in pandas.
Summary = Data['Attendees'].describe()
print(Summary)
count 16.000000
mean 14.500000
std 4.830459
min 6.000000
25% 11.750000
50% 15.000000
75% 17.250000
max 23.000000
Quantiles can also be calculated with the quantile function in pandas and with the quantile function in NumPy. The function in NumPy has more options for estimation methods for quantiles.
Data['Attendees'].quantile(0.25)
11.75
### Note that the default method matches the result from the default in R.
Data['Attendees'].quantile(0.25, interpolation='midpoint')
11.5
### In this case the 'midpoint' option reports the midpoint between 11 and 12,
### the 4th and 5th observations when Attendees is sorted.
np.quantile(Data['Attendees'], 0.25)
11.75
np.quantile(Data['Attendees'], 0.25, method='midpoint')
11.5
Calculate the 95th percentile.
Data['Attendees'].quantile(0.95)
22.25
Statistics for grouped interval/ratio data
Continuing to examine the Ren and Stimpy data, summary statistics by group and by combination of groups can be obtained with the describe and groupby functions in pandas.
Summary = Data.groupby('Instructor')['Attendees'].describe()
print(Summary)
count mean std min 25% 50% 75% max
Instructor
Ren 8.0 13.125 5.083236 6.0 10.75 13.5 15.25 22.0
Stimpy 8.0 15.875 4.454131 9.0 14.00 16.0 18.25 23.0
Summary2 = Data.groupby(['Instructor', 'Location'])['Attendees'].describe()
print(Summary2)
count mean std min 25% 50% 75% max
Instructor Location
Ren North 4.0 12.50 7.505553 6.0 6.75 11.0 16.75 22.0
South 4.0 13.75 1.707825 12.0 12.75 13.5 14.50 16.0
Stimpy North 4.0 14.75 4.031129 9.0 13.50 16.0 17.25 18.0
South 4.0 17.00 5.163978 11.0 14.00 17.0 20.00 23.0
Adding variables to a data frame
For this example, we can add the standard error of the mean to summary statistics of the grouped data.
Summary2['sem'] = Summary2['std'] / np.sqrt(Summary2['count'])
print(Summary2)
count mean std ... 75% max sem
Instructor Location ...
Ren North 4.0 12.50 7.505553 ... 16.75 22.0 3.752777
South 4.0 13.75 1.707825 ... 14.50 16.0 0.853913
Stimpy North 4.0 14.75 4.031129 ... 17.25 18.0 2.015564
South 4.0 17.00 5.163978 ... 20.00 23.0 2.581989
Other statistics for grouped variables
The groupby function in pandas can be used to calculate summary statistics not included in describe. Here, the 95th percentile is calculated for each combination of Instructor and Location.
Summary3 = Data.groupby(['Instructor', 'Location'])['Attendees'].quantile(0.95)
print(Summary3)
Instructor Location
Ren North 20.95
South 15.70
Stimpy North 17.85
South 22.40
Summary3 in this case is a pandas series. We can change the name of the variable for convenience.
print(Summary3.name)

Attendees
Summary3.name = '95thPercentile'
print(Summary3)
Instructor Location
Ren North 20.95
South 15.70
Stimpy North 17.85
South 22.40
Name: 95thPercentile, dtype: float64
We can also convert Summary3 to a data frame.
Summary4 = pd.DataFrame(Summary3)
print(Summary4)
95thPercentile
Instructor Location
Ren North 20.95
South 15.70
Stimpy North 17.85
South 22.40
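Equivalently, the to_frame method in pandas performs the same conversion.

Summary4 = Summary3.to_frame()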
Summaries for data frames
Summary statistics for variables in a data frame can be obtained with the pandas functions info and describe.
print(Data.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Instructor 16 non-null category
1 Location 16 non-null category
2 Attendees 16 non-null int64
print(Data.describe())
Attendees
count 16.000000
mean 14.500000
std 4.830459
min 6.000000
25% 11.750000
50% 15.000000
75% 17.250000
max 23.000000
Dealing with missing values
For this example, we’ll change two values of Attendees to missing values. Here, the missing values are coded with NA, which is the convention in R. read_table will convert them to nan values, which is the usual convention in Python for missing values or other values that are “not a number”.
Note in the summary from the info call that there are 14 non-null values for Attendees.
Data2 = pd.read_table(sep="\\s+", filepath_or_buffer=io.StringIO("""
Instructor Location Attendees
Ren North 7
Ren North 22
Ren North 6
Ren North 15
Ren South 12
Ren South 13
Ren South NA
Ren South 16
Stimpy North 18
Stimpy North 17
Stimpy North NA
Stimpy North 9
Stimpy South 15
Stimpy South 11
Stimpy South 19
Stimpy South 23
"""))
print(Data2.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Instructor 16 non-null object
1 Location 16 non-null object
2 Attendees 14 non-null float64
By default, many pandas functions will remove nan values before calculating. If nan values are included in a calculation, the returned result will often be nan.
Data2['Attendees'].mean()
14.5
Data2['Attendees'].mean(skipna=False)
nan
Removing missing values
The pandas function dropna can remove rows with missing values. Observations with no missing values for any variable are sometimes called “complete cases”. The process is sometimes called “listwise deletion”.
Note here that the data frame DataComplete has 14 observations. The two rows with nan observations in Attendees have been removed from the data frame.
DataComplete = Data2.dropna()
print(DataComplete.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Instructor 14 non-null object
1 Location 14 non-null object
2 Attendees 14 non-null float64
The dropna function can also remove any columns, or variables, that have missing values. Note here that the variable Attendees has been removed entirely, but Instructor and Location retain 16 observations.
DataCompleteColumns = Data2.dropna(axis=1)
print(DataCompleteColumns.info())
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Instructor 16 non-null object
1 Location 16 non-null object
Counting missing values
The isnull function will return True if there are any missing values in a data frame or column in a data frame.
Data2.isnull().values.any()
True
Data2['Attendees'].isnull().values.any()
True
Data2['Location'].isnull().values.any()
False
Likewise, the isnull and sum functions can be combined to count the missing values in a data frame or variable in a data frame.
Data2.isnull().sum()
Instructor 0
Location 0
Attendees 2
### Number of nan values in each variable
Data2['Attendees'].isnull().sum()
2
### Number of nan values in Attendees in Data2
Data2['Location'].isnull().sum()
0
### Number of nan values in Location in Data2
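Conversely, combining the notnull and sum functions counts the non-missing values.

Data2['Attendees'].notnull().sum()

14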
Missing values in arrays
For a pd.Series, missing values can be designated with None.
Attendees2 = pd.Series([7, 22, 6, 15, 12, 13, None, 16, 18, 17, None, 9, 15, 11, 19, 23])
print(Attendees2.info())
RangeIndex: 16 entries, 0 to 15
Series name: None
Non-Null Count Dtype
-------------- -----
14 non-null float64
### 16 values, 14 of which are non-null
print(Attendees2.isnull().sum())
2
### 2 null values
print(Attendees2.mean())
14.5
print(Attendees2.mean(skipna=False))
nan
Statistics of shape for interval/ratio data
Returning to the original Ren and Stimpy data set, we can obtain the skewness and kurtosis for Attendees with the skew and kurtosis functions in pandas.
There are different methods to calculate skew and kurtosis, so different software may report different results by default. See the documentation for details. It appears that pandas calculates these statistics in the manner of type=2 in R with psych::describe.
print(Data['Attendees'].skew())
-0.0506986941557648
print(Data['Attendees'].kurtosis())
-0.32657277416461117
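For comparison, the skew and kurtosis functions in SciPy stats calculate bias-corrected statistics when bias=False is specified; these should match the pandas results, though it is worth verifying the calculation methods in the documentation for your installed versions.

print(stats.skew(Data['Attendees'], bias=False))

print(stats.kurtosis(Data['Attendees'], bias=False))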
Descriptive statistics for ordinal data
To obtain descriptive statistics for ordinal data, the data can be examined as either numeric data or nominal data. In Python, the data could also be coded as ordered category data.
Example of descriptive statistics for ordinal data
For this example, Arthur and Buster have each held a workshop educating the public about preventing water pollution at home. They are interested in the education level of people who attended the workshops. We can consider education level an ordinal variable. They also collected information on the gender and county of participants.
For these data, we have manually coded Education with numbers in a separate variable EdCode. These numbers list the education levels in order: High school < Associate’s < Bachelor’s < Master’s < Ph.D.
Data = pd.read_table(sep="\\s+", filepath_or_buffer=io.StringIO("""
Date Instructor Student Gender County Education EdCode
2015-11-01 "Arthur Read" a female "Elwood" BA 3
2015-11-01 "Arthur Read" b female "Bear Lake" PHD 5
2015-11-01 "Arthur Read" c male "Elwood" BA 3
2015-11-01 "Arthur Read" d female "Elwood" MA 4
2015-11-01 "Arthur Read" e male "Elwood" HS 1
2015-11-01 "Arthur Read" f female "Bear Lake" MA 4
2015-11-01 "Arthur Read" g male "Elwood" HS 1
2015-11-01 "Arthur Read" h female "Elwood" BA 3
2015-11-01 "Arthur Read" i female "Elwood" BA 3
2015-11-01 "Arthur Read" j female "Elwood" BA 3
2015-12-01 "Buster Baxter" k male "Elwood" MA 4
2015-12-01 "Buster Baxter" l male "Bear Lake" MA 4
2015-12-01 "Buster Baxter" m female "Elwood" AA 2
2015-12-01 "Buster Baxter" n male "Elwood" AA 2
2015-12-01 "Buster Baxter" o other "Elwood" BA 3
2015-12-01 "Buster Baxter" p female "Elwood" BA 3
2015-12-01 "Buster Baxter" q female "Bear Lake" PHD 5
"""))
### Treat Date, Instructor, Gender, County, and Education as category type
Data['Date'] = Data['Date'].astype('category')
Data['Instructor'] = Data['Instructor'].astype('category')
Data['Gender'] = Data['Gender'].astype('category')
Data['County'] = Data['County'].astype('category')
Data['Education'] = Data['Education'].astype('category')
print(Data.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 17 non-null category
1 Instructor 17 non-null category
2 Student 17 non-null object
3 Gender 17 non-null category
4 County 17 non-null category
5 Education 17 non-null category
6 EdCode 17 non-null int64
By default, the order of categories will be alphabetized.
print(Data['Education'].cat.categories)
Index(['AA', 'BA', 'HS', 'MA', 'PHD'], dtype='object')
The levels of categories can be re-ordered. This is useful for examining output and plots.
EducationLevels =["HS", "AA", "BA", "MA", "PHD" ]
Data['Education'] = Data['Education'].cat.reorder_categories(EducationLevels)
print(Data['Education'].cat.categories)
Index(['HS', 'AA', 'BA', 'MA', 'PHD'], dtype='object')
Optional code to assign numeric values to a variable based on a category variable
Here, EdCode was already included as a numeric equivalent to Education. But we could create a numeric variable to reflect the numeric equivalent of Education if it weren’t already included in the data frame.
Here, we'll create a new variable EdCode2 with the numeric assignment for Education.
This can be accomplished by creating a dictionary, here called Mapper.
Mapper = {"HS":1, "AA":2, "BA":3, "MA":4, "PHD":5}
Data["EdCode2"] = Data["Education"].replace(Mapper)
Data['EdCode2'] = Data['EdCode2'].astype('int64')
print(Data.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 17 non-null category
1 Instructor 17 non-null category
2 Student 17 non-null object
3 Gender 17 non-null category
4 County 17 non-null category
5 Education 17 non-null category
6 EdCode 17 non-null int64
7 EdCode2 17 non-null int64
This could also be accomplished by manually defining a new variable, EdCode3, based on the value of the variable Education.
Data['EdCode3'] = 9999   ### Placeholder value, to be overwritten below
Data.loc[Data['Education'] == 'HS', 'EdCode3'] = 1
Data.loc[Data['Education'] == 'AA', 'EdCode3'] = 2
Data.loc[Data['Education'] == 'BA', 'EdCode3'] = 3
Data.loc[Data['Education'] == 'MA', 'EdCode3'] = 4
Data.loc[Data['Education'] == 'PHD', 'EdCode3'] = 5
print(Data.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 17 non-null category
1 Instructor 17 non-null category
2 Student 17 non-null object
3 Gender 17 non-null category
4 County 17 non-null category
5 Education 17 non-null category
6 EdCode 17 non-null int64
7 EdCode2 17 non-null int64
8 EdCode3 17 non-null int64
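As a quick check, the hand-coded variable should match the original EdCode at every observation.

print((Data['EdCode'] == Data['EdCode3']).all())

True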
Optional code to change a category variable to an ordered category variable
Ordinal variables can be explicitly designated as ordered category variables.
Note that when the array for EducationOrdered is printed, the order is indicated with less than symbols, e.g. 'HS' < 'AA'.
EducationLevels =["HS", "AA", "BA", "MA", "PHD" ]
Data['EducationOrdered'] = Data['Education'].astype(CategoricalDtype(
categories=EducationLevels,
ordered=True))
print(Data['EducationOrdered'])
Name: EducationOrdered, dtype: category
Categories (5, object): ['HS' < 'AA' < 'BA' < 'MA' < 'PHD']
Dropping unused variables
### Drop EdCode2 and EdCode3
Data = Data.drop(columns=['EdCode2', 'EdCode3', 'EducationOrdered'])
print(Data.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 17 non-null category
1 Instructor 17 non-null category
2 Student 17 non-null object
3 Gender 17 non-null category
4 County 17 non-null category
5 Education 17 non-null category
6 EdCode 17 non-null int64
Statistics of location for ordinal data
The median can be used as a statistic for location for ordinal data. Other quantiles (25th percentile, and so on) also make sense for ordinal values. Usually, the mean and standard deviation shouldn’t be used.
Data['EdCode'].median()
3.0
### Remember that 3 meant Bachelor’s degree in our coding
Summary = Data['EdCode'].describe()
print(Summary)
count    17.000000
mean      3.117647
std       1.166316
min       1.000000
25%       3.000000
50%       3.000000
75%       4.000000
max       5.000000
Name: EdCode, dtype: float64

### Remember that 1 meant High school, 3 meant Bachelor’s,
### 4 meant Master’s, and 5 meant Ph.D.
### Remember to ignore the mean and standard deviation values for ordinal data.
Statistics of variation for ordinal data
The interquartile range is the difference between the 25th percentile and the 75th percentile. For ordinal data, the interquartile range itself may not make sense, since it requires taking a difference of ordinal values. However, identifying the 25th and 75th percentiles is useful, as 50% of observations fall within these values.
In this example, the 25th percentile corresponds to Bachelor’s, and the 75th percentile corresponds to Master’s.
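For example, the 25th and 75th percentiles of EdCode can be calculated with the quantile function.

Data['EdCode'].quantile([0.25, 0.75])

0.25    3.0
0.75    4.0

### Remember that 3 meant Bachelor’s and 4 meant Master’s in our coding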
Statistics for grouped ordinal data
Summary = Data[['Gender', 'EdCode']].groupby('Gender').describe()
print(Summary)
EdCode
count mean std min 25% 50% 75% max
Gender
female 10.0 3.5 0.971825 2.0 3.00 3.0 4.00 5.0
male 6.0 2.5 1.378405 1.0 1.25 2.5 3.75 4.0
other 1.0 3.0 NaN 3.0 3.00 3.0 3.00 3.0
### Remember that 1 meant High school, 3 meant Bachelor’s,
### 4 meant Master’s, and 5 meant Ph.D.
### Remember to ignore the mean and standard deviation values for ordinal data.
Summary = Data[['Gender', 'County', 'EdCode']].groupby(['Gender', 'County']).describe()
print(Summary)
EdCode
count mean std min 25% 50% 75% max
Gender County
female Bear Lake 3.0 4.666667 0.57735 4.0 4.5 5.0 5.0 5.0
Elwood 7.0 3.000000 0.57735 2.0 3.0 3.0 3.0 4.0
male Bear Lake 1.0 4.000000 NaN 4.0 4.0 4.0 4.0 4.0
Elwood 5.0 2.200000 1.30384 1.0 1.0 2.0 3.0 4.0
other Elwood 1.0 3.000000 NaN 3.0 3.0 3.0 3.0 3.0
### Remember that 1 meant High school, 3 meant Bachelor’s,
### 4 meant Master’s, and 5 meant Ph.D.
### Remember to ignore the mean and standard deviation values for ordinal data.
Statistics for ordinal data treated as nominal data
Ordinal data can also be summarized with the methods used for nominal data.
If the levels of the categorical variables are not already in the correct order, it is useful to reorder the variables.
EducationLevels =["HS", "AA", "BA", "MA", "PHD" ]
Data['Education'] = Data['Education'].cat.reorder_categories(EducationLevels)
print(Data['Education'].cat.categories)
Index(['HS', 'AA', 'BA', 'MA', 'PHD'], dtype='object')
Some basic information for categorical variables can be obtained with describe in pandas.
Summary = Data.describe(include='category')
print(Summary)
Date Instructor Gender County Education
count 17 17 17 17 17
unique 2 2 3 2 5
top 2015-11-01 Arthur Read female Elwood BA
freq 10 10 10 13 7
The counts for each level of a categorical variable can be found with the value_counts function.
Data['Education'].value_counts()
BA 7
MA 4
HS 2
AA 2
PHD 2
But note that the levels of Education are ordered by frequency in the output. We can impose the natural order of the categories with the reindex function, adding fill_value to supply the value for any level with no observations.
Data['Education'].value_counts().reindex(['HS','AA','BA','MA','PHD'], fill_value=0)
HS 2
AA 2
BA 7
MA 4
PHD 2
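The value_counts function can also report proportions directly with the normalize=True option.

Data['Education'].value_counts(normalize=True).reindex(['HS','AA','BA','MA','PHD'], fill_value=0)

HS     0.117647
AA     0.117647
BA     0.411765
MA     0.235294
PHD    0.117647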
The crosstab function in pandas can be used to tabulate counts for one categorical variable or for combinations of categorical variables.
For a single variable, the 'Count' in index='Count' is just a placeholder label for the single row of the table.
Table = pd.crosstab(columns=Data['Education'], index='Count')
print(Table)
Education HS AA BA MA PHD
row_0
Count 2 2 7 4 2
The normalize option can be used to report the proportions. Here, normalize='index' is used to report the proportion within rows.
Table = pd.crosstab(columns=Data['Education'], index='Count', normalize='index')
print(Table)
Education HS AA BA MA PHD
row_0
Count 0.117647 0.117647 0.411765 0.235294 0.117647
Grouped data
Table = pd.crosstab(columns=Data['Education'], index=Data['Gender'])
print(Table)
Education HS AA BA MA PHD
Gender
female 0 1 5 2 2
male 2 1 1 2 0
other 0 0 1 0 0
Note, here, that normalize='index' is used to report the proportion within rows. Other options are 'columns' and 'all' to determine the proportion within columns or within the table.
Table = pd.crosstab(columns=Data['Education'], index=Data['Gender'], normalize='index')
print(Table)
Education HS AA BA MA PHD
Gender
female 0.000000 0.100000 0.500000 0.200000 0.2
male 0.333333 0.166667 0.166667 0.333333 0.0
other 0.000000 0.000000 1.000000 0.000000 0.0
Two-way grouped data
When the resulting table would have three or more dimensions, pandas reports it in long format.
Table = Data[['County', 'Gender', 'Education']].groupby(['County', 'Gender']).value_counts()
print(Table)
County Gender Education
Bear Lake female PHD 2
MA 1
HS 0
AA 0
BA 0
male MA 1
PHD 0
BA 0
AA 0
HS 0
other PHD 0
MA 0
BA 0
AA 0
HS 0
Elwood female BA 5
MA 1
AA 1
PHD 0
HS 0
male HS 2
AA 1
BA 1
MA 1
PHD 0
other BA 1
HS 0
AA 0
MA 0
PHD 0
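If a two-dimensional layout is preferred, one index level of the long-format series can be pivoted into the columns with the unstack function in pandas. Here, the columns follow the category order of Education.

print(Table.unstack('Education'))

Education          HS  AA  BA  MA  PHD
County    Gender
Bear Lake female    0   0   0   1    2
          male      0   0   0   1    0
          other     0   0   0   0    0
Elwood    female    0   1   5   1    0
          male      2   1   1   1    0
          other     0   0   1   0    0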
Another approach would be to separate out the levels of one of the variables to produce multiple two-dimensional tables. Here, the data are subsetted to include only Bear Lake County. The dropna=False option retains the columns or rows with all 0 counts.
DataBear = Data.loc[Data['County'] == 'Bear Lake']
TableBear = pd.crosstab(columns=DataBear['Education'], index=DataBear['Gender'],
dropna=False)
print(TableBear)
Education HS AA BA MA PHD
Gender
female 0 0 0 1 2
male 0 0 0 1 0
other 0 0 0 0 0
DataElwood = Data.loc[Data['County'] == 'Elwood']
TableElwood = pd.crosstab(columns=DataElwood['Education'], index=DataElwood['Gender'],
dropna=False)
print(TableElwood)
Education HS AA BA MA PHD
Gender
female 0 1 5 1 0
male 2 1 1 1 0
other 0 0 1 0 0
Descriptive statistics for nominal data
Example of descriptive statistics for nominal data
This example will use the same Arthur and Buster data set used above. The analysis here is similar to that for treating ordinal data as nominal above; variants of some of these analyses can be found in that section.
Data = pd.read_table(sep="\\s+", filepath_or_buffer=io.StringIO("""
Date Instructor Student Gender County Education EdCode
2015-11-01 "Arthur Read" a female "Elwood" BA 3
2015-11-01 "Arthur Read" b female "Bear Lake" PHD 5
2015-11-01 "Arthur Read" c male "Elwood" BA 3
2015-11-01 "Arthur Read" d female "Elwood" MA 4
2015-11-01 "Arthur Read" e male "Elwood" HS 1
2015-11-01 "Arthur Read" f female "Bear Lake" MA 4
2015-11-01 "Arthur Read" g male "Elwood" HS 1
2015-11-01 "Arthur Read" h female "Elwood" BA 3
2015-11-01 "Arthur Read" i female "Elwood" BA 3
2015-11-01 "Arthur Read" j female "Elwood" BA 3
2015-12-01 "Buster Baxter" k male "Elwood" MA 4
2015-12-01 "Buster Baxter" l male "Bear Lake" MA 4
2015-12-01 "Buster Baxter" m female "Elwood" AA 2
2015-12-01 "Buster Baxter" n male "Elwood" AA 2
2015-12-01 "Buster Baxter" o other "Elwood" BA 3
2015-12-01 "Buster Baxter" p female "Elwood" BA 3
2015-12-01 "Buster Baxter" q female "Bear Lake" PHD 5
"""))
### Treat Date, Instructor, Gender, County, and Education as category type
Data['Date'] = Data['Date'].astype('category')
Data['Instructor'] = Data['Instructor'].astype('category')
Data['Gender'] = Data['Gender'].astype('category')
Data['County'] = Data['County'].astype('category')
Data['Education'] = Data['Education'].astype('category')
print(Data.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 17 non-null category
1 Instructor 17 non-null category
2 Student 17 non-null object
3 Gender 17 non-null category
4 County 17 non-null category
5 Education 17 non-null category
6 EdCode 17 non-null int64
dtypes: category(5), int64(1), object(1)
Some basic information for categorical variables can be obtained with describe in pandas.
Summary = Data.describe(include='category')
print(Summary)
Date Instructor Gender County Education
count 17 17 17 17 17
unique 2 2 3 2 5
top 2015-11-01 Arthur Read female Elwood BA
freq 10 10 10 13 7
One-sample data
Table = pd.crosstab(columns=Data['Gender'], index='Count')
print(Table)
Gender female male other
row_0
Count 10 6 1
### Counts of each level of Gender
Table = pd.crosstab(columns=Data['Gender'], index='Count', normalize='index')
print(Table)
Gender female male other
row_0
Count 0.588235 0.352941 0.058824
### Proportion of each level Gender for each row
One-way data
Table = pd.crosstab(columns=Data['Gender'], index=Data['Date'])
print(Table)
Gender female male other
Date
2015-11-01 7 3 0
2015-12-01 3 3 1
### Counts for Gender within each level of Date
TableProp = pd.crosstab(columns=Data['Gender'], index=Data['Date'], normalize='index')
print(TableProp)
Gender female male other
Date
2015-11-01 0.700000 0.300000 0.000000
2015-12-01 0.428571 0.428571 0.142857
### Proportion of each level of Gender for each row
sum(Table.sum())
17
### Sum of observations in the table
Table.sum(axis=1)
Date
2015-11-01 10
2015-12-01 7
### Sum of observations in each row of the table
Table.sum(axis=0)
Gender
female 10
male 6
other 1
### Sum of observations in each column of the table
Two-way data
Table = Data[['Date', 'County', 'Gender']].groupby(['Date', 'County']).value_counts()
print(Table)
### Note the placement of Date and County in the groupby call,
### and the resultant placement in the output.
Date County Gender
2015-11-01 Bear Lake female 2
male 0
other 0
Elwood female 5
male 3
other 0
2015-12-01 Bear Lake female 1
male 1
other 0
Elwood female 2
male 2
other 1
Levels for category variables
By default, Python will alphabetize the levels of a category variable. The order of the levels can be changed, for example, to change the order that groups are plotted in a plot, or which groups are at the top of the table.
Revisiting the Arthur and Buster example, we’ll explicitly change the category variables from type object to type category.
Data = pd.read_table(sep="\\s+", filepath_or_buffer=io.StringIO("""
Date Instructor Student Gender County Education EdCode
2015-11-01 "Arthur Read" a female "Elwood" BA 3
2015-11-01 "Arthur Read" b female "Bear Lake" PHD 5
2015-11-01 "Arthur Read" c male "Elwood" BA 3
2015-11-01 "Arthur Read" d female "Elwood" MA 4
2015-11-01 "Arthur Read" e male "Elwood" HS 1
2015-11-01 "Arthur Read" f female "Bear Lake" MA 4
2015-11-01 "Arthur Read" g male "Elwood" HS 1
2015-11-01 "Arthur Read" h female "Elwood" BA 3
2015-11-01 "Arthur Read" i female "Elwood" BA 3
2015-11-01 "Arthur Read" j female "Elwood" BA 3
2015-12-01 "Buster Baxter" k male "Elwood" MA 4
2015-12-01 "Buster Baxter" l male "Bear Lake" MA 4
2015-12-01 "Buster Baxter" m female "Elwood" AA 2
2015-12-01 "Buster Baxter" n male "Elwood" AA 2
2015-12-01 "Buster Baxter" o other "Elwood" BA 3
2015-12-01 "Buster Baxter" p female "Elwood" BA 3
2015-12-01 "Buster Baxter" q female "Bear Lake" PHD 5
"""))
### Treat Date, Instructor, Gender, County, and Education as category type
Data['Date'] = Data['Date'].astype('category')
Data['Instructor'] = Data['Instructor'].astype('category')
Data['Gender'] = Data['Gender'].astype('category')
Data['County'] = Data['County'].astype('category')
Data['Education'] = Data['Education'].astype('category')
print(Data.info())
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 17 non-null category
1 Instructor 17 non-null category
2 Student 17 non-null object
3 Gender 17 non-null category
4 County 17 non-null category
5 Education 17 non-null category
6 EdCode 17 non-null int64
Below, note that the levels of the category variables were alphabetized by default. That is, even though Elwood was found in the data before Bear Lake, Python treats Bear Lake as the first level of the variable County.
print(Data['County'].cat.categories)
Index(['Bear Lake', 'Elwood'], dtype='object')
We can re-order the category levels in an order of our choosing.
print(Data['Education'].cat.categories)
Index(['AA', 'BA', 'HS', 'MA', 'PHD'], dtype='object')
EducationLevels = ["HS", "AA", "BA", "MA", "PHD"]
Data['Education'] = Data['Education'].cat.reorder_categories(EducationLevels)
print(Data['Education'].cat.categories)
Index(['HS', 'AA', 'BA', 'MA', 'PHD'], dtype='object')
We can also indicate a category variable should be treated as ordered categories.
EducationLevels =["HS", "AA", "BA", "MA", "PHD" ]
Data['EducationOrdered'] = Data['Education'].astype(CategoricalDtype(
categories=EducationLevels,
ordered=True))
print(Data['EducationOrdered'])
Name: EducationOrdered, dtype: category
Categories (5, object): ['HS' < 'AA' < 'BA' < 'MA' < 'PHD']
Note that in the actions above, we are not changing the order of the rows in the data frame, simply which level is treated internally as the first category, and so on.
Optional analyses
Robust estimators: trimmed mean and Winsorized mean
In this example, the means and medians for Team A and Team B are identical (mean of 70 and median of 80). However, the trimmed mean for Team A is less than for Team B (77 vs. 80), and the Winsorized mean for A is less than for Team B (75 vs. 78). According to either the trimmed mean method or the Winsorized mean method, Team B had a higher score when extreme observations were removed or replaced.
TeamA = pd.Series([100, 90, 80, 60, 20])
TeamB = pd.Series([80, 80, 80, 80, 30])
print(np.mean(TeamA))
70.0
print(np.median(TeamA))
80.0
The trimmed mean can be calculated with the trim_mean function in SciPy stats. Here, both the largest 20% and smallest 20% of observations are trimmed from the array.
stats.trim_mean(TeamA, proportiontocut = 0.20)
76.66666666666667
The trimmed array can be displayed by subsetting the array.
TrimA = TeamA[(TeamA > np.percentile(TeamA, 20)) & (TeamA < np.percentile(TeamA, 80))]
print(TrimA)
1 90
2 80
3 60
print(np.mean(TrimA))
76.66666666666667
### The trimmed mean at 20%.
The clip function in NumPy winsorizes the data. Here, the lowest 20% of observations are replaced with the value of the 20th percentile, and the highest 20% of observations are replaced with the value of the 80th percentile. Different methods can be used for the calculation of the percentiles.
WinsorA = np.clip(TeamA, np.percentile(TeamA, 20), np.percentile(TeamA, 80))
print(WinsorA)
0 92
1 90
2 80
3 60
4 52
### This uses the same method for percentiles as does R by default.
print(np.mean(WinsorA))
74.8
### The winsorized mean at 20%.
scipy.stats.mstats.winsorize will winsorize the data with simpler syntax, but it doesn’t offer options for how the percentiles are calculated. So, here the result is different. For larger continuous samples, there won’t usually be large differences in the results of the two methods.
WinsorAAlt = mstats.winsorize(TeamA, limits=[0.2, 0.2])
print(WinsorAAlt)
[90 90 80 60 60]
print(np.mean(WinsorAAlt))
76.0
The following code replicates the analysis for Team B.
TeamB = pd.Series([80, 80, 80, 80, 30])
print(np.median(TeamB))
80.0
print(np.mean(TeamB))
70.0
stats.trim_mean(TeamB, proportiontocut = 0.20)
80.0
### Trimmed mean for Team B at 20%
WinsorB = np.clip(TeamB, np.percentile(TeamB, 20), np.percentile(TeamB, 80))
print(WinsorB)
0 80
1 80
2 80
3 80
4 70
print(np.mean(WinsorB))
78.0
### Winsorized mean for Team B at 20%
Geometric mean
The geometric mean can be calculated with the gmean function in SciPy stats. It can also be calculated with nested log, mean, and exp functions in NumPy.
Bacteria example
Bacteria = pd.Series([20, 40, 50, 60, 100, 120, 150, 200, 1000])
print(stats.gmean(Bacteria))
98.38886970015734
np.exp(np.mean(np.log(Bacteria)))
98.38886970015734
Annual return example
Note that for the annual return example, 1 must be added to all the values, and then subtracted at the end to return the result to a proportion of return.
Return = pd.Series([-0.80, 0.20, 0.30, 0.30, 0.20, 0.30])
print(stats.gmean(Return+1)-1)
-0.07344666385976306
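For comparison, the arithmetic mean of these returns is positive, which would misleadingly suggest an overall gain, even though an investment with these returns loses money overall.

print(Return.mean())

0.08333333333333333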
Harmonic mean
The harmonic mean can be calculated with the hmean function in SciPy stats. It can also be calculated by taking the inverse of each value, averaging these values, and reporting the inverse of this result.
Speed = pd.Series([20, 40, 50, 60, 100, 120, 150, 200, 1000])
print(stats.hmean(Speed))
63.08411214953271
print(1/np.mean(1/Speed))
63.08411214953271