Icy Hot Power
Authors: Penelope King, Garvey Li
Introduction
The Question and the Dataset
Power outages pose as a hinderance to the lives of many people, especially since many modern inventions rely on the use of electricity in order to function. Power outages can be distruptions to many of our modern conveniences as well, in areas such as transportation, communication, medical, and food. Keeping track and noticing trends in power outages is especially important to help become more knowledgeable about how to predict and prevent power outages to help the quality of lives of many. Specifically for our focus for this project, we are focusing on noticing trends about power outages to help gain a greater understanding of how power outages may act depending on differing factors of where a person may live in the United States.
Is the distribution of outages across U.S. climate regions the same in both winter and summer?
Understanding how the distribution of power outages in climate regions may differ between seasons of extreme weather (summer and winter) is important to better predicting outages and when to expect them. It also helps us better understand the nature of power outages in the United States, and can help us tell if differences in weather affect power outages.
This EDA project is based on data collected on major power outage events that occurred in the United States. This dataset consists of 1534 rows, where each represents a power outage occurance in the United States. The columns in this dataset that are relevant to our question are titled as:
MONTH
: An integer representing the month that the power outage occured in
YEAR
: The year in which the outage occured as an integer
U.S_STATE
: A string representing the name of the state that the outage occured in
POSTAL.CODE
: A string representation of the postal code of the state (e.g. California’s postal code is CA)
CLIMATE.REGION
: The common US climate region that the location of the power outage is a part of (e.g. Midwest)
CLIMATE.CATEGORY
: The category of the climate at the time (cold, normal, or warm)
CAUSE.CATEGORY
: The cause of the power outage (severe weather, intentional attack, system operability disruption, public appeal, equipment failure, fuel supply emergency, islanding)
CAUSE.CATEGORY.DETAIL
: Additional details to the cause for the power outage
OUTAGE.START.TIME
: The start time of the power outage (HH:MM:SS
)
Cleaning and EDA
Data Cleaning
The first step we took in data cleaning was seeing which columns contained any missing values. We found that some of the columns contained a large number of null values, such as HURRICANE.NAME
, CAUSE.CATEGORY.DETAIL
, DEMAND.LOWW.MW
, and CUSTOMERS.AFFECTED
. However, since our question didn’t require any data from these columns, we decided to leave them as null values.
In order to group the outages by seasons, we had to create a new column, SEASONS
, based off the column MONTH
, which contained values winter
(Dec-Feb), spring
(Mar-May), summer
(Jun-Aug), and fall
(Sept-Nov). As it turns out the MONTH
and OUTAGE.START.DATE
categories were missing for 9 of the rows, which meant we had to find another way to determine the season for these outages.
For one of the rows, we noticed that one of the columns, CAUSE.CATEGORY.DETAIL
, contained a value winter storm
, which allowed us to infer that that outage occurred in the winter.
For the remaining 8 rows that didn’t have any explicit information regarding time of year, we decided to use probabilistic imputation based off the proportion of seasons for each cause in CAUSE.CATEGORY
(from rows whose MONTH
or OUTAGE.START.DATE
was not missing).
In addition to this, Alaska and Hawaii are not assigned any U.S. Climate Regions, so we assigned the climate regions Northern Temperate
and Tropics
to them, respectively.
occ.head()
YEAR | MONTH | U.S._STATE | POSTAL.CODE | NERC.REGION | CLIMATE.REGION | CLIMATE.CATEGORY | CAUSE.CATEGORY | CAUSE.CATEGORY.DETAIL | OUTAGE.START.DATE | OUTAGE.RESTORATION.DATE | HURRICANE.NAMES | SEASON | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2011 | 7 | Minnesota | MN | MRO | East North Central | normal | severe weather | nan | 2011-07-01 00:00:00 | 2011-07-03 00:00:00 | nan | summer |
1 | 2014 | 5 | Minnesota | MN | MRO | East North Central | normal | intentional attack | vandalism | 2014-05-11 00:00:00 | 2014-05-11 00:00:00 | nan | spring |
2 | 2010 | 10 | Minnesota | MN | MRO | East North Central | cold | severe weather | heavy wind | 2010-10-26 00:00:00 | 2010-10-28 00:00:00 | nan | fall |
3 | 2012 | 6 | Minnesota | MN | MRO | East North Central | normal | severe weather | thunderstorm | 2012-06-19 00:00:00 | 2012-06-20 00:00:00 | nan | summer |
4 | 2015 | 7 | Minnesota | MN | MRO | East North Central | warm | severe weather | nan | 2015-07-18 00:00:00 | 2015-07-19 00:00:00 | nan | summer |
Univariate Analysis
Focusing on univariate analysis, we created a graph to see what the distribution of seasons looked like in our DataFrame via a bar graph. Looking at our graph, we noticed that there was a high number of summer and winter power outages that occurred in our dataset. It seems that power outages recorded happen a lot during the winter and summer months.
We also looked at the distribution for different causes for power outages in our dataset. Looking at our bar graph, a high count of our data is represented in severe weather. The majority of the data recorded in our dataset seems to be power outages caused by severe weather. The second highest cause of power outages seems to be intentional attacks.
Bivariate Analysis
After looking at different trends in individual columns of the dataset, we moved onto bivariate analysis to see what interactions between columns were happening in our data. Specifically we looked at what the distribution of seasons were for each cause for power outage in CAUSE.CATEGORY. We noticed that summer was highly represented in public appeal and equipment failure. Winter was the highest proportion of outages caused by fuel supply emergencies.
We also wanted to see which regions of the US had more winter caused accidents rather than summer caused accidents. To achieve this we created a choropleth which showed the proportion of winter caused power outages out of the total number of winter and summer power outages per state. We did this by filtering the data frame to have only winter and summer seasons and then grouping by postal code. Then we applied a lambda function to get the proportion of winter power outages out of winter and summer power outages. The darker the color on the map, the more winter power outages a state has compared to summer power outages. Before making the graph we thought that areas with more extreme weather in the winter may be expressing higher proportions of winter power outages. Looking at the graph, areas of the US known to have worse winters do seem to have darker proportions of winter caused power outages on our choropleth, but this is not an absolute rule.
Interesting Aggregates
Aggregating the data in different ways in table format also allows us to have a better understanding of our data. Specifically we wanted to see what were the most common occurrences for power outages by CLIMATE.REGION. We looked at this to better understand if climate regions had a large difference in the representation of different months, seasons, states, and causes. Looking at the grouped table we created, we noticed that the modes of all the climate regions were limited to summer and winter seasons. Furthermore, the most represented cause categories in these various climate categories were severe weather, intentional attack, and islanding. This makes sense, since summer and winter are known to have more extreme weather than spring and fall.
CLIMATE.REGION | YEAR | MONTH | SEASON | CAUSE.CATEGORY |
---|---|---|---|---|
Central | 2011 | 6.0 | summer | severe weather |
East North Central | 2014 | [6. 7.] | summer | severe weather |
North Temperate Zone | 2000 | [] | summer | equipment failure |
Northeast | 2011 | 10.0 | summer | severe weather |
Northwest | 2011 | 12.0 | winter | intentional attack |
South | 2011 | 8.0 | summer | severe weather |
Southeast | 2004 | 8.0 | summer | severe weather |
Southwest | 2013 | 2.0 | winter | intentional attack |
Tropics | 2006 | 10.0 | fall | severe weather |
West | 2015 | 7.0 | winter | severe weather |
West North Central | [2009 2010 2011 2013] | 6.0 | summer | islanding |
We also created a pivot table that looked at the distribution between summer and winter outages and the respective climate region that the outage occurred in. There seems to be some differences in some of the climate regions when it comes to if summer or winter has more power outages. We know that the US is known to have various climate regions with very different seasonal experiences. Summer and winter in particular are both seasons with extreme weather, but differ greatly between which area of the US a person is in. Would differences in weather affect the number of power outages between the two seasons?
CLIMATE.REGION | summer | winter |
---|---|---|
Central | 87 | 45 |
East North Central | 55 | 27 |
North Temperate Zone | 1 | nan |
Northeast | 112 | 83 |
Northwest | 38 | 47 |
South | 87 | 46 |
Southeast | 55 | 43 |
Southwest | 25 | 29 |
Tropics | 1 | 1 |
West | 61 | 62 |
West North Central | 9 | 2 |
Assessment of Missingness
NMAR Analysis
The column CLIMATE.REGION is NMAR. All outages in Alaska and Hawaii are missing CLIMATE.REGION because they are not a part of the continental United States, and therefore have no U.S Climate Region. So, the null values in the CLIMATE.REGION column are dependent on the fact that Alaska and Hawaii have no U.S. Climate Region(We believe that it is not missing by design since the CLIMATE.REGION of Hawaii or Alaska cannot be inferred from any other columns, since the value does not exist).
Missingness Dependency: MAR vs MCAR Imputation Tests
For our analysis of missingness, we decided to look at the missingness of CAUSE.CATEGORY.DETAIL
. 30.7% of this column is missing values, so its missingness is not non-trivial.
YEAR
The first column we decided to analyze the missingness of CAUSE.CATEGORY.DETAIL
on, was YEAR
, and our hypotheses are as follows:
Null Hypothesis: The missingness of CAUSE.CATEGORY.DETAIL
does not depend on YEAR
with a significance level of 0.05.
Alternative Hypothesis: The missingness of CAUSE.CATEGORY.DETAIL
does depend on YEAR
with a significance level of 0.05.
We calculated the TVD of the proportions of missing and not missing details per year and got an observed TVD of 0.2775. Looking at two distributions for missing and not missing, it seems as though they are extremely different. Which might suggest that the missingness of CAUSE.CATEGORY.DETAIL
relies on the values of YEAR
.
After running 100,000 permutations on the CAUSE.CATEGORY.DETAIL
column and computing the TVDs between missing and not missing for each YEAR
, we got a p-value of 0, meaning that none of the permutations resulted in a TVD as large as the observed TVD.
Since our p-value (0.0) is lower than our significance level (0.05), we reject the null hypothesis, so it is possible that the missingness of CAUSE.CATEGORY.DETAIL
depends on YEAR
so the missingness is MAR in relation to CAUSE.CATEGORY.DETAIL
.
OUTAGE.START.TIME
The next column we decided to analyze the missingness of CAUSE.CATEGORY.DETAIL
on, was OUTAGE.START.TIME
, and our hypotheses are as follows:
Null Hypothesis: The missingness of CAUSE.CATEGORY.DETAIL
does not depend on OUTAGE.START.TIME
Alternative Hypothesis: The missingness of CAUSE.CATEGORY.DETAIL
does depend on OUTAGE.START.TIME
Since OUTAGE.START.TIME
is in the format HH:MM:SS
, we decided to convert the times in a unit of seconds
since 00:00:00
. From there, we split up the data into to Series: One where CAUSE.CATEGORY.DETAIL
was missing and one where it wan’t missing. We then ran a K-S 2 sample test(since the distributions were quantitative/numeric) on the two Series of data and got the following CDFs.
The p-value from this K-S 2 sample test was 0.1049, which is greater than our significance level of 0.05. Rherefore we fail to reject the null – it is possible that the missingness of CAUSE.CATEGORY.DETAIL
does not depend on OUTAGE.START.TIME
, so the missing values in the column CAUSE.CATEGORY.DETAIL
is MCAR in relation to OUTAGE.START.TIME
.
Hypothesis Test
Finally for the hypothesis test. During the bivariate analysis we noticed that there seemed to be a difference between if a power outage would be more likely to occur during summer or winter depending on the climate region a person lived in. Could this be due to random chance? Specifically, we are testing the question: Does climate region power outage distribution across winter and summer seasons differ?
We set up our experiment like this:
Hypotheses
Null Hypothesis:
In the population, the distribution of power outages for US climate regions are the same between summer and winter, and the observed differences in our sample are due to random chance.
Alternative Hypothesis:
In the population, the distribution of power outages of US climate region groups are different for summer and winter.
Testing
Our test statistic was total variation distance (TVD) because we were looking at two different categorical distributions and how they differed from each other. We set our significance level to the standard level of 0.05. That is, we are aiming for a 95% confidence level. For this experiment we used a permutation test because we are testing to see if two different sample distributions come from the same population distribution.
Results
Aftering running the permutation test, our p-value came to be around 0.0016.
Our p-value was less than our set significance level (0.0016 < 0.05), so we reject the null in favor of the alternative hypothesis. It is possible that in the US, a person can experience differing levels of summer or winter power outages based on what U.S. climate region they live in.