Icy Hot Power

Authors: Penelope King, Garvey Li

Introduction

The Question and the Dataset

Power outages are a hindrance to the lives of many people, since so many modern inventions rely on electricity to function. They disrupt modern conveniences in areas such as transportation, communication, medicine, and food. Tracking power outages and noticing trends in them helps us learn how to predict and prevent outages, which improves quality of life for many people. In this project, we focus on trends that show how power outages behave depending on where a person lives in the United States.

Is the distribution of outages across U.S. climate regions the same in both winter and summer?

Understanding how the distribution of power outages across climate regions differs between the seasons of extreme weather (summer and winter) is important for predicting outages and knowing when to expect them. It also helps us understand the nature of power outages in the United States and can tell us whether differences in weather affect power outages.

This EDA project is based on data collected on major power outage events that occurred in the United States. The dataset consists of 1534 rows, where each row represents one power outage occurrence in the United States. The columns relevant to our question are:

MONTH: An integer representing the month in which the power outage occurred
YEAR: The year in which the outage occurred, as an integer
U.S._STATE: A string representing the name of the state in which the outage occurred
POSTAL.CODE: A string representation of the postal code of the state (e.g. California’s postal code is CA)
CLIMATE.REGION: The common US climate region that the location of the power outage is a part of (e.g. Midwest)
CLIMATE.CATEGORY: The category of the climate at the time (cold, normal, or warm)
CAUSE.CATEGORY: The cause of the power outage (severe weather, intentional attack, system operability disruption, public appeal, equipment failure, fuel supply emergency, islanding)
CAUSE.CATEGORY.DETAIL: Additional details on the cause of the power outage
OUTAGE.START.TIME: The start time of the power outage (HH:MM:SS)

Cleaning and EDA

Data Cleaning

The first step we took in data cleaning was to check which columns contained missing values. We found that some columns contained a large number of null values, such as HURRICANE.NAMES, CAUSE.CATEGORY.DETAIL, DEMAND.LOSS.MW, and CUSTOMERS.AFFECTED. However, since our question didn’t require any data from these columns, we decided to leave them as null values.

In order to group the outages by season, we created a new column, SEASON, based on the column MONTH, with the values winter (Dec-Feb), spring (Mar-May), summer (Jun-Aug), and fall (Sep-Nov). As it turns out, the MONTH and OUTAGE.START.DATE columns were missing for 9 of the rows, which meant we had to find another way to determine the season for these outages.
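The month-to-season mapping described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the helper name `month_to_season` and the toy DataFrame are assumptions, and rows with a missing MONTH are left empty so that a later step can fill them.

```python
import pandas as pd

# Hypothetical helper: map a month number (1-12) to a season label.
def month_to_season(month):
    if pd.isna(month):
        return None  # leave missing months for a later imputation step
    month = int(month)
    if month in (12, 1, 2):
        return "winter"
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    return "fall"  # months 9, 10, 11

# Toy data standing in for the real outage table.
occ = pd.DataFrame({"MONTH": [7, 5, 10, 12, None]})
occ["SEASON"] = occ["MONTH"].apply(month_to_season)
print(occ)
```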

For one of these rows, the CAUSE.CATEGORY.DETAIL column contained the value winter storm, which allowed us to infer that the outage occurred in the winter.

For the remaining 8 rows, which had no explicit information about the time of year, we used probabilistic imputation based on the proportion of seasons for each cause in CAUSE.CATEGORY (computed from rows whose MONTH or OUTAGE.START.DATE was not missing).
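One way to implement this kind of conditional probabilistic imputation is sketched below. The data are made up, the helper `impute_season` is a hypothetical name, and the random seed is fixed only to make the sketch reproducible. Within each cause category, missing seasons are drawn at random with probabilities equal to the observed season proportions for that cause.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, for reproducibility of the sketch only

# Toy data: two causes, each with one missing SEASON.
occ = pd.DataFrame({
    "CAUSE.CATEGORY": ["severe weather"] * 4 + ["intentional attack"] * 3,
    "SEASON": ["summer", "summer", "winter", None, "winter", "spring", None],
})

def impute_season(group):
    """Fill missing seasons by sampling from the observed seasons of this cause."""
    missing = group.isna()
    filled = group.copy()
    if missing.any():
        probs = group.dropna().value_counts(normalize=True)
        draws = rng.choice(probs.index.to_numpy(), size=int(missing.sum()),
                           p=probs.to_numpy())
        filled[missing] = draws
    return filled

occ["SEASON"] = occ.groupby("CAUSE.CATEGORY")["SEASON"].transform(impute_season)
print(occ)
```

Sampling within each cause, instead of from the overall season distribution, preserves the relationship between cause and season that the imputation relies on.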

In addition, Alaska and Hawaii are not assigned any U.S. Climate Region, so we assigned them the climate regions North Temperate Zone and Tropics, respectively.

occ.head()

	YEAR	MONTH	U.S._STATE	POSTAL.CODE	NERC.REGION	CLIMATE.REGION	CLIMATE.CATEGORY	CAUSE.CATEGORY	CAUSE.CATEGORY.DETAIL	OUTAGE.START.DATE	OUTAGE.RESTORATION.DATE	HURRICANE.NAMES	SEASON
0	2011	7	Minnesota	MN	MRO	East North Central	normal	severe weather	nan	2011-07-01 00:00:00	2011-07-03 00:00:00	nan	summer
1	2014	5	Minnesota	MN	MRO	East North Central	normal	intentional attack	vandalism	2014-05-11 00:00:00	2014-05-11 00:00:00	nan	spring
2	2010	10	Minnesota	MN	MRO	East North Central	cold	severe weather	heavy wind	2010-10-26 00:00:00	2010-10-28 00:00:00	nan	fall
3	2012	6	Minnesota	MN	MRO	East North Central	normal	severe weather	thunderstorm	2012-06-19 00:00:00	2012-06-20 00:00:00	nan	summer
4	2015	7	Minnesota	MN	MRO	East North Central	warm	severe weather	nan	2015-07-18 00:00:00	2015-07-19 00:00:00	nan	summer

Univariate Analysis

For univariate analysis, we created a bar graph of the distribution of seasons in our DataFrame. The graph shows that a high number of the power outages recorded in our dataset occurred during the summer and winter months.

We also looked at the distribution of the different causes of power outages in our dataset. In our bar graph, the majority of the recorded outages were caused by severe weather. The second most common cause was intentional attacks.

Bivariate Analysis

After looking at trends in individual columns of the dataset, we moved on to bivariate analysis to see how columns interact in our data. Specifically, we looked at the distribution of seasons for each cause of power outage in CAUSE.CATEGORY. We noticed that summer was highly represented in public appeal and equipment failure, while winter made up the highest proportion of outages caused by fuel supply emergencies.

We also wanted to see which regions of the US had more winter outages than summer outages. To do this, we created a choropleth showing, for each state, the proportion of winter power outages out of the total number of winter and summer power outages. We filtered the DataFrame to only the winter and summer seasons, grouped by postal code, and applied a lambda function to get the proportion of winter outages among winter and summer outages. The darker the color on the map, the more winter power outages a state has compared to summer power outages. Before making the graph, we expected that areas with more extreme winter weather would show higher proportions of winter power outages. Looking at the graph, areas of the US known to have worse winters do tend to have darker shading on our choropleth, but this is not an absolute rule.
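The filter, group, and lambda steps described above might look roughly like this. The data are a toy example and the output column name `winter_prop` is an assumption.

```python
import pandas as pd

# Toy data standing in for the cleaned outage table.
occ = pd.DataFrame({
    "POSTAL.CODE": ["MN", "MN", "MN", "CA", "CA", "TX"],
    "SEASON": ["summer", "winter", "spring", "winter", "winter", "summer"],
})

# Keep only winter and summer outages.
extreme = occ[occ["SEASON"].isin(["winter", "summer"])]

# Per state: share of winter outages among winter + summer outages.
winter_prop = (
    extreme.groupby("POSTAL.CODE")["SEASON"]
    .apply(lambda s: (s == "winter").mean())
    .rename("winter_prop")
    .reset_index()
)
print(winter_prop)
```

The resulting table has one row per state and can be passed to a mapping library. The report does not say which library was used, so any plotting call would be a further assumption.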

Interesting Aggregates

Aggregating the data in table form also gives us a better understanding of it. Specifically, we wanted to see the most common values for power outages within each CLIMATE.REGION, to understand whether climate regions differ greatly in which years, months, seasons, and causes are most represented. In the grouped table we created, the most common season was summer or winter for every climate region except Tropics, where it was fall. This makes sense, since summer and winter are known to have more extreme weather than spring and fall. The most represented cause categories across the climate regions were severe weather, intentional attack, islanding, and equipment failure.

CLIMATE.REGION	YEAR	MONTH	SEASON	CAUSE.CATEGORY
Central	2011	6.0	summer	severe weather
East North Central	2014	[6. 7.]	summer	severe weather
North Temperate Zone	2000	[]	summer	equipment failure
Northeast	2011	10.0	summer	severe weather
Northwest	2011	12.0	winter	intentional attack
South	2011	8.0	summer	severe weather
Southeast	2004	8.0	summer	severe weather
Southwest	2013	2.0	winter	intentional attack
Tropics	2006	10.0	fall	severe weather
West	2015	7.0	winter	severe weather
West North Central	[2009 2010 2011 2013]	6.0	summer	islanding
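A table of per-region modes like the one above can be produced with a grouped aggregation. The sketch below uses made-up rows and a hypothetical helper, `mode_label`, that joins tied modes into one string. The real table displays ties as arrays, so the project's code likely differed in that detail.

```python
import pandas as pd

# Toy data standing in for the cleaned outage table.
occ = pd.DataFrame({
    "CLIMATE.REGION": ["Central"] * 3 + ["West"] * 3 + ["Tropics"] * 2,
    "SEASON": ["summer", "summer", "winter",
               "winter", "winter", "fall",
               "fall", "summer"],
    "CAUSE.CATEGORY": ["severe weather", "severe weather", "islanding",
                       "intentional attack", "severe weather", "intentional attack",
                       "severe weather", "severe weather"],
})

def mode_label(s):
    """Most common value(s) of a column; ties are joined with commas."""
    return ", ".join(str(v) for v in s.mode())

modes = occ.groupby("CLIMATE.REGION")[["SEASON", "CAUSE.CATEGORY"]].agg(mode_label)
print(modes)
```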

We also created a pivot table of the number of summer and winter outages in each climate region. Climate regions seem to differ in whether summer or winter has more power outages. The US has various climate regions with very different seasonal experiences. Summer and winter in particular are both seasons of extreme weather, but what that weather looks like differs greatly depending on which area of the US a person is in. Would differences in weather affect the number of power outages between the two seasons?

CLIMATE.REGION	summer	winter
Central	87	45
East North Central	55	27
North Temperate Zone	1	nan
Northeast	112	83
Northwest	38	47
South	87	46
Southeast	55	43
Southwest	25	29
Tropics	1	1
West	61	62
West North Central	9	2
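A pivot table of this shape can be built with `DataFrame.pivot_table` and `aggfunc="size"`, which counts rows per (region, season) pair and leaves NaN where a combination never occurs, as in the North Temperate Zone row above. The rows below are toy data, not the real dataset.

```python
import pandas as pd

# Toy data standing in for the cleaned outage table.
occ = pd.DataFrame({
    "YEAR": [2011, 2012, 2013, 2000, 2015, 2014],
    "CLIMATE.REGION": ["Central", "Central", "Central",
                       "North Temperate Zone", "West", "West"],
    "SEASON": ["summer", "summer", "winter", "summer", "winter", "spring"],
})

# Keep only the two seasons of interest, then count outages per cell.
extreme = occ[occ["SEASON"].isin(["summer", "winter"])]
counts = extreme.pivot_table(index="CLIMATE.REGION", columns="SEASON", aggfunc="size")

# Turn each season's counts into a distribution over climate regions.
proportions = counts.fillna(0) / counts.sum()
print(counts)
print(proportions)
```

Normalizing each column gives the two categorical distributions (regions within summer, regions within winter) that a total variation distance can compare.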

Assessment of Missingness

NMAR Analysis

The column CLIMATE.REGION is NMAR. All outages in Alaska and Hawaii are missing CLIMATE.REGION because these states are not part of the contiguous United States and therefore have no U.S. Climate Region. The null values in the CLIMATE.REGION column thus depend on the fact that Alaska and Hawaii have no U.S. Climate Region. (We believe that it is not missing by design, since the CLIMATE.REGION of Hawaii or Alaska cannot be inferred from any other column: the value does not exist.)

Missingness Dependency: MAR vs MCAR Imputation Tests

For our analysis of missingness, we decided to look at the missingness of CAUSE.CATEGORY.DETAIL. 30.7% of the values in this column are missing, so its missingness is non-trivial.

YEAR

The first column we tested the missingness of CAUSE.CATEGORY.DETAIL against was YEAR, at a significance level of 0.05. Our hypotheses are as follows:

Null Hypothesis: The missingness of CAUSE.CATEGORY.DETAIL does not depend on YEAR.

Alternative Hypothesis: The missingness of CAUSE.CATEGORY.DETAIL does depend on YEAR.

We calculated the TVD between the distribution of YEAR for rows with missing details and the distribution of YEAR for rows with non-missing details, and got an observed TVD of 0.2775. The two distributions look extremely different, which suggests that the missingness of CAUSE.CATEGORY.DETAIL may depend on the values of YEAR.

After running 100,000 permutations on the CAUSE.CATEGORY.DETAIL column and computing the TVDs between missing and not missing for each YEAR, we got a p-value of 0, meaning that none of the permutations resulted in a TVD as large as the observed TVD.
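The sketch below shows one way such a permutation test can be set up. It runs on synthetic data in which missingness is deliberately made more likely in early years, so none of the numbers correspond to the real dataset. The column name `detail_missing` and the helper `tvd_between_groups` are assumptions, and far fewer repetitions are used than the 100,000 reported above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: details are more often missing in the early years.
years = rng.integers(2000, 2005, size=200)
missing = rng.random(200) < np.where(years < 2002, 0.6, 0.1)
df = pd.DataFrame({"YEAR": years, "detail_missing": missing})

def tvd_between_groups(data, group_col, cat_col):
    """TVD between the distributions of cat_col in the two groups of group_col."""
    dist = pd.crosstab(data[cat_col], data[group_col], normalize="columns")
    return dist.diff(axis=1).iloc[:, -1].abs().sum() / 2

observed = tvd_between_groups(df, "detail_missing", "YEAR")

n_reps = 500  # kept small for the sketch
simulated = np.empty(n_reps)
shuffled = df.copy()
for i in range(n_reps):
    # Shuffle the missingness labels to break any link with YEAR.
    shuffled["detail_missing"] = rng.permutation(df["detail_missing"].to_numpy())
    simulated[i] = tvd_between_groups(shuffled, "detail_missing", "YEAR")

# Share of shuffles with a TVD at least as large as the observed one.
p_value = (simulated >= observed).mean()
print(observed, p_value)
```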

Since our p-value (0.0) is lower than our significance level (0.05), we reject the null hypothesis. The missingness of CAUSE.CATEGORY.DETAIL likely depends on YEAR, so the missingness of CAUSE.CATEGORY.DETAIL is MAR in relation to YEAR.

OUTAGE.START.TIME

The next column we tested the missingness of CAUSE.CATEGORY.DETAIL against was OUTAGE.START.TIME. Our hypotheses are as follows:

Null Hypothesis: The missingness of CAUSE.CATEGORY.DETAIL does not depend on OUTAGE.START.TIME.

Alternative Hypothesis: The missingness of CAUSE.CATEGORY.DETAIL does depend on OUTAGE.START.TIME.

Since OUTAGE.START.TIME is in the format HH:MM:SS, we converted the times to seconds since 00:00:00. From there, we split the data into two Series: one where CAUSE.CATEGORY.DETAIL was missing and one where it wasn’t. We then ran a two-sample K-S test (since the distributions were quantitative/numeric) on the two Series and got the following CDFs.

The p-value from this two-sample K-S test was 0.1049, which is greater than our significance level of 0.05. Therefore, we fail to reject the null: it is possible that the missingness of CAUSE.CATEGORY.DETAIL does not depend on OUTAGE.START.TIME, so the missing values in the column CAUSE.CATEGORY.DETAIL are consistent with being MCAR in relation to OUTAGE.START.TIME.
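The time conversion and the two-sample K-S test can be sketched with `pandas.to_timedelta` and `scipy.stats.ks_2samp`. The six rows below are made up for illustration, and the report does not say which functions were actually used, so this is only one plausible implementation.

```python
import pandas as pd
from scipy.stats import ks_2samp

# Toy data standing in for the real outage table.
occ = pd.DataFrame({
    "OUTAGE.START.TIME": ["00:00:30", "06:15:00", "12:00:00",
                          "18:45:10", "23:59:59", "08:30:00"],
    "CAUSE.CATEGORY.DETAIL": ["vandalism", None, "heavy wind",
                              None, "thunderstorm", None],
})

# Convert HH:MM:SS strings to seconds since 00:00:00.
seconds = pd.to_timedelta(occ["OUTAGE.START.TIME"]).dt.total_seconds()

# Split the start times by whether the detail column is missing.
is_missing = occ["CAUSE.CATEGORY.DETAIL"].isna()
result = ks_2samp(seconds[is_missing], seconds[~is_missing])
print(result.statistic, result.pvalue)
```

The K-S statistic is the largest vertical gap between the two empirical CDFs, which is why it suits a numeric column like start time better than the TVD used for the categorical YEAR comparison.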

Hypothesis Test

Finally, we turn to the hypothesis test. During the bivariate analysis, we noticed that whether a power outage was more likely to occur in summer or in winter seemed to depend on the climate region. Could this be due to random chance? Specifically, we are testing the question: Does the distribution of power outages across climate regions differ between winter and summer?

We set up our experiment like this:

Hypotheses

Null Hypothesis:

In the population, the distribution of power outages across US climate regions is the same in summer and winter, and the observed differences in our sample are due to random chance.

Alternative Hypothesis:

In the population, the distribution of power outages across US climate regions is different in summer and winter.

Testing

Our test statistic was the total variation distance (TVD), because we were comparing two categorical distributions. We set our significance level to the standard level of 0.05; that is, we accept a 5% chance of rejecting the null hypothesis when it is true. We used a permutation test because we are testing whether two sample distributions come from the same population distribution.
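A minimal sketch of this permutation test is shown below, using the summer and winter counts from the pivot table above as input (the missing winter count for North Temperate Zone is treated as 0, which is an assumption). The seed and the number of repetitions are arbitrary choices, so the estimated p-value will vary slightly from run to run and need not match the reported value exactly.

```python
import numpy as np

# Summer and winter counts per climate region, in the pivot table's row order.
summer = np.array([87, 55, 1, 112, 38, 87, 55, 25, 1, 61, 9])
winter = np.array([45, 27, 0, 83, 47, 46, 43, 29, 1, 62, 2])
n_regions = len(summer)

def tvd(p, q):
    """Total variation distance between two categorical distributions."""
    return np.abs(p - q).sum() / 2

observed = tvd(summer / summer.sum(), winter / winter.sum())

# Expand the counts into one entry per outage: region code and season flag.
regions = np.concatenate([np.repeat(np.arange(n_regions), summer),
                          np.repeat(np.arange(n_regions), winter)])
is_winter = np.concatenate([np.zeros(summer.sum(), dtype=bool),
                            np.ones(winter.sum(), dtype=bool)])

rng = np.random.default_rng(42)
n_reps = 5000
simulated = np.empty(n_reps)
for i in range(n_reps):
    # Shuffle the season labels, keeping the number of winter outages fixed.
    shuffled = rng.permutation(is_winter)
    s = np.bincount(regions[~shuffled], minlength=n_regions)
    w = np.bincount(regions[shuffled], minlength=n_regions)
    simulated[i] = tvd(s / s.sum(), w / w.sum())

# Share of shuffles with a TVD at least as large as the observed one.
p_value = (simulated >= observed).mean()
print(observed, p_value)
```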

Results

After running the permutation test, our p-value came out to around 0.0016.

Our p-value was less than our significance level (0.0016 < 0.05), so we reject the null in favor of the alternative hypothesis. It is possible that in the US, a person experiences differing levels of summer and winter power outages depending on which U.S. climate region they live in.