Introduction

I found this dataset by chance on data.world, and it immediately sparked my interest, as I have two small children and moved to India in 2017. The data is organized by state and specific crime from 2001 to 2012. It is a bit dated and not as granular as I would like (by city would have been nice), but the dataset is still worth exploring and useful for practicing some basic skills.

It should be noted that there isn’t any information about how this data was collected. Certain crimes appear more prevalent across all states, and for some there are no recorded cases at all. Perhaps people are less likely to report some crimes and more likely to report others. For the purpose of this analysis, I will take the data at face value and make assumptions along the way.

The dataset can be found on data.world at https://data.world/bhavnachawla/crime-rate-against-children-india-2001-2012.

Load the necessary libraries

library(plotly)
library(data.world)
library(tidyverse)
library(stringr)
library(stringi)
library(maptools)
library(RColorBrewer)
library(gridExtra)
library(ggthemes)
library(rcartocolor)
library(lubridate)
library(ghibli)
library(mapproj)

if (!require(gpclib)) install.packages("gpclib", type="source")
gpclibPermit()
## [1] TRUE
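
The plotting code later in this post refers to my_bkgd, my_font and map_theme, which are not defined in any of the code shown here. The sketch below is a stand-in so the plots can run; the colour, the font family and the theme settings are all placeholder assumptions, not the values originally used.

```r
library(ggplot2)

# Placeholder values: the original definitions are not shown in this post
my_bkgd <- "#f5f5f2"   # assumed light background colour
my_font <- "sans"      # assumed font family, available on most systems

# Assumed minimal map theme: no axes or grid lines, plain background
map_theme <- theme_void(base_family = my_font) +
  theme(plot.background = element_rect(fill = my_bkgd, color = NA))
```

Any valid colour string and installed font family can be swapped in.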

Accessing the data

As per data.world’s automatically generated notebook, the first step is querying the database and checking what tables are included.

# Datasets are referenced by their URL or path
dataset_key <- "https://data.world/bhavnachawla/crime-rate-against-children-india-2001-2012"
# List tables available for SQL queries
tables_qry <- data.world::qry_sql("SELECT * FROM Tables")
tables_df <- data.world::query(tables_qry, dataset = dataset_key)
## Rows: 1 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): tableId, tableName, tableTitle, owner, dataset
## lgl (1): tableDescription
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# See what is in it
tables_df$tableName
## [1] "crime_head_wise_persons_arrested_under_crime_against_children_during_2001_2012"

Next, we query the table found.

if (length(tables_df$tableName) > 0) {
  sample_qry <- data.world::qry_sql(sprintf("SELECT * FROM `%s`", tables_df$tableName[[1]]))
  sample_df <- data.world::query(sample_qry, dataset = dataset_key)
  #knitr::kable(head(sample_df), format = "html")
  sample_df
}
## Rows: 494 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): state_ut, crime_head
## dbl (12): 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 494 × 14
##    state_ut      crime…¹ `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008`
##    <chr>         <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 ANDHRA PRADE… INFANT…      1      1      3      0      0      0      1      0
##  2 ARUNACHAL PR… INFANT…      0      0      0      0      0      0      0      0
##  3 ASSAM         INFANT…      0      5      0      0      1      0      0      0
##  4 BIHAR         INFANT…      0      0      0      0      2      0      2      2
##  5 CHHATTISGARH  INFANT…      7     29      5     12      0     15     11      6
##  6 GOA           INFANT…      0      0      0      0      0      1      0      0
##  7 GUJARAT       INFANT…      2      0      0      1      6      0      7      0
##  8 HARYANA       INFANT…      0      3      0      1      0      0      1      5
##  9 HIMACHAL PRA… INFANT…      0      0      3      0      0      0      0      0
## 10 JAMMU & KASH… INFANT…      0      0      0      0      0      0      0      0
## # … with 484 more rows, 4 more variables: `2009` <dbl>, `2010` <dbl>,
## #   `2011` <dbl>, `2012` <dbl>, and abbreviated variable name ¹​crime_head

Data Cleaning

Now that we have data to work with, it makes sense to check for missing data and misspellings, and to reshape the data to make it easier to work with.

First, I’ll check for NA’s.

# check for NA's
any(is.na(sample_df))
## [1] FALSE

Since there aren’t any missing values, I’ll move on to checking for duplicates and typos (or duplicates caused by typos) in the state and crime columns. Below, we identify 35 unique states (38 less 3 totals) and 12 unique crimes (also excluding total crime).

Check for duplicate states:

sample_df %>%
  arrange(state_ut) %>%
  select(state_ut) %>%
  unique() 
## # A tibble: 38 × 1
##    state_ut         
##    <chr>            
##  1 A & N ISLANDS    
##  2 ANDHRA PRADESH   
##  3 ARUNACHAL PRADESH
##  4 ASSAM            
##  5 BIHAR            
##  6 CHANDIGARH       
##  7 CHHATTISGARH     
##  8 D & N HAVELI     
##  9 DAMAN & DIU      
## 10 DELHI            
## # … with 28 more rows

Check for duplicate crime types:

sample_df %>%
  arrange(crime_head) %>%
  select(crime_head) %>%
  unique()
## # A tibble: 13 × 1
##    crime_head                          
##    <chr>                               
##  1 ABETMENT OF SUICIDE                 
##  2 BUYING OF GIRLS FOR PROSTITUTION    
##  3 EXPOSURE AND ABANDONMENT            
##  4 FOETICIDE                           
##  5 INFANTICIDE                         
##  6 KIDNAPPING and ABDUCTION OF CHILDREN
##  7 MURDER OF CHILDREN                  
##  8 OTHER CRIMES AGAINST CHILDREN       
##  9 PROCURATION OF MINOR GILRS          
## 10 PROHIBITION OF CHILD MARRIAGE ACT   
## 11 RAPE OF CHILDREN                    
## 12 SELLING OF GIRLS FOR PROSTITUTION   
## 13 TOTAL CRIMES AGAINST CHILDREN

There are a number of observations labeled “total” in the states column that I don’t really need, so I’ll exclude them when creating a new dataframe (leaving the totals in the crime column). I’ll also fix a typo and convert states and crimes to title case.

#remove totals from state column -- NOTE that I leave the total in the crime column
df <- sample_df[!grepl("TOTAL", sample_df$state_ut),]

# fix typo
df$crime_head[df$crime_head=="PROCURATION OF MINOR GILRS"] <- "PROCURATION OF MINOR GIRLS"

#convert to title case
df$crime_head <- str_to_title(df$crime_head)
df$state_ut <- str_to_title(df$state_ut)
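
The typo above was spotted by eye. For the "duplicates caused by typos" case, where both the correct and the misspelled label are present, near-identical labels can be flagged with edit distance. This is only a sketch: the label vector and the max_dist threshold are illustrative assumptions, and in this dataset only the misspelled label appears, so this check alone would not have caught it.

```r
# Hypothetical labels, including both spellings of one crime
labels <- c("PROCURATION OF MINOR GILRS",
            "PROCURATION OF MINOR GIRLS",
            "RAPE OF CHILDREN",
            "MURDER OF CHILDREN")

# Pairwise generalized Levenshtein distances (utils::adist, base R)
d <- adist(labels)

# Keep each unordered pair once and flag pairs within max_dist edits
max_dist <- 2
pairs <- which(d <= max_dist & upper.tri(d), arr.ind = TRUE)
near_dupes <- data.frame(a = labels[pairs[, "row"]],
                         b = labels[pairs[, "col"]],
                         distance = d[pairs])
near_dupes
```

A swapped pair of letters counts as two substitutions, which is why the threshold is set to 2 rather than 1.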

Tidy Data

The data table appears to be set up to be readable in Excel (from my point of view). Gathering the years into one variable will make it easier to work with.

df <- df %>% gather("year", df, -state_ut, -crime_head, convert = T)
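
In current versions of tidyr, gather() is superseded by pivot_longer(). Here is a sketch of the same reshaping on a tiny made-up table; the state and crime values are hypothetical, and the value column is named count here purely for readability (the rest of this post keeps the name df).

```r
library(tidyr)

# Hypothetical wide table shaped like the source data: one column per year
wide <- data.frame(
  state_ut   = c("State A", "State B"),
  crime_head = c("Crime X", "Crime X"),
  `2001` = c(1, 0),
  `2002` = c(3, 5),
  check.names = FALSE
)

# Stack the year columns into a year/count pair and convert year to integer
long <- pivot_longer(
  wide,
  cols = -c(state_ut, crime_head),
  names_to = "year",
  values_to = "count",
  names_transform = list(year = as.integer)
)
long
```

Each state and crime combination now has one row per year, which is the shape that filter(), group_by() and ggplot() expect.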

Exploratory Data Analysis

Identify prevalent crimes in Tamil Nadu in 2012

I am still new to this and I suspect it makes more sense to begin with macro level analysis, but I started by focusing on the state of Tamil Nadu since that’s where I live. I was curious to see what crimes are most prevalent in this state.

df %>%
  filter(state_ut == "Tamil Nadu" & year == 2012) %>%
  arrange(desc(df)) 
## # A tibble: 13 × 4
##    state_ut   crime_head                            year    df
##    <chr>      <chr>                                <int> <dbl>
##  1 Tamil Nadu Total Crimes Against Children         2012  1105
##  2 Tamil Nadu Kidnapping And Abduction Of Children  2012   560
##  3 Tamil Nadu Rape Of Children                      2012   333
##  4 Tamil Nadu Murder Of Children                    2012   118
##  5 Tamil Nadu Other Crimes Against Children         2012    49
##  6 Tamil Nadu Procuration Of Minor Girls            2012    41
##  7 Tamil Nadu Abetment Of Suicide                   2012     2
##  8 Tamil Nadu Infanticide                           2012     1
##  9 Tamil Nadu Exposure And Abandonment              2012     1
## 10 Tamil Nadu Foeticide                             2012     0
## 11 Tamil Nadu Buying Of Girls For Prostitution      2012     0
## 12 Tamil Nadu Selling Of Girls For Prostitution     2012     0
## 13 Tamil Nadu Prohibition Of Child Marriage Act     2012     0

After identifying the most significant crimes in 2012, I chart how the frequency of these crimes changed over time.


crimes <- c("Kidnapping And Abduction Of Children",
            "Murder Of Children",
            "Rape Of Children",
            "Other Crimes Against Children")

df %>%
  filter((state_ut == "Tamil Nadu") & (crime_head %in% crimes )) %>%
  mutate(crime_head = str_replace(crime_head, "And", "&"),
         crime_head = str_remove(crime_head, " Of Children")) %>%
  ggplot(aes(year, df)) + 
  geom_line(color = ghibli_palettes$MononokeMedium[2], linewidth = 1.5) +
  geom_point(shape = 21, size = 2.5, col = my_bkgd, fill = ghibli_palettes$MononokeMedium[2]) +
  facet_wrap(~ crime_head, ncol = 2) +
  labs(y = NULL, x = NULL,
       title = "Growth in Number of Crimes Against Children in Tamil Nadu",
       subtitle = "Most Significant Types of Crime as of 2012") +
  scale_x_continuous(labels = function(x) as.integer(x)) +
  theme(axis.text.x = element_text(hjust=1),
        panel.spacing = unit(1.5, "lines"),
        panel.border = element_rect(color = "grey80", fill = NA))

Kidnapping and rape appear to have the most alarming trajectories. I’m curious what average annual growth looks like.

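df %>%
  filter(state_ut == "Tamil Nadu", crime_head %in% crimes) %>% 
  group_by(crime_head) %>%
  # keep the growth rate numeric so it sorts by value rather than as text
  summarize(CAGR = (df[year == 2012] / df[year == 2001]) ^ (1/11) - 1) %>%
  arrange(desc(CAGR)) %>%
  mutate(CAGR = scales::percent(CAGR, accuracy = 1))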
## # A tibble: 4 × 2
##   crime_head                           CAGR 
##   <chr>                                <chr>
## 1 Kidnapping And Abduction Of Children 47%  
## 2 Rape Of Children                     28%  
## 3 Murder Of Children                   18%  
## 4 Other Crimes Against Children        12%

Kidnappings have grown by almost 50% a year!
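
For reference, the growth figure used above is the compound annual growth rate: the constant yearly rate that would carry the first value to the last. Twelve years of data give eleven yearly steps, hence the exponent 1/11. The helper name cagr and the counts below are hypothetical, chosen only to illustrate the arithmetic.

```r
# Compound annual growth rate between a start and an end value
# separated by `periods` yearly steps
cagr <- function(start, end, periods) {
  (end / start)^(1 / periods) - 1
}

# Hypothetical counts: 10 cases in 2001 growing to 80 cases in 2012
rate <- cagr(start = 10, end = 80, periods = 2012 - 2001)
rate

# Compounding that rate for 11 years recovers the end value
10 * (1 + rate)^11
```

Because only the first and last years enter the formula, a very small starting count can produce a very large rate.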

Kidnappings and Abductions by State

To add a little more context, I’ll take a look at kidnapping and abductions by state. Below, I select 12 states that have had the most kidnappings over the 12-year period.

top_k <- 12

high_ka_states <- df %>%
  group_by(state_ut) %>%
  filter(crime_head == "Kidnapping And Abduction Of Children") %>%
  summarise(stotal = sum(df, na.rm = TRUE)) %>%
  slice_max(stotal, n = top_k)

kidnapping_plot <- df %>%
  filter(crime_head == "Kidnapping And Abduction Of Children", state_ut %in% high_ka_states$state_ut) %>%
  ggplot(aes(x=year,y=df, fill=state_ut, text = paste0("Year: ", year,"\nTotal: ", df))) +
  geom_bar(stat='identity') + 
  labs(title = 'Kidnapping And Abduction by State, 2001 - 2012', y = 'Number of Crimes', x='') +
  scale_x_continuous(labels = function(x) as.integer(x), breaks = seq(2000,2012,3)) +
  facet_wrap(~state_ut) + 
  scale_fill_manual(values = ghibli_palette(top_k, name = "MononokeMedium", type = "continuous")) +
  theme(legend.position='none',
        panel.spacing = unit(1.5, "lines"))

kidnapping_plot

Uttar Pradesh stands out quite a bit, especially in 2012. Taking a closer look, we see it had more than four times as many kidnappings as any other state in 2012!

lollipop_color <- ghibli_palettes$MononokeMedium[4]
hilight_color <- ghibli_palettes$MononokeMedium[6]

df %>%
  filter(crime_head == "Kidnapping And Abduction Of Children", year == 2012, df > 100) %>%
  mutate(state_ut = reorder(state_ut, df)) %>%
  ggplot(aes(x = state_ut, y = df)) + 
  geom_segment(aes(y = 0, yend = df, 
                   x = state_ut, xend = state_ut, 
                   color = if_else(state_ut == "Tamil Nadu", hilight_color, lollipop_color)), 
               size = 2) +
  geom_point(stat='identity', aes(color = if_else(state_ut == "Tamil Nadu", hilight_color, lollipop_color)), 
             size = 4) + 
  scale_color_manual(values = c(lollipop_color,hilight_color)) +
  coord_flip() +
  geom_text(aes(y = df, x = state_ut, label = df), 
            nudge_y = 200, hjust = 0, family = my_font, color = "#22211d") +
  expand_limits(y = c(0, 9000)) +
  labs(title = 'Number of Kidnappings & Abductions in 2012',
       subtitle = 'By State', 
       y = '', x='') +
  theme(legend.position='none',
        axis.text.x = element_blank(),
        panel.grid = element_blank())

Levelplot

The next question I have is what crimes are most significant in each state? A heatmap (or levelplot) might be the best way to visualize this. This also allows us to visualize the most prevalent crimes throughout India.

# Set up color palette and binned counts
# pal = c("#E2D7C5","#C6B1A9","#9E7D83","#7A4F61","#54203F")
pal = c("#e2d7c5","#c0a68d","#a0765b","#804633","#5e1414")
pal2 = c("#e5e5e2","#E2D7C5","#9E7D83","#54203F")

level_data <- df %>%
  filter(year == '2012', crime_head != "Total Crimes Against Children") %>%
  mutate(crime_head = str_remove(crime_head, " Of Children"))

colnames(level_data) <- c("State","Crime","Year","Count")

label_start <- c(0,10^1,10^2,10^3)
label_end <- c(10^1, 10^2, 10^3, 10^4)

level_data$bin <- cut(level_data$Count, breaks=c(0,10^1,10^2,10^3,10^4), 
                      labels=c(paste0(scales::comma(label_start)," - ",scales::comma(label_end))), 
                      include.lowest = TRUE)

level_data %>% 
  mutate(State = reorder(State, desc(State))) %>%
  ggplot(aes(x=Crime,y=State, z=bin)) +
    geom_tile(aes(fill = bin)) + 
    theme(axis.text.x = element_text(angle=90, hjust=1, vjust = 0.5),
          panel.grid = element_blank(), 
          plot.title = element_text(hjust = 0, face = "bold"),
          plot.margin = margin(10,70,10,20),
          legend.position = "top") +
    scale_fill_carto_d(palette = "Fall", name = "No. of Crimes (log scale)",
                       guide = guide_legend(keyheight = unit(2, units = "mm"), 
                                            keywidth = unit(18, units = "mm"), 
                                            title.position = "top", label.position = "bottom",
                                            title.hjust = 0, label.hjust = 0.1, nrow = 1)) +

  # scale_fill_manual(values = pal, name = "No. of Crimes (log scale)",
  #                   guide = guide_legend(keyheight = unit(2, units = "mm"), 
  #                                        keywidth = unit(18, units = "mm"), 
  #                                        title.position = "top", label.position = "bottom",
  #                                        title.hjust = 0, label.hjust = 0.1, nrow = 1)) +
    labs(x = "", y = "",
         title = "Number of Crimes by State - 2012",
         subtitle = "")

As you can see, kidnappings and rape seem most significant across India. ‘Other’ crime is also significant; more research is necessary to learn what that comprises. It also appears that about half of the crimes are very low or 0 by count, which makes me suspect that data was unavailable or that such crimes don’t often get reported or prosecuted.
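
The colour bins in the levelplot come from cut() with power-of-ten breaks. A small sketch with made-up counts shows where boundary values land; the counts vector is hypothetical, and the labels are typed out by hand here instead of being built with paste0().

```r
# Hypothetical counts spanning several orders of magnitude
counts <- c(0, 5, 10, 11, 100, 560, 5000)

# Same breaks as the levelplot: 0, 10, 100, 1,000, 10,000
bins <- cut(counts,
            breaks = c(0, 10, 100, 1000, 10000),
            labels = c("0 - 10", "10 - 100", "100 - 1,000", "1,000 - 10,000"),
            include.lowest = TRUE)

data.frame(counts, bins)
table(bins)
```

The intervals are closed on the right, so a count of exactly 10 falls in "0 - 10" and a count of 100 in "10 - 100"; include.lowest = TRUE is what keeps the zeros in the first bin instead of turning them into NA.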

Total Crime By State

Shifting to a more macro view, we’ll take a look at total crimes by state over time. I select the top 12 states by cumulative total crime over the period. From the charts below, it appears that Madhya Pradesh and Maharashtra have had higher crime, but with low growth, over time. Crime in Uttar Pradesh, however, has been sporadic and grew significantly between 2010 and 2012.

Again, I’m interested in average annual growth, but here I take a look at total crimes by state. Tamil Nadu comes out on top. That is likely because we’re dealing with smaller numbers, but the trajectory is still quite steep. Uttar Pradesh had an average annual growth in crime of about 6% from 2001 to 2012, but crime fell from 2001 to 2002. Average growth from 2002 to 2012 was about 14.4%, which is more than twice as fast as indicated, but still places in the lower half of the chart below.

# Create vector to highlight first bar in chart
gr_ch_cols <- c("two", rep("one", 14))

growth.tbl <- df %>%
  filter(crime_head == "Total Crimes Against Children", year %in% c(2001, 2012), df > 0) %>%
  group_by(state_ut) %>%
  # keep only states with a non-zero count in both years, so each group yields exactly one row
  filter(n() == 2) %>%
  summarize(growth = 100 * ((df[year == 2012] / df[year == 2001]) ^ (1/11) - 1) ) %>%
  arrange(desc(growth))
growth.tbl %>%
  head(15) %>%
  ggplot(aes(x = fct_reorder(state_ut, growth), y = growth)) + 
  geom_bar(stat='identity', aes(fill = gr_ch_cols)) +
  scale_fill_manual(values = c(ghibli_palettes$MononokeMedium[3], ghibli_palettes$MononokeMedium[1])) +
  coord_flip() +
    geom_text(aes(y = growth, x = seq(15,1), label = paste0(round(growth),"%")), 
              nudge_y = -0.5, hjust = 1, color = my_bkgd, family = my_font) +
    labs(title = 'Geometric Growth Of Total Crimes Against Children', 
         subtitle = '2001 to 2012',
         y = '', x='') +
    theme(legend.position='none', 
          axis.text.x = element_blank(),
          panel.grid = element_blank()) 
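
The Uttar Pradesh comparison above shows how sensitive geometric growth is to the choice of base year. The sketch below reproduces that effect with an invented series that drops sharply after its first year; the totals vector and the growth_between helper are hypothetical and are not taken from the dataset.

```r
# Hypothetical yearly totals for one state, 2001 to 2012,
# with a sharp drop between the first and second year
totals <- c(`2001` = 3000, `2002` = 1500, `2003` = 1700, `2004` = 1900,
            `2005` = 2100, `2006` = 2400, `2007` = 2700, `2008` = 3000,
            `2009` = 3400, `2010` = 3900, `2011` = 4700, `2012` = 6000)

# Geometric growth between two years picked from the named vector
growth_between <- function(x, from, to) {
  (x[[as.character(to)]] / x[[as.character(from)]])^(1 / (to - from)) - 1
}

g_2001 <- growth_between(totals, 2001, 2012)  # base year before the drop
g_2002 <- growth_between(totals, 2002, 2012)  # base year after the drop
c(from_2001 = g_2001, from_2002 = g_2002)
```

With the same end value, starting one year later more than doubles the rate in this made-up series, which is why a single start-to-end figure should be read alongside the full time series.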

Geographic Distribution of Total Crime

Since I’m working with geographic data, I’d like to map it to visualize the relationship between crime and neighboring states. First, I have to prepare the dataframes for mapping and load the shape file for the states of India. I found a really helpful blogpost on this here.

# subset df for 2001 
total_by_state_01 <- df %>%
  filter(crime_head == "Total Crimes Against Children", year == 2001, df >= 0) %>%
  mutate(state_ut = reorder(state_ut, df)) %>%
  select(state_ut, df)

# subset df for 2012 
total_by_state <- df %>%
  filter(crime_head == "Total Crimes Against Children", year == 2012) %>%
  mutate(state_ut = reorder(state_ut, df)) %>%
  select(state_ut, df)

# subset df to compute the median number of crimes over the entire period
med_by_state <- df %>%
  filter(crime_head == "Total Crimes Against Children", df >= 0) %>%
  group_by(state_ut) %>%
  summarise(median = median(df, na.rm = T)) %>%
  arrange(desc(median))

# load shape file
states.shp <- rgdal::readOGR("India_Shape/IND_adm1.shp")
## OGR data source with driver: ESRI Shapefile 
## Source: "/home/dave/R/blog/content/blog/India_Shape/IND_adm1.shp", layer: "IND_adm1"
## with 37 features
## It has 12 fields
## Integer64 fields read as strings:  ID_0 ID_1 CCN_1
gpclibPermit()
## [1] TRUE
states.shp.f <- fortify(states.shp, region = "ID_1")

# create a temporary dataframe from names and IDs
tem_df <- data.frame(states.shp$ID_1, states.shp$NAME_1)

# join mapping dataframes with tem_df to facilitate merging later
total_by_state <- left_join(total_by_state, tem_df, by=c("state_ut" = "states.shp.NAME_1"))
total_by_state_01 <- left_join(total_by_state_01, tem_df, by=c("state_ut" = "states.shp.NAME_1"))
med_by_state <- left_join(med_by_state, tem_df, by=c("state_ut" = "states.shp.NAME_1"))

# rename columns for readability
colnames(total_by_state) <- c("state","count","id")
colnames(med_by_state) <- c("state","median","id")
colnames(total_by_state_01) <- c("state","count","id")

# fix ID's that didn't quite match up for each dataframe
fix_states <- function(df){
  df$id[df$state == "A & N Islands"] <- 1
  df$id[df$state == "Jammu & Kashmir"] <- 14
  df$id[df$state == "D & N Haveli"] <- 8
  df$id[df$state == "Daman & Diu"] <- 9
  df$id[df$state == "Delhi"] <- 25
  return(df)
}

total_by_state <- fix_states(total_by_state)
total_by_state_01 <- fix_states(total_by_state_01)
med_by_state <- fix_states(med_by_state)

# I found Tamil Nadu was duplicated so the following code removes all duplicates
total_by_state <- total_by_state[!duplicated(total_by_state),]
total_by_state_01 <- total_by_state_01[!duplicated(total_by_state_01),]
med_by_state <- med_by_state[!duplicated(med_by_state),]

# rename columns in growth table (used for geometric mean previously)
colnames(growth.tbl) <- c("state","growth")

# merge growth figures with dataframes -- I decided not to use this in the end but leave it
# so as not to break anything I can't fix
total_by_state <- merge(total_by_state, growth.tbl, by="state", all.x=T)
total_by_state_01 <- merge(total_by_state_01, growth.tbl, by="state", all.x=T)
med_by_state <- merge(med_by_state, growth.tbl, by="state", all.x=T)

# create and sort tables for mapping
merge_tbl <- merge(states.shp.f, total_by_state, by="id", all.x=T)
merge_tbl_01 <- merge(states.shp.f, total_by_state_01, by="id", all.x=T)
merge_tbl_med <- merge(states.shp.f, med_by_state, by="id", all.x=T)

final.plt <- merge_tbl[order(merge_tbl$order),]
final.plt.01 <- merge_tbl_01[order(merge_tbl_01$order),]
final.plt.med <- merge_tbl_med[order(merge_tbl_med$order),]
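
The IDs patched by hand in fix_states() belong to states whose names are spelled differently in the crime table and in the shape file. One way to list such mismatches before joining is a plain set difference. This is a sketch only: both name vectors and the alias lookup are made up for illustration and are not the actual contents of the shape file.

```r
# Hypothetical state names from the crime table and from a shape file
crime_states <- c("A & N Islands", "Goa", "Kerala", "Jammu & Kashmir")
shape_states <- c("Andaman and Nicobar", "Goa", "Kerala", "Jammu and Kashmir")

# Names in the crime table with no exact match in the shape file
unmatched <- setdiff(crime_states, shape_states)
unmatched

# One possible fix: a lookup of known aliases (assumed spellings)
aliases <- c("A & N Islands"   = "Andaman and Nicobar",
             "Jammu & Kashmir" = "Jammu and Kashmir")
fixed <- ifelse(crime_states %in% names(aliases),
                aliases[crime_states], crime_states)

# Empty result means every name now has a match
setdiff(fixed, shape_states)
```

Renaming before the join keeps the fix tied to names, whereas hard-coded numeric IDs can silently break if a different shape file is used.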

First, a comparison between the total number of crimes in 2001 and 2012. Note the white state (grey in the median map) just below the center, Telangana. This state was formed from the northwest part of Andhra Pradesh in 2014, after this dataset was created.

# Create dataframe aggregating all crime for all years
total_all_yr <- df %>%
  group_by(year) %>%
  filter(crime_head == "Total Crimes Against Children") %>%
  mutate(state_ut = reorder(state_ut, df)) %>%
  select(state_ut, df)

# Process and merge with map data
total_by_state_all <- left_join(total_all_yr, tem_df, by=c("state_ut" = "states.shp.NAME_1"))
colnames(total_by_state_all) <- c("year","state","count","id")
total_by_state_all <- fix_states(total_by_state_all)
total_by_state_all <- total_by_state_all[!duplicated(total_by_state_all),]
total_by_state_all <- merge(total_by_state_all, growth.tbl, by="state", all.x=T)
merge_tbl_01_12 <- merge(states.shp.f, total_by_state_all, by="id")
final.plt.01_12 <- merge_tbl_01_12[order(merge_tbl_01_12$order),]
final.plt.01_12$year <- parse_date_time(final.plt.01_12$year,"%Y")

label_start <- c(seq(0,10000,2000))
label_end <- c(seq(2000,12000,2000))

final.plt.01_12$bin <- cut(final.plt.01_12$count, breaks=c(seq(0, 12000, 2000), Inf), 
                      labels=c(paste0(scales::comma(label_start)," - ",scales::comma(label_end)), "12000+"), 
                      include.lowest = TRUE)

final.plt.01_12 %>%
  filter(year(year) %in% c(2001,2012)) %>%
  ggplot() +
  geom_polygon(aes(x = long, y = lat, fill = bin, group = group), color = my_bkgd, size = 0.25) + 
  coord_map() +
  scale_fill_carto_d(palette = "Fall", name="No. of Crimes", 
                    guide = guide_legend(keyheight = unit(2, units = "mm"), 
                                         keywidth = unit(18, units = "mm"), 
                                         title.position = "top", label.position = "bottom",
                                         title.hjust = 0, label.hjust = 0.1, nrow = 1)) +
  facet_wrap(.~year(year)) +
  map_theme +
  theme(strip.background = element_blank(),
        strip.text = element_text(face="bold", size = 12, hjust = 0.5),
        legend.box.just = "top",
        legend.background = element_blank(),
        legend.position = c(0.7,0)) +
  labs(x="",y="",
       title = "Number of Crimes in India")

Crime has grown over time, particularly in northern India, and from there, it appears to be growing in nearby states as well. Without population data, it’s difficult to draw much more insight.

A quick look at the median number of crimes over that period tells a similar story, but crime is concentrated a little differently.

myt <- ttheme_default(
  base_size = 8,
  base_family = my_font, 
  base_colour =  "#22211d",
  core = list(fg_params=list(hjust = 0, x = 0.1),
              bg_params=list(fill = "#f6edbd")),
  colhead = list(fg_params=list(col = my_bkgd, fontface = "bold.italic", hjust = 0, x = 0.1),
                 bg_params=list(fill = "#3d5941"))
 )

label_start <- c(seq(0,3000,1500))
label_end <- c(seq(1500,4500,1500))

final.plt.med$bin <- cut(final.plt.med$median, breaks=c(seq(0,4500,1500), Inf), 
                      labels=c(paste0(scales::comma(label_start)," - ",scales::comma(label_end)), "4500+"), 
                      include.lowest = TRUE)

final.plt.med <- final.plt.med %>% filter(!is.na(bin))

median.table <- med_by_state %>% arrange(desc(median)) %>% select(state,median) %>% slice(1:5)
colnames(median.table) <- c("State", "Median")

# Create dataframe of state names and state centers (lat, long)
cnames <- aggregate(cbind(long, lat) ~ state, data=final.plt.med, FUN=function(x) mean(range(x)))

# Process names by replacing last space with newline char
cnames$state <- stri_replace_last_charclass(cnames$state, "\\p{WHITE_SPACE}", "\n")

median_plot <- ggplot() +
  geom_polygon(data = final.plt.med, 
               aes(x = long, y = lat, group = group, fill = bin), 
               color = my_bkgd, size = 0.25) + 
  geom_text(data=cnames, aes(long, lat, label = state), size=2, color = "#22211d", family = my_font) +
  coord_map() +
  scale_fill_carto_d(palette = "Fall", name="Median",
                    guide = guide_legend(keyheight = unit(2, units = "mm"), 
                                         keywidth = unit(20, units = "mm"), 
                                         title.position = "top", label.position = "bottom",
                                         title.hjust = 0, label.hjust = 0.1, nrow = 1)) +

  # scale_fill_manual(values = pal[1:4], name="Median",
  #                   guide = guide_legend(keyheight = unit(2, units = "mm"), 
  #                                        keywidth = unit(18, units = "mm"), 
  #                                        title.position = "top", label.position = "bottom",
  #                                        title.hjust = 0, label.hjust = 0.1, nrow = 1)) +
  
    labs(title="Median Number of Crimes Against Children", 
       subtitle = "By State, 2001 to 2012",
       x = "", y = "") +
  map_theme + 
  theme(plot.title = element_text(hjust=0),
        legend.box.just = "top",
        legend.background = element_blank(),
        legend.position = c(0.7,0))

g <- tableGrob(median.table, rows=NULL, theme = myt) 

median_plot + annotation_custom(g, xmin=88, xmax=98, ymin=8, ymax=18) + coord_cartesian()

Similar to the faceted bar charts (above) depicting total crime by state, Madhya Pradesh and Maharashtra have had consistently high crime with little variance. Uttar Pradesh has had significant variance from year to year, but still falls in the top three in terms of median number of crimes.
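
The point about variance can be put in numbers by pairing each state's median with a measure of spread. The sketch below uses the coefficient of variation (standard deviation divided by mean), which is my choice of measure and not one used elsewhere in this post, on two invented states.

```r
# Hypothetical yearly totals for two made-up states
crime <- data.frame(
  state = rep(c("Steady", "Volatile"), each = 6),
  count = c(4000, 4100, 3900, 4200, 4000, 4100,
            1000, 6000, 2000, 7000, 1500, 6500)
)

# Median and coefficient of variation (sd / mean) per state
summary_tbl <- do.call(rbind, lapply(split(crime$count, crime$state), function(x) {
  data.frame(median = median(x), cv = sd(x) / mean(x))
}))
summary_tbl
```

The two made-up states have almost the same median but very different spread, which is the pattern a median-only map cannot show.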

Next steps . . .

Any time I see maps like the ones I just made, I am reminded of this comic from xkcd:

xkcd

Comparing growth rates in crime versus population would likely yield a much better assessment of crime rates, but I haven’t found the right data (yet).
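
As a sketch of what that comparison could look like once population data is available, counts can be converted to a rate per 100,000 residents. All figures below are invented for illustration, and the helper name rate_per_100k is hypothetical.

```r
# Crimes per 100,000 residents
rate_per_100k <- function(crimes, population) {
  crimes / population * 1e5
}

# Hypothetical figures, not taken from the dataset or any census
states <- data.frame(
  state      = c("Large state", "Small state"),
  crimes     = c(6000, 600),
  population = c(200e6, 10e6)
)
states$rate <- rate_per_100k(states$crimes, states$population)

# Sort from highest to lowest rate
states[order(-states$rate), ]
```

In this made-up case the smaller state has a tenth of the crimes but twice the rate, which is exactly the kind of reversal a raw-count map hides.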

Ideally, I’d like to get current crime and total population data. By city would also be great. If I can find this data, I’ll put together another post.