{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 24 Exploring data\n",
    "This chapter covers\n",
    "* Python’s advantages for handling data\n",
    "* Jupyter Notebook\n",
    "* Pandas\n",
    "* Data aggregation \n",
    "* Plotting with matplotlib\n",
    "\n",
    "Over the last few chapters we’ve dealt with some aspects of using Python to get and to clean data. Now it’s time to look at a few of the things that Python can help you do to manipulate and explore data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 24.1\tPython tools for data exploration\n",
    "In this chapter we’ll look at some common Python tools for data exploration, like Jupyter notebook, pandas, and matplotlib. We can only touch briefly on a few features of these tools, but the aim is to give you an idea of what might be possible and some initial tools to use in exploring data with Python. \n",
    "### 24.1.1\tPython’s advantages for exploring data\n",
    "Python has become one of the leading languages for data science and continues to grow in that area. As we have mentioned, Python is not always the fastest in terms of raw performance. On the other hand, some data crunching libraries, like NumPy, are largely written in C and heavily optimized to the point that speed is not an issue. In addition, other considerations, such as readability and accessibility, often outweigh pure speed – minimizing the amount of developer time needed is often more important. Python is readable and accessible, and both on its own and in combination with tools developed in the Python community, it is an enormously powerful tool for manipulating and exploring data. \n",
    "### 24.1.2\tPython can be better than a spreadsheet\n",
    "Spreadsheets have been the tool of choice for ad hoc data manipulation for decades. People who are skilled with spreadsheets can make them do truly impressive tricks – they can combine different but related data sets, pivot tables, use lookup tables to link data sets, and much more. But while people everywhere get a vast amount of work done with them every day, spreadsheets do have limitations, and Python can help you go beyond those limitations. \n",
    "\n",
    "One limitation already alluded to is the fact that most spreadsheet software has a row limit, currently about a million rows, which is often not enough for current data sets. Another limitation is the central metaphor of the spreadsheet itself. Spreadsheets are two dimensional grids, rows and columns, or at best stacks of grids, and this limits the ways one can manipulate and think about complex data. With Python you can code your way around the limitations of spreadsheets and manipulate data the way you want. You can combine Python data structures like lists, tuples, sets, and dictionaries in endlessly flexible ways, or you can create your own classes to package both data and behavior in exactly the way you need. \n",
    "## 24.2\tJupyter Notebook\n",
    "Probably one of the most compelling tools for exploring data with Python doesn’t augment what the language itself does, but instead changes the way you can use the language to interact with your data. Jupyter notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. While several other languages are now supported, it originated in connection with IPython, an alternative shell for Python that was developed by the scientific community. \n",
    "\n",
    "What makes Jupyter such a convenient and powerful tool is that you interact with it in a web browser, it lets you combine text and code, and you can modify and execute your code interactively. Not only can you run and modify code in chunks, you can also save and share the notebooks with others. \n",
    "\n",
    "The best way to get a feel for what Jupyter notebooks can do is to start playing with one. It’s fairly easy to run a Jupyter process locally on your machine, or you can access online versions. For some options, see the sidebar on ways to run Jupyter.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ways to Run Jupyter\n",
    "**Jupyter online:** accessing online instances of Jupyter is one of the easiest ways to get started. Currently Project Jupyter, the community behind Jupyter, hosts free notebooks at https://try.jupyter.org, where you can also find demo notebooks and kernels for other languages. At the time of this writing, you could also access free notebooks on Microsoft’s Azure platform at https://notebooks.azure.com, and there are many other options.\n",
    "\n",
    "**Jupyter locally:** while using an online instance is quite convenient, it’s not very much work to set up your own instance of Jupyter on your computer. Usually for local versions you point your browser to `localhost:8888`.\n",
    "\n",
    "If you use Docker, there are several containers to choose from, and running the data science notebook container would be something like:\n",
    "```\n",
    "docker run -it --rm -p 8888:8888 jupyter/datascience-notebook\n",
    "```\n",
    "If you’d rather run directly on your system, it’s easy to install and run Jupyter in a virtualenv.\n",
    "\n",
    "**macOS and Linux systems:**\n",
    "\n",
    "First, open a command window and enter the following commands:\n",
    "```\n",
    " > python3 -m venv jupyter\n",
    " > cd jupyter\n",
    " > source bin/activate\n",
    " > pip install jupyter\n",
    " > jupyter-notebook\n",
    "```\n",
    "**Windows systems:**\n",
    "```\n",
    " > python3 -m venv jupyter\n",
    " > cd jupyter\n",
    " > Scripts\\activate\n",
    " > pip install jupyter\n",
    " > Scripts\\jupyter-notebook\n",
    "```\n",
    "The last command should run the Jupyter Notebook web app and open a browser window pointing at it. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  24.2.1\tStarting a kernel\n",
    "Once you have Jupyter installed and running and open in your browser, you will need to start up a Python kernel. One of the nice things about Jupyter is that you can run multiple kernels at the same time and you can run kernels for different versions of Python and for other languages like R, Julia, even Ruby.\n",
    "\n",
    "Starting a kernel is easy – just click the New button and select Python 3.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 24.2.2\tExecuting code in a cell\n",
    "Once you have a kernel running, you can start entering and running Python code. Right away, you’ll notice a few differences from the ordinary Python command shell. You won’t get the >>> prompt that you see in the standard Python shell, and pressing enter will just add new lines in the cell. To execute the code in a cell, you can use the Cell menu and choose “Run Cells”, you can use the “run” button immediately to the left of the down arrow in the button bar, or you can use the key combination alt-enter. After you use Jupyter notebook a little bit, it’s quite likely that the alt-enter key combination will become second nature. \n",
    "\n",
    "You can test out how it works by entering some code or an expression into the first cell of your new notebook and then pressing alt-enter."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, any output is shown immediately below the cell, and a new cell is created, ready for your next input. Also note that each cell that is executed is numbered in the order in which it is executed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### TRY THIS: USING JUPYTER NOTEBOOK\n",
    "Enter some code in the notebook and experiment with running it. Check out the Edit, Cell, and Kernel menus to see what options are there. Once you have a little code running, use the Kernel menu to restart the kernel, then repeat your steps, then use the Cell menu to rerun the code in all of the cells.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 24.3\tPython and Pandas\n",
    "In the course of exploring and manipulating data there are quite a few common operations – loading data into a list or dictionary, cleaning data, filtering data, and so on. Most of these operations are repeated often, have standard patterns for how they should be done, and are simple and often tedious. If you think that combination is a strong suggestion that those tasks should be automated, you’re not alone. One of the now standard tools for handling data in Python, Pandas, was created to automate the boring heavy lifting of handling data sets.\n",
    "\n",
    "### 24.3.1\tWhy you might want to use Pandas\n",
    "Pandas was created to make manipulating and analyzing tabular or relational data easy by providing a standard framework for holding the data, with convenient tools for frequent operations. As such, it’s almost more of an extension to Python than a library, and it changes the way you can interact with data. The plus side is that once you grok the way that pandas does things, you can do some impressive things and save a lot of time; on the other hand, it does take time to learn how to get the most from pandas. As with many tools, if you use pandas for what it was designed for, it excels. The simple examples below should give you a rough idea of whether pandas is a tool suited to your use cases.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 24.3.2\tInstalling Pandas\n",
    "Pandas can be easily installed using pip, and is often used along with matplotlib for plotting, so you can install both from the command line of your Jupyter virtual environment with:\n",
    "```\n",
    "> pip install pandas matplotlib\n",
    "```\n",
    "Or from a cell in a jupyter notebook you can use:\n",
    "```\n",
    "In [ ]: !pip install pandas matplotlib\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: pandas in /home/naomi/Documents/QPB3E/case_study/lib/python3.6/site-packages\n",
      "Requirement already satisfied: matplotlib in /home/naomi/Documents/QPB3E/case_study/lib/python3.6/site-packages\n",
      "Requirement already satisfied: numpy in /home/naomi/Documents/QPB3E/case_study/lib/python3.6/site-packages\n",
      "Requirement already satisfied: python-dateutil>=2 in /home/naomi/Documents/QPB3E/case_study/lib/python3.6/site-packages (from pandas)\n",
      "Requirement already satisfied: pytz>=2011k in /home/naomi/Documents/QPB3E/case_study/lib/python3.6/site-packages (from pandas)\n",
      "Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /home/naomi/Documents/QPB3E/case_study/lib/python3.6/site-packages (from matplotlib)\n",
      "Requirement already satisfied: cycler>=0.10 in /home/naomi/Documents/QPB3E/case_study/lib/python3.6/site-packages (from matplotlib)\n",
      "Requirement already satisfied: six>=1.10 in /home/naomi/Documents/QPB3E/case_study/lib/python3.6/site-packages (from matplotlib)\n"
     ]
    }
   ],
   "source": [
    "!pip install pandas matplotlib numpy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Life with pandas will be easier if we start with the following three lines.\n",
    "```\n",
    "%matplotlib inline\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "```\n",
    "The first line is a Jupyter “magic” function that enables matplotlib to plot data in the cell where your code is, which will be very useful. The second line imports pandas with the alias pd, which is both easier to type and common among pandas users, while the last line imports numpy. While pandas depends quite a bit on numpy, we won’t explicitly be using it in the examples below, but it’s reasonable to get in the habit of importing it anyway."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 24.3.3\tData Frames\n",
    "One of the basic structures you get with Pandas is a data frame. A data frame is a two-dimensional grid, rather similar to a relational database table, except in memory. Creating a data frame is easy – you just give it some data. To keep things absolutely simple, we’ll give it a 3x3 grid of numbers as our first example. In Python such a grid would be a list of lists:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[1, 2, 3], [4, 5, 6], [7, 8, 9]]\n"
     ]
    }
   ],
   "source": [
    "grid = [[1,2,3], [4,5,6], [7,8,9]]\n",
    "print(grid)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sadly, in Python our grid won’t look like a grid unless we make some additional effort. So let’s see what we can do with the same grid as a pandas data frame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   0  1  2\n",
      "0  1  2  3\n",
      "1  4  5  6\n",
      "2  7  8  9\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "df = pd.DataFrame(grid)\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That was fairly straightforward – all we needed to do was turn our grid into a data frame, and we’ve already gained a more grid-like display, with both row and column numbers. Of course, it’s often rather bothersome to keep track of which column number is which, so let’s give our columns names:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   one  two  three\n",
      "0    1    2      3\n",
      "1    4    5      6\n",
      "2    7    8      9\n"
     ]
    }
   ],
   "source": [
    "df = pd.DataFrame(grid, columns=[\"one\", \"two\", \"three\"] )\n",
    "print(df)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You may be wondering if naming the columns has any benefit, but the column names can be put to use with another of pandas’ tricks, the ability to select columns by name. If we want the contents only of column “two” for example, we can get that very simply:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    2\n",
      "1    5\n",
      "2    8\n",
      "Name: two, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print(df[\"two\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here pandas has already saved us some work over plain Python – to get only column two of our grid we’d need to use a list comprehension, while also remembering to use a zero-based index (and we still wouldn’t get the nice output)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[2, 5, 8]\n"
     ]
    }
   ],
   "source": [
    "print([x[1] for x in grid])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And we can loop over data frame column values just as easily as the list we got by using a comprehension:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2\n",
      "5\n",
      "8\n"
     ]
    }
   ],
   "source": [
    "for x in df[\"two\"]:\n",
    "    print(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That’s not bad for a start, but by using a list of columns in double brackets we can do better, and get a subset of the data frame that is itself another data frame. Instead of the middle column, let’s get the first and last columns of our data frame as another data frame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   one  three\n",
      "0    1      3\n",
      "1    4      6\n",
      "2    7      9\n"
     ]
    }
   ],
   "source": [
    "edges = df[[\"one\", \"three\"]]\n",
    "print(edges)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A data frame also has a number of methods that will apply the same operation and argument to every item in the frame. For example, if we wanted to add two to every item in the data frame edges, we could just use the `add()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   one  three\n",
      "0    3      5\n",
      "1    6      8\n",
      "2    9     11\n"
     ]
    }
   ],
   "source": [
    "print(edges.add(2))"
   ]
  },
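  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A column also has some convenient summary methods. For example, `value_counts()` tallies how many times each value occurs, which is handy for getting a quick feel for a column’s contents. As a minimal sketch, counting the values in column \"one\" of our data frame:\n",
    "```\n",
    "df[\"one\"].value_counts()\n",
    "```"
   ]
  },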
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2    1\n",
       "5    1\n",
       "8    1\n",
       "Name: two, dtype: int64"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# count how many times each value occurs in column \"two\"\n",
    "df['two'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here again, it’s possible to get the same result using list comprehensions and/or nested loops, but it’s not as convenient. It’s pretty easy to see how such functionality can make life easier, particularly for someone who is more interested in the information the data contains than in the process of manipulating it. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 24.4\tData Cleaning\n",
    "In earlier chapters we have seen a few ways that Python can be used to clean data. Now that we have pandas added into the mix, we’ll look at some examples of how to use some of its functionality to clean data. As we go through the following operations we’ll also refer to ways that the same operation might be done in plain Python, both to illustrate how using pandas is different, and because pandas is not right for every use case (or user, for that matter). \n",
    "### 24.4.1\tLoading and saving data with pandas\n",
    "Pandas has an impressive collection of methods to load data from different sources. Not only are a number of file formats supported, including fixed-width and delimited text files, spreadsheets, JSON, XML, and HTML, but it’s also possible to read from SQL databases, Google BigQuery, HDF, even clipboard data, and more. You should be aware that many of these operations are not actually part of pandas itself; rather, pandas relies on having other libraries installed to handle them, such as SQLAlchemy for reading from SQL databases. This matters mostly if something goes wrong, where quite often the problem that needs to be fixed is outside of pandas and you are left to deal with the underlying library. \n",
    "\n",
    "For example, reading a JSON file is simple using the `read_json()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                                   report\n",
      "abs_humidity                         None\n",
      "atmo_opacity                        Sunny\n",
      "ls                                    296\n",
      "max_temp                               -1\n",
      "max_temp_fahrenheit                  30.2\n",
      "min_temp                              -72\n",
      "min_temp_fahrenheit                 -97.6\n",
      "pressure                              869\n",
      "pressure_string                    Higher\n",
      "season                           Month 10\n",
      "sol                                  1576\n",
      "sunrise              2017-01-11T12:31:00Z\n",
      "sunset               2017-01-12T00:46:00Z\n",
      "terrestrial_date               2017-01-11\n",
      "wind_direction                         --\n",
      "wind_speed                           None\n"
     ]
    }
   ],
   "source": [
    "mars = pd.read_json(\"mars_data_01.json\")\n",
    "print(mars)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For another example of how simple reading data into pandas is, we’ll load some data from the CSV file of temperature data we used in Chapter 21, this time using the `read_csv()` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "           4      5    6     7     8      9    10    11    12       13   14  \\\n",
      "0  1979/01/01  17.48  994   6.0  30.5   2.89  994 -13.6  15.8  Missing    0   \n",
      "1  1979/01/02   4.64  994  -6.4  15.8  -9.03  994 -23.6   6.6  Missing    0   \n",
      "2  1979/01/03  11.05  994  -0.7  24.7  -2.17  994 -18.3  12.9  Missing    0   \n",
      "3  1979/01/04   9.51  994   0.2  27.6  -0.43  994 -16.3  16.3  Missing    0   \n",
      "4  1979/05/15  68.42  994  61.0  75.1  51.30  994  43.3  57.0  Missing    0   \n",
      "5  1979/05/16  70.29  994  63.4  73.5  48.09  994  41.1  53.0  Missing    0   \n",
      "6  1979/05/17  75.34  994  64.0  80.5  50.84  994  44.3  55.7    82.60    2   \n",
      "7  1979/05/18  79.13  994  75.5  82.1  55.68  994  50.0  61.1    81.42  349   \n",
      "8  1979/05/19  74.94  994  66.9  83.1  58.59  994  50.9  63.2    82.87   78   \n",
      "\n",
      "        15       16      17  \n",
      "0  Missing  Missing   0.00%  \n",
      "1  Missing  Missing   0.00%  \n",
      "2  Missing  Missing   0.00%  \n",
      "3  Missing  Missing   0.00%  \n",
      "4  Missing  Missing   0.00%  \n",
      "5  Missing  Missing   0.00%  \n",
      "6    82.40    82.80   0.20%  \n",
      "7    80.20    83.40  35.11%  \n",
      "8    81.60    85.20   7.85%  \n"
     ]
    }
   ],
   "source": [
    "temp = pd.read_csv(\"temp_data_01.csv\", header=0, names=range(18), usecols=range(4,18))\n",
    "\n",
    "print(temp)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**#A Note that the “\\” at the end of the header line is an indication that the table is too long to be printed on one line, and more columns are printed below.**\n",
    "\n",
    "Clearly, loading the file in a single step is appealing, and we can see that pandas has had no issues loading the file. We do still have the same issue with “Missing” for some values, and it might make sense to have those “Missing” values converted to “NaN”, or “Not a Number”, pandas’ marker for missing data. \n",
    "```\n",
    "temp = pd.read_csv(\"temp_data_01.csv\", na_values=['Missing'])\n",
    "```\n",
    "\n",
    "The addition of the na_values parameter controls what values will be translated to NaN on load. In this case we added the string \"Missing\" so the row of the data frame was translated from:\n",
    "```\n",
    "NaN  Illinois  17  Jan 01, 1979  1979/01/01  17.48  994  6.0  30.5  2.89\t994  -13.6  15.8  Missing  0  Missing  Missing  0.00%\n",
    "```\n",
    "to\n",
    "```\n",
    "NaN  Illinois  17  Jan 01, 1979  1979/01/01  17.48  994  6.0  30.5  2.89\t994  -13.6  15.8  NaN\t0  NaN  NaN  0.00%\n",
    "```\n",
    "This can be particularly useful if you have one of those data files where for whatever reason \"no data\" is indicated in a variety of ways – “NA”, “N/A”, “?”, “-“, and so on. To handle a case like that you can inspect the data to find out what is used and then reload it using the na_values parameter to standardize all of those variations as NaN. "
   ]
  },
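  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sketch (with made-up data for illustration), suppose a file marks missing values in several of those ways; passing all of the markers in na_values normalizes them to NaN in one pass:\n",
    "```\n",
    "import io\n",
    "csv_text = \"high,low\\n17.48,NA\\nN/A,6.6\\n?,-\\n\"\n",
    "messy = pd.read_csv(io.StringIO(csv_text), na_values=[\"NA\", \"N/A\", \"?\", \"-\"])\n",
    "```\n",
    "After loading, `messy.isna()` confirms that every one of those markers has become NaN."
   ]
  },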
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### SAVING DATA\n",
    "If you want to save the contents of a data frame, pandas offers a similarly broad collection of methods. For example, if we take our simple grid data frame above, we can write it out in a number of ways.\n",
    "```\n",
    "df.to_csv(\"df_out.csv\", index=False)      #A\n",
    "```\n",
    "**#A Setting index to False means that the row indexes will not be written**\n",
    "\n",
    " which will write a file that looks like this:\n",
    " ```\n",
    "one,two,three\n",
    "1,2,3\n",
    "4,5,6\n",
    "7,8,9\n",
    "```\n",
    "Similarly, we can transform a data frame to a JSON object or write it to a file.\n",
    "```\n",
    "df.to_json()      #A\n",
    "'{\"one\":{\"0\":1,\"1\":4,\"2\":7},\"two\":{\"0\":2,\"1\":5,\"2\":8},\"three\":{\"0\":3,\"1\":6,\"2\":9}}'\n",
    "```\n",
    "**#A Supplying a file path as an argument will write the JSON to that file rather than returning it.**"
   ]
  },
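  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick check of the round trip, we can read the CSV file we just wrote back into a new data frame (assuming the df.to_csv(\"df_out.csv\", index=False) call above has been run):\n",
    "```\n",
    "df2 = pd.read_csv(\"df_out.csv\")\n",
    "print(df2)\n",
    "```\n",
    "The new frame should have the same column names and values as the original df."
   ]
  },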
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 24.4.2\tData Cleaning with a data frame\n",
    "Converting a particular set of values to NaN on load is a very simple bit of data cleaning that pandas makes trivial. Going beyond that, data frames support several operations that can make data cleaning less of a chore. To see how this works, let’s reopen the temperature CSV file we looked at above, but this time, instead of using the headers to name the columns, we can use the range() function with the names parameter to give them numbers, which will make referring to them easier. You also may recall from our previous example that the first field of every line, the “Notes” field, was empty, and was loaded with NaN values. While we could just ignore this column, it would be even easier if we didn’t have it. We can use the range() function again, this time starting from 1, to tell pandas to load all columns except the first one. But in fact, if we know that all of our values are from Illinois and we don’t care about the long-form date field, we can start from 4 and make things much more manageable.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "           4      5    6     7     8      9    10    11    12     13   14  \\\n",
      "0  1979/01/01  17.48  994   6.0  30.5   2.89  994 -13.6  15.8    NaN    0   \n",
      "1  1979/01/02   4.64  994  -6.4  15.8  -9.03  994 -23.6   6.6    NaN    0   \n",
      "2  1979/01/03  11.05  994  -0.7  24.7  -2.17  994 -18.3  12.9    NaN    0   \n",
      "3  1979/01/04   9.51  994   0.2  27.6  -0.43  994 -16.3  16.3    NaN    0   \n",
      "4  1979/05/15  68.42  994  61.0  75.1  51.30  994  43.3  57.0    NaN    0   \n",
      "5  1979/05/16  70.29  994  63.4  73.5  48.09  994  41.1  53.0    NaN    0   \n",
      "6  1979/05/17  75.34  994  64.0  80.5  50.84  994  44.3  55.7  82.60    2   \n",
      "7  1979/05/18  79.13  994  75.5  82.1  55.68  994  50.0  61.1  81.42  349   \n",
      "8  1979/05/19  74.94  994  66.9  83.1  58.59  994  50.9  63.2  82.87   78   \n",
      "\n",
      "     15    16      17  \n",
      "0   NaN   NaN   0.00%  \n",
      "1   NaN   NaN   0.00%  \n",
      "2   NaN   NaN   0.00%  \n",
      "3   NaN   NaN   0.00%  \n",
      "4   NaN   NaN   0.00%  \n",
      "5   NaN   NaN   0.00%  \n",
      "6  82.4  82.8   0.20%  \n",
      "7  80.2  83.4  35.11%  \n",
      "8  81.6  85.2   7.85%  \n"
     ]
    }
   ],
   "source": [
    "temp = pd.read_csv(\"temp_data_01.csv\", na_values=['Missing'], header=0, names=range(18), usecols=range(4,18))\n",
    "print(temp)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**#A Setting header=0 combined with the names parameter means the file’s header row is read but replaced by our column numbers.**\n",
    "\n",
    "Now we have a data frame that has only the columns we might want to work with, but we still have an issue: the last column, the percentage of coverage for the heat index, is a string ending with a percent sign rather than an actual number. This is apparent if we look at the first row’s value for column 17."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'0.00%'"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "temp[17][0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To fix this we need to do two things. First we need to remove the “%” from the end of the value. Then, we’ll need to cast the value from string to a number. Optionally, if we want to represent the resulting percentage as a fraction, we’d need to divide it by 100. The first bit is simple because pandas will let us use a single command to repeat an operation on a column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'0.00'"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "temp[17]=temp[17].str.strip(\"%\")\n",
    "temp[17][0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This takes the column and calls a string `strip()` operation on it to remove the trailing “%”. Now when we look at the first value in the column (or any of the other values) we can see that the offending percent sign is gone. It’s also worth noting that we could have used other operations, such as `replace(\"%\", \"\")`, to achieve the same result.\n",
    "\n",
    "The second operation is to convert the string to a numeric value. Again, pandas will let us perform this operation with one command.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
        "0.0"
       ]
      },
      "execution_count": 18,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "temp[17] = pd.to_numeric(temp[17])\n",
     "temp[17][0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The values in column 17 are now numeric, and if we want to we can use the `div()` method to finish the job of turning those values into fractions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    0.0000\n",
       "1    0.0000\n",
       "2    0.0000\n",
       "3    0.0000\n",
       "4    0.0000\n",
       "5    0.0000\n",
       "6    0.0020\n",
       "7    0.3511\n",
       "8    0.0785\n",
       "Name: 17, dtype: float64"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "temp[17] = temp[17].div(100)\n",
    "temp[17]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In fact, it would be possible to achieve the same result in a single line by chaining the three operations together.\n",
    "```\n",
    "temp[17] = pd.to_numeric(temp[17].str.strip(\"%\")).div(100)\n",
    "```\n",
    "This example is very simple, but it gives you an idea of the convenience that pandas can bring to cleaning your data. Pandas has a wide variety of operations for transforming data, as well as the ability to apply custom functions, so it would be hard to think of a data-cleaning scenario that pandas couldn’t streamline. While the number of options can be almost overwhelming, there are plenty of tutorials and videos available, and the documentation at pandas.pydata.org is excellent.\n",
    "\n",
    "### TRY THIS: CLEANING DATA WITH AND WITHOUT PANDAS\n",
    "Experiment with the operations mentioned above. Once the final column has been converted to a fraction, can you think of a way to convert it back to a string with the trailing percent sign?\n",
    "\n",
    "For contrast, load the same data into a plain Python list of lists using the csv module, and apply the same changes using just plain Python.\n"
   ]
  },
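  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a starting point for the plain-Python half of the exercise, a rough sketch might look something like the following. (The file name here is only a placeholder for whichever file you loaded earlier, and we assume the file has no header row.)\n",
    "```\n",
    "import csv\n",
    "\n",
    "# read the rows into a plain list of lists\n",
    "with open(\"your_file.csv\", newline=\"\") as f:\n",
    "    rows = list(csv.reader(f))\n",
    "\n",
    "# strip the trailing \"%\", convert to float, and divide by 100\n",
    "for row in rows:\n",
    "    row[17] = float(row[17].strip(\"%\")) / 100\n",
    "```\n",
    "Notice that the loop has to visit every row explicitly, where pandas expressed the same transformation as a single column operation."
   ]
  },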
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 24.5\tData aggregation and manipulation\n",
    "The examples above have probably given you some idea of the many options pandas gives you for doing fairly complex operations on your data with only a few commands. As you might expect this level of functionality is also available for aggregating data. In this section we’ll walk through some simple examples of aggregating data to illustrate some of the many possibilities. While there are many options, we’ll focus on merging data frames, simple data aggregation, and grouping and filtering.\n",
    "\n",
    "### 24.5.1\tMerging data frames\n",
    "Quite often in the course of handling data we need to relate two different data sets. Suppose that we have one file containing the number of sales calls made per month by members of a sales team, and in another file we have the dollar amounts of the sales from each of their territories. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   Team member  Territory  Month  Calls\n",
      "0        Jorge          3      1    107\n",
      "1        Jorge          3      2     88\n",
      "2        Jorge          3      3     84\n",
      "3        Jorge          3      4    113\n",
      "4          Ana          1      1     91\n",
      "5          Ana          1      2    129\n",
      "6          Ana          1      3     96\n",
      "7          Ana          1      4    128\n",
      "8          Ali          2      1    120\n",
      "9          Ali          2      2     85\n",
      "10         Ali          2      3     87\n",
      "11         Ali          2      4     87\n"
     ]
    }
   ],
   "source": [
    "calls = pd.read_csv(\"sales_calls.csv\")\n",
    "print(calls)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    Territory  Month  Amount\n",
      "0           1      1   54228\n",
      "1           1      2   61640\n",
      "2           1      3   43491\n",
      "3           1      4   52173\n",
      "4           2      1   36061\n",
      "5           2      2   44957\n",
      "6           2      3   35058\n",
      "7           2      4   33855\n",
      "8           3      1   50876\n",
      "9           3      2   57682\n",
      "10          3      3   53689\n",
      "11          3      4   49173\n"
     ]
    }
   ],
   "source": [
    "revenue = pd.read_csv(\"sales_revenue.csv\")\n",
    "print(revenue)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Clearly it would be very useful to be able to link revenue to team member activity. These two files are very simple, yet merging them with plain Python is not entirely trivial. Pandas, however, has a function to merge two data frames.\n",
    "\n",
    "The `merge()` function creates a new data frame by joining the two frames on the columns specified in the `on` parameter. It works similarly to a relational database join, giving us a table that combines the columns from the two files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   Team member  Territory  Month  Calls  Amount\n",
      "0        Jorge          3      1    107   50876\n",
      "1        Jorge          3      2     88   57682\n",
      "2        Jorge          3      3     84   53689\n",
      "3        Jorge          3      4    113   49173\n",
      "4          Ana          1      1     91   54228\n",
      "5          Ana          1      2    129   61640\n",
      "6          Ana          1      3     96   43491\n",
      "7          Ana          1      4    128   52173\n",
      "8          Ali          2      1    120   36061\n",
      "9          Ali          2      2     85   44957\n",
      "10         Ali          2      3     87   35058\n",
      "11         Ali          2      4     87   33855\n"
     ]
    }
   ],
   "source": [
    "calls_revenue = pd.merge(calls, revenue, on=['Territory', 'Month'])\n",
    "print(calls_revenue)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case we have a one-to-one correspondence between the rows in the two files, but the merge function can also do one-to-many and many-to-many joins, as well as right and left joins.\n",
    "\n",
    "### QUICK CHECK: MERGING DATA SETS\n",
    "How would you go about merging two data sets like the ones above in plain Python?\n"
   ]
  },
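  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One possible plain-Python approach, assuming each file has been read into a list of dictionaries (`calls_rows` and `revenue_rows` are hypothetical names), is to build a lookup table keyed on the join columns:\n",
    "```\n",
    "# map (territory, month) to the revenue amount\n",
    "amount_for = {(r[\"Territory\"], r[\"Month\"]): r[\"Amount\"] for r in revenue_rows}\n",
    "\n",
    "# attach the matching amount to a copy of each calls row\n",
    "merged = []\n",
    "for row in calls_rows:\n",
    "    combined = dict(row)\n",
    "    combined[\"Amount\"] = amount_for[(row[\"Territory\"], row[\"Month\"])]\n",
    "    merged.append(combined)\n",
    "```\n",
    "This is essentially what a one-to-one inner join does; handling missing keys, duplicates, and the other join types quickly gets more involved."
   ]
  },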
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 24.5.2\tSelecting data\n",
    "It can also be useful to select or filter the rows in a data frame based on some condition. In our example sales data we might want to look only at territory 3, say. This is also easy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  Team member  Territory  Month  Calls  Amount\n",
      "0       Jorge          3      1    107   50876\n",
      "1       Jorge          3      2     88   57682\n",
      "2       Jorge          3      3     84   53689\n",
      "3       Jorge          3      4    113   49173\n"
     ]
    }
   ],
   "source": [
    "print(calls_revenue[calls_revenue.Territory==3])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example we selected only the rows where the territory was equal to 3 by using exactly that expression, `calls_revenue.Territory==3`, as the index for the data frame. From the point of view of plain Python such a usage is nonsensical and illegal, but for a pandas data frame it works, and it makes for a much more concise expression. \n",
    "\n",
    "More complex expressions are also allowed, of course. If we wanted to select only rows where the amount per call was greater than 500, we could use that expression instead.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  Team member  Territory  Month  Calls  Amount\n",
      "1       Jorge          3      2     88   57682\n",
      "2       Jorge          3      3     84   53689\n",
      "4         Ana          1      1     91   54228\n",
      "9         Ali          2      2     85   44957\n"
     ]
    }
   ],
   "source": [
    "print(calls_revenue[calls_revenue.Amount/calls_revenue.Calls>500])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Even better, we could calculate and add that column to our data frame using a similar operation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   Team member  Territory  Month  Calls  Amount  Call_Amount\n",
      "0        Jorge          3      1    107   50876   475.476636\n",
      "1        Jorge          3      2     88   57682   655.477273\n",
      "2        Jorge          3      3     84   53689   639.154762\n",
      "3        Jorge          3      4    113   49173   435.159292\n",
      "4          Ana          1      1     91   54228   595.912088\n",
      "5          Ana          1      2    129   61640   477.829457\n",
      "6          Ana          1      3     96   43491   453.031250\n",
      "7          Ana          1      4    128   52173   407.601562\n",
      "8          Ali          2      1    120   36061   300.508333\n",
      "9          Ali          2      2     85   44957   528.905882\n",
      "10         Ali          2      3     87   35058   402.965517\n",
      "11         Ali          2      4     87   33855   389.137931\n"
     ]
    }
   ],
   "source": [
    "calls_revenue['Call_Amount'] = calls_revenue.Amount/calls_revenue.Calls\n",
    "print(calls_revenue)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, note how pandas’ built-in logic replaces a more cumbersome structure in plain Python.\n",
    "\n",
    "### QUICK CHECK: SELECTING IN PYTHON\n",
    "What Python code structure would you use to select only rows meeting certain conditions? "
   ]
  },
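  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In plain Python, a list comprehension with an `if` clause is the natural structure. For example, assuming the merged rows are a list of dictionaries (here called `rows`), the two selections above might be sketched as:\n",
    "```\n",
    "territory_3 = [row for row in rows if row[\"Territory\"] == 3]\n",
    "over_500 = [row for row in rows if row[\"Amount\"] / row[\"Calls\"] > 500]\n",
    "```\n"
   ]
  },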
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 24.5.3\tGrouping and aggregation\n",
    "As you might expect, pandas also has plenty of tools for summarizing and aggregating data. In particular, getting the sum, mean, median, minimum, and maximum values from a column uses clearly named column methods."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1215\n",
      "101.25\n",
      "93.5\n",
      "129\n",
      "84\n"
     ]
    }
   ],
   "source": [
    "print(calls_revenue.Calls.sum())\n",
    "print(calls_revenue.Calls.mean())\n",
    "print(calls_revenue.Calls.median())\n",
    "print(calls_revenue.Calls.max())\n",
    "print(calls_revenue.Calls.min())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If, for example, we want to get all of the rows where the amount per call was above the median, we can combine this trick with the selection operation above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "464.2539427570093\n",
      "  Team member  Territory  Month  Calls  Amount  Call_Amount\n",
      "0       Jorge          3      1    107   50876   475.476636\n",
      "1       Jorge          3      2     88   57682   655.477273\n",
      "2       Jorge          3      3     84   53689   639.154762\n",
      "4         Ana          1      1     91   54228   595.912088\n",
      "5         Ana          1      2    129   61640   477.829457\n",
      "9         Ali          2      2     85   44957   528.905882\n"
     ]
    }
   ],
   "source": [
    "print(calls_revenue.Call_Amount.median())\n",
    "print(calls_revenue[calls_revenue.Call_Amount >= calls_revenue.Call_Amount.median()])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In addition to being able to pick out summary values, it’s often useful to group the data based on other columns. In our simple example, we might want to know the total calls and amounts by month, or by territory. In those cases, we just pass those fields to the data frame’s `groupby()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       Calls  Amount\n",
      "Month               \n",
      "1        318  141165\n",
      "2        302  164279\n",
      "3        267  132238\n",
      "4        328  135201\n"
     ]
    }
   ],
   "source": [
    "print(calls_revenue[['Month', 'Calls', 'Amount']].groupby(['Month']).sum())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "           Calls  Amount\n",
      "Territory               \n",
      "1            444  211532\n",
      "2            379  149931\n",
      "3            392  211420\n"
     ]
    }
   ],
   "source": [
    "print(calls_revenue[['Territory', 'Calls', 'Amount']].groupby(['Territory']).sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In each case, we select the columns we want to aggregate, group them by the values in one of those columns, and (in this case) sum the values for each group. We could also have used any of the other aggregation methods mentioned above.\n",
    "\n",
    "Again, all of these examples are simple, but they illustrate a few of the options you have for manipulating and selecting data with pandas. If these ideas resonate with your needs, you can learn more by studying the pandas documentation at pandas.pydata.org.\n",
    "\n",
    "### TRY THIS: GROUPING AND AGGREGATING\n",
    "Experiment with pandas and the data above. Can you get the calls and amounts by both team member and month?\n"
   ]
  },
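  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a hint, `groupby()` accepts a list of columns, so grouping on more than one field works the same way as the examples above. One possible line would be:\n",
    "```\n",
    "calls_revenue[['Team member', 'Month', 'Calls', 'Amount']].groupby(['Team member', 'Month']).sum()\n",
    "```\n"
   ]
  },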
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 24.6\tPlotting data\n",
    "Another very attractive feature of pandas is the ability to plot the data in a data frame very easily. While there are many options for plotting data in Python and Jupyter notebooks, pandas can use matplotlib directly from a data frame. You may recall that when we started our Jupyter session, one of the first commands we gave was the Jupyter magic command to enable matplotlib for inline plotting.\n",
    "```\n",
    "%matplotlib inline\n",
    "```\n",
    "Since we have the ability to plot, let’s see how we might plot some data. To continue with our sales example from above, if we want to plot the quarter’s total sales by territory, we can get a graph right in our notebook just by adding `.plot.bar()`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0x7faaa5ca5da0>"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAEGCAYAAABrQF4qAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4wLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvpW3flQAAEAhJREFUeJzt3X9sXeV9gPHnG2IwEz/SBhOqmNWw\nBvGjgQS8NAyta0mrAkUEUIsobEQILUJiCtkmbSxaVdBgAmkatNJWGg3WlNEBo4uStWgdkDA2raU4\nJYTQrCNFRHHWJE4KtBkNIfS7P/wamZDga/vaN379fCTL57zn3HNfx8qTm3PPvTcyE0lSvaa0egKS\npLFl6CWpcoZekipn6CWpcoZekipn6CWpcoZekipn6CWpcoZekio3tdUTADjhhBOyq6ur1dOQpAll\n3bp1uzKzY6j9DovQd3V10dPT0+ppSNKEEhFbGtnPUzeSVDlDL0mVM/SSVLnD4hy9JA3XW2+9RW9v\nL3v37m31VMZce3s7nZ2dtLW1jej2hl7ShNTb28uxxx5LV1cXEdHq6YyZzGT37t309vZyyimnjOgY\nnrqRNCHt3buX6dOnVx15gIhg+vTpo/qfi6GXNGHVHvkBo/05Db0kVW5SnqPvuuU7rZ7CmHrlzs+2\negrSuGv23+tG/h5t376dpUuX8uyzzzJt2jRmzJjBPffcw2mnnXbQ/Y855hj27NnDK6+8wqWXXsrG\njRubOudDmZShl6TRykyuuOIKFi1axEMPPQTA888/z44dOw4Z+lbx1I0kjcDatWtpa2vjxhtvfGfs\nnHPOYe7cuSxYsIBzzz2X2bNns2rVqvc9zosvvsi8efOYM2cOZ599Ni+99FLT5+ojekkagY0bN3Le\neee9Z7y9vZ2VK1dy3HHHsWvXLubPn89ll112yCdU7733Xm6++WauvfZa9u3bx9tvv930uRp6SWqi\nzGTZsmU8/fTTTJkyhW3btrFjxw5OOumkg+5//vnnc8cdd9Db28uVV17JrFmzmj4nT91I0gicddZZ\nrFu37j3jDz74IH19faxbt47169czY8aM970G/pprrmH16tUcffTRXHLJJaxZs6bpczX0kjQCF154\nIW+++SbLly9/Z2zDhg1s2bKFE088kba2NtauXcuWLe//TsIvv/wyp556KkuWLGHhwoVs2LCh6XP1\n1I2kKoz3ZcURwcqVK1m6dCl33XUX7e3tdHV1ceutt7JkyRJmz55Nd3c3p59++vse55FHHuGBBx6g\nra2Nk046iWXLljV/rpnZ9IMOV3d3d47nB494Hb008W3atIkzzjij1dMYNwf7eSNiXWZ2D3VbT91I\nUuUMvSRVztBLmrAOh1PP42G0P6ehlzQhtbe3s3v37upjP/B+9O3t7SM+hlfdSJqQOjs76e3tpa+v\nr9VTGXMDnzA1UoZe0oTU1tY24k9cmmw8dSNJlTP0klQ5Qy9JlWs49BFxREQ8FxHfLuunRMQzEbE5\nIh6OiCPL+FFlfXPZ3jU2U5ckNWI4j+hvBjYNWr8LuDszPwK8CtxQxm8AXi3jd5f9JEkt0lDoI6IT\n+Czwd2U9gAuBR8suK4DLy/LCsk7ZviAmy0e1S9JhqNFH9PcAfwL8qqxPB17LzP1lvReYWZZnAlsB\nyvbXy/7vEhGLI6InInomw3WwktQqQ4Y+Ii4Fdmbme99hfxQyc3lmdmdmd0dHRzMPLUkapJEXTF0A\nXBYRlwDtwHHAl4FpETG1PGrvBLaV/bcBJwO9ETEVOB7Y3fSZS5IaMuQj+sz8s8zszMwu4GpgTWZe\nC6wFPld2WwQMfNT56rJO2b4ma38zCkk6jI3mLRD+FHgoIm4HngPuK+P3AQ9ExGbgZ/T/4yBJfuhP\niwwr9Jn5FPBUWX4ZmHeQffYCn2/C3C
RJTeArYyWpcoZekipn6CWpcoZekirnB49owqn5yo3D9aoN\nTWw+opekyhl6SaqcoZekyhl6SaqcoZekyhl6SaqcoZekyhl6SaqcoZekyhl6SaqcoZekyhl6Saqc\noZekyhl6SaqcoZekyhl6SaqcoZekyhl6SaqcoZekyhl6SaqcoZekyhl6SaqcoZekyhl6SaqcoZek\nyhl6SaqcoZekyhl6SaqcoZekyhl6SaqcoZekyhl6SarckKGPiPaI+EFEPB8RL0bEbWX8lIh4JiI2\nR8TDEXFkGT+qrG8u27vG9keQJL2fRh7RvwlcmJnnAHOAiyJiPnAXcHdmfgR4Fbih7H8D8GoZv7vs\nJ0lqkSFDn/32lNW28pXAhcCjZXwFcHlZXljWKdsXREQ0bcaSpGFp6Bx9RBwREeuBncDjwE+A1zJz\nf9mlF5hZlmcCWwHK9teB6c2ctCSpcQ2FPjPfzsw5QCcwDzh9tHccEYsjoicievr6+kZ7OEnSIQzr\nqpvMfA1YC5wPTIuIqWVTJ7CtLG8DTgYo248Hdh/kWMszszszuzs6OkY4fUnSUBq56qYjIqaV5aOB\nTwOb6A/+58pui4BVZXl1WadsX5OZ2cxJS5IaN3XoXfgQsCIijqD/H4ZHMvPbEfEj4KGIuB14Driv\n7H8f8EBEbAZ+Blw9BvOWJDVoyNBn5gZg7kHGX6b/fP2B43uBzzdldpKkUfOVsZJUOUMvSZUz9JJU\nOUMvSZUz9JJUOUMvSZUz9JJUOUMvSZUz9JJUOUMvSZUz9JJUOUMvSZUz9JJUOUMvSZUz9JJUOUMv\nSZUz9JJUOUMvSZUz9JJUOUMvSZUz9JJUOUMvSZUz9JJUOUMvSZUz9JJUOUMvSZUz9JJUOUMvSZUz\n9JJUOUMvSZUz9JJUOUMvSZUz9JJUOUMvSZUz9JJUOUMvSZUz9JJUOUMvSZUbMvQRcXJErI2IH0XE\nixFxcxn/YEQ8HhEvle8fKOMREV+JiM0RsSEizh3rH0KSdGiNPKLfD/xxZp4JzAduiogzgVuAJzNz\nFvBkWQe4GJhVvhYDX236rCVJDRsy9Jn508z8YVn+BbAJmAksBFaU3VYAl5flhcA3st/3gWkR8aGm\nz1yS1JBhnaOPiC5gLvAMMCMzf1o2bQdmlOWZwNZBN+stYwcea3FE9ERET19f3zCnLUlqVMOhj4hj\ngG8BSzPz54O3ZWYCOZw7zszlmdmdmd0dHR3DuakkaRgaCn1EtNEf+Qcz85/L8I6BUzLl+84yvg04\nedDNO8uYJKkFGrnqJoD7gE2Z+deDNq0GFpXlRcCqQePXlatv5gOvDzrFI0kaZ1Mb2OcC4PeAFyJi\nfRlbBtwJPBIRNwBbgKvKtseAS4DNwBvA9U2dsSRpWIYMfWb+JxCH2LzgIPsncNMo5yVJahJfGStJ\nlTP0klQ5Qy9JlTP0klQ5Qy9JlTP0klQ5Qy9JlTP0klQ5Qy9JlTP0klQ5Qy9JlTP0klQ5Qy9JlTP0\nklQ5Qy9JlTP0klQ5Qy9JlTP0klQ5Qy9JlTP0klQ5Qy9JlTP0klQ5Qy9JlTP0klQ5Qy9JlTP0klQ5\nQy9JlTP0klQ5Qy9JlTP0klQ5Qy9JlTP0klQ5Qy9JlTP0klQ5Qy9JlTP0klQ5Qy9JlRsy9BFxf0Ts\njIiNg8Y+GBGPR8RL5fsHynhExFciYnNEbIiIc8dy8pKkoTXyiP7rwEUHjN0CPJmZs4AnyzrAxcCs\n8rUY+GpzpilJGqkhQ5+ZTwM/O2B4IbCiLK8ALh80/o3s931gWkR8qFmTlSQN30jP0c/IzJ+W5e3A\njLI8E9g6aL/eMiZJapFRPxmbmQnkcG8XEYsjoicievr6+kY7DUnSIYw09DsGTsmU7zvL+Dbg5EH7\ndZax98jM5ZnZnZndHR0dI5yGJGkoIw39amBRWV4ErBo0fl25+mY+8PqgUzySpBaYOtQOEfGPwCeA\nEy
KiF/gScCfwSETcAGwBriq7PwZcAmwG3gCuH4M5S5KGYcjQZ+YXDrFpwUH2TeCm0U5KktQ8vjJW\nkipn6CWpcoZekipn6CWpcoZekipn6CWpcoZekipn6CWpcoZekipn6CWpcoZekipn6CWpcoZekipn\n6CWpcoZekipn6CWpcoZekipn6CWpcoZekipn6CWpcoZekipn6CWpcoZekipn6CWpcoZekipn6CWp\ncoZekipn6CWpcoZekipn6CWpcoZekipn6CWpcoZekipn6CWpcoZekipn6CWpcoZekipn6CWpcmMS\n+oi4KCJ+HBGbI+KWsbgPSVJjmh76iDgC+BvgYuBM4AsRcWaz70eS1JixeEQ/D9icmS9n5j7gIWDh\nGNyPJKkBU8fgmDOBrYPWe4GPHbhTRCwGFpfVPRHx4zGYy+HiBGDXeN1Z3DVe9zQp+Lub2Gr//X24\nkZ3GIvQNyczlwPJW3f94ioiezOxu9Tw0fP7uJjZ/f/3G4tTNNuDkQeudZUyS1AJjEfpngVkRcUpE\nHAlcDaweg/uRJDWg6aduMnN/RPwB8F3gCOD+zHyx2fczwUyKU1SV8nc3sfn7AyIzWz0HSdIY8pWx\nklQ5Qy9JlTP0klS5ll1HL0nNFhHzgMzMZ8tbr1wE/HdmPtbiqbWUT8ZKg0TE6fS/uvuZzNwzaPyi\nzPzX1s1MQ4mIL9H/HltTgcfpf0X+WuDTwHcz844WTq+lDP04iojrM/PvWz0PHVxELAFuAjYBc4Cb\nM3NV2fbDzDy3lfPT+4uIF+j/vR0FbAc6M/PnEXE0/f9wn93SCbaQp27G122AoT98/T5wXmbuiYgu\n4NGI6MrMLwPR0pmpEfsz823gjYj4SWb+HCAzfxkRv2rx3FrK0DdZRGw41CZgxnjORcM2ZeB0TWa+\nEhGfoD/2H8bQTwT7IuLXMvMN4LyBwYg4HjD0aqoZwGeAVw8YD+C/xn86GoYdETEnM9cDlEf2lwL3\nA7NbOzU14OOZ+SZAZg4OexuwqDVTOjwY+ub7NnDMQCwGi4inxn86GobrgP2DBzJzP3BdRHytNVNS\nowYif5DxXYzjWxUfjnwyVpIq5wumJKlyhl6SKmfoVY2ImB4R68vX9ojYNmj9yBEe846I+GRZ/qOI\naG/urKWx5zl6VSkibgX2ZOZfDeM2R5TrsAfWp5YnYwfWe4GPZuZrIz2m1Ao+otekEBGLIuIH5dH9\n30bElIiYGhGvRcQ95fUP8yKiNyLujIjngCsi4h8i4vKI+EPgROA/IuKJcszfjYgXImJjRPxlGTvw\nmH8eEY8OmsfFEfFPLfgj0CRm6FW9iPgocAXwW5k5h/7Liq8um48Hns7MszPze2VsZ2bOzcx3gpyZ\ndwM7gd/OzE9FRCdwO/BJYC5wQbnm/l3HBP4CODsippdt19N/Xb40bgy9JoNPAb8J9ETEeuB3gN8o\n2/YBKw/Y/+EGjvkxYE1m7srMt4BvAh8/8JjlhTsPAtdExAfpf8Xmv43iZ5GGzRdMaTII+j+7+Ivv\nGoyYCvwy3/tE1f+N8v4OPOb9wLfK8sOes9d48xG9JoMngKsi4gR45+qcXx/BcX4BHFuWnwE+WY41\ncCro3w92o8zcSv8rM28Bvj6C+5VGxUf0ql5mvhARtwFPRMQU4C3gRuB/h3mo5eUYW8t5+i8CT9H/\nP4Z/yczvlOgfzDeB4zLzf0b2U0gj5+WV0jiIiHuB72XmilbPRZOPoZfGWHkC+FXgM5m5r9Xz0eRj\n6CWpcj4ZK0mVM/SSVDlDL0mVM/SSVDlDL0mV+392RSHJcm35ZwAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x7faaa5fa6748>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "calls_revenue[['Territory', 'Calls']].groupby(['Territory']).sum().plot.bar()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Other options are available. `plot()` alone or `.plot.line()` will create a line graph, `.plot.pie()` a pie chart, and so on. \n",
    "\n",
    "Thanks to the combination of pandas and matplotlib, plotting data in a Jupyter notebook is quite easy. Of course, it should also be noted that while such plotting is easy, there are plenty of refinements that it doesn’t handle particularly well. \n",
    "\n",
    "### TRY THIS: PLOTTING\n",
    "Plot a line graph of the monthly average amount per call. "
   ]
  },
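  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a hint, this combines the grouping from above with the `Call_Amount` column we added earlier. One possible version would be:\n",
    "```\n",
    "calls_revenue[['Month', 'Call_Amount']].groupby(['Month']).mean().plot.line()\n",
    "```\n"
   ]
  },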
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 24.7\tWhy you might not want to use pandas\n",
    "The examples above illustrate only a tiny fraction of the tools pandas can offer you in cleaning, exploring, and manipulating data. As mentioned at the beginning of this chapter, pandas is an excellent tool set that excels at what it was designed to do. That does not mean, however, that pandas is the tool for all situations or for all people. \n",
    "\n",
    "There are reasons why you might choose not to use pandas and elect to use plain old Python (or some other tool) instead. For one thing, as mentioned above, learning to fully use pandas is in some ways like learning another language, and that may not be something you have the time or inclination for. Pandas also may not be the best choice in production situations, particularly with very large data sets that don’t require much in the way of math operations, or in situations where the data isn’t easy to put into the formats that work best with pandas. Munging large collections of product information, for example, probably wouldn’t benefit much from pandas, nor would basic processing of a stream of transactions. \n",
    "\n",
    "The point here is that you should choose your tools thoughtfully, based on the problems at hand. In many cases pandas will truly make your life easier as you work with data, but in other cases plain old Python may be your best bet.\n",
    "\n",
    "## 24.8\tSummary\n",
    "* Python offers many benefits for data handling, including the ability to handle very large data sets, and the flexibility to handle data in ways that match your needs.\n",
    "* Jupyter notebook is a useful way to access Python via a web browser, which also makes improved presentation easier.\n",
    "* Pandas is a tool that makes many common operations in handling data much easier, including cleaning, combining, and summarizing data.\n",
    "* Pandas also makes simple plotting much easier. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 case study",
   "language": "python",
   "name": "python3_cs"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3+"
  },
  "toc": {
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "toc_cell": false,
   "toc_position": {
    "height": "707px",
    "left": "0px",
    "right": "1123px",
    "top": "110px",
    "width": "212px"
   },
   "toc_section_display": "block",
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}