Wednesday 1 January 2014

Reading in a CSV File

So before I get on with the code and discussion for this post, let me tell you a little bit about the setup I develop in at home and for work.
I use a 15" Macbook Pro with 4GB of RAM (I really should upgrade this), with a 2.2 GHz Intel i7 processor (4 cores) and an Intel HD Graphics 3000 GPU running OS X Mavericks (Late 2011 model). I have also boot camped my Macbook Pro so that I have what I call my "dirty side" of the Mac, which is running Windows 7.
The "dirty side" I use for XNA and C# (Visual Studio 2010), and now also Game Maker Studio. While the Mac side is used for everything else, so Python, Objective C, and when I get a chance to look at it Unity.
For Python I either use my Mac or I use one of my many (5 at the last count) Raspberry Pi's (You will in future posts get to see code developed on the Pi specifically - well you have to, to use the GPIO etc).
The majority of the Python code that will appear on this blog will be 2.7. The main reason being this is the version that is used by Codecademy at the moment, and my students are using this to learn the basics of Python. So until Codecademy offers Python 3.x then that's the version I will stick with. Don't want to confuse the students too much. They are having to learn Python and C# already.

Right on to the main point of todays post, which is a follow on from yesterday where I talked about using Open Data.
If you remember I mentioned yesterday one source of Open Data being the MP Expenses data that is published. Today I will use that data for showing how to read in a csv file using Python.

I downloaded the csv files for 2010,2011,2012 and 2013. Which gave me four csv files that need reading in.
In a text editor I removed the first line of each file, it's the header giving the name of each column in the csv file. This was to make the next step easier. NOTE I made a copy of these headings to use later.
The next step I did was combine these into a single file to read in. I did that with the following command line:

cat mpexp2010.csv mpexp2011.csv mpexp2012.csv mpexp2013.csv > mpexpall.csv

Remember I develop on a Mac so this should work for Linux based systems as well (that includes Pi's), sorry Windows users you are on your own here. You will also note my filenames are different to the ones downloaded. I renamed the files for easier typing on the command line, and to make it more obvious what the data was.
To recap where we are, we have a single csv file of all the MP Expense data, and a copy of the column headings that looks something like the following:

Year, Date, Claim No., MP's Name, MP's Constituency, Category, Expense Type, Short Description, Details, Journey Type, From, To, Travel, Nights, Mileage, Amount Claimed, Amount Paid, Amount Not Paid, Amount Repaid, Status, Reason If Not Paid

The headings need cleaning up for our use, there are extra spaces in there and punctuation. So I edited the headings to look like the following:

Year,Date,ClaimNo,MPName,Constituency,Category,ExpenseType,ShortDescription,Details,JourneyType,From,To,Travel,Nights,Mileage,AmountClaimed,AmountPaid,AmountNotPaid,AmountRepaid,Status,ReasonNotPaid

These edited headings I then put back into the single csv at the start. I didn't have to, we could read in the csv file without them.

So now we need to read in our csv file. Below is the commented Python code to do this.
So that bit of Python code has read in our csv file of MP Expenses, and placed it into a list of dictionaries for us.

Lets do something "interesting" with this data now and calculate the average or arithmetic mean of the rent paid by MP's in 2012/13. Add the following code to the end of the code above that reads in the csv file.
So where next? Well there are libraries like numpy, scipy and matplotlib that can be used to help visualise the data (see further reading section below for links to books and a useful web page).
The above could of also been done in Excel, as could any further manipulation and visualisation of the data.
In fact which ever way chosen getting the students to present their visualisations as infographics would be a great way for them to present their work.

Further reading:
You can read more in the official documentation on csv files here.
2. Numerical and Scientific Python and Data Visualisation
Python Data Visualization Cookbook
Python for Data Analysis

No comments:

Post a Comment