Tuesday, April 28, 2015

Staring at Goats IV: Movement

This is part 4 of a series of posts about goat breeding data.
[part 1: introduction] [part 2: Gender] [part 3: Fertility]

Movement

Goats are born at a breeder's site, but often move to another owner. Both breeders and owners are identified as a "stable". So we can examine how goats move between breeders and owners.

Recall from part 1 our graph model (Geit=goat, Stal= stable, Eigenaar=owner, Fokker=breeder):
For each goat, we thus know the breeder and the owner through relationships between goats and stables.

The database (or export thereof) does not keep track of intermediate owners, only current owners. So we have information about the breeder, and the current owner only.

There are four special "owners" with a special identifier:
NL001: Sold to an unregistered owner
NL002: Sold to a trader
NL003: Sold to the butcher
NL004: Dead


A simple cypher query shows how many goats are owned by one of the special owner IDs:
match (g:GEIT)-[:OWNER]->(stal:STAL{nr:"NL001"}) return count(g)
And similarly for the codes NL002, NL003, and NL004 This results in:
NL001: 2152 goats sold to an unregistered owner
NL002: 952 goats sold to a trader 
NL003: 1776 goats sold to the butcher (1232 bucks...)
NL004: 2929 goats dead

We exclude the special NL codes for further analysis, because they dominate the results. After all, eventually a goat dies and will get one of the NL codes.
So let's see which ten breeders and owners exchange most goats:
match (s1:STAL)<-[:FOKKER]-(g:GEIT)-[:OWNER]->(s2:STAL) where s2.naam <> s1.naam and NOT s2.nr IN ["NL001","NL002","NL003","NL004","NL005"] and s1.nr<>"ONB1" and s2.nr<>"ONB1" return s1.nr,s1.naam,count(g),s2.nr,s2.naam order by count(g) descending limit 10

So, breeder with number NB093 has 22 goats delivered to current owner LI013.

So far, no rocket science. Let's now see how goats move about between the Dutch provinces. Stable identifiers start with two letters which indicate the province: DR=Drente, FL=Flevoland, FR=Friesland, GE=Gelderland, GR=Groningen, LI=Limburg, NB=Noord-Brabant, NH=Noord-Holland, OV=Overijsel, UT=Utrecht, ZE=Zeeland, ZH=Zuid-Holland. Then there is IN which stands for international, i.e., outside The Netherlands.

We now get a complex cypher query, where we have to do some pattern matching using regular expressions and we have to group the results into provinces. Here it is:
match (s1:STAL)<-[:FOKKER]-(g:GEIT)-[:OWNER]->(s2:STAL) where s1.nr=~"[A-Z]{2}[0-9]{3}" and s2.nr=~"[A-Z]{2}[0-9]{3}" with s1, g, s2, SUBSTRING(s1.nr,0,2) AS p1, SUBSTRING(s2.nr,0,2) AS p2 return p1,p2,count(g) order by p1
Note how we use the regular expression in the where clause, and how we pass the results on to group them by province using the WITH keyword. The result is a table for which the first lines are:

So, 28 goats where bred in DR and their current owner is in OV, etc. We now put this into a matrix form:
The left column identifies the breeder's province, whereas the top row identifies the current owner's province. We deliberately only included goats in the cypher query where the breeder is not the same as the current owner. So the numbers identify goats that actually moved about.

We can turn this into a neat visual using the D3.js library.


This visual is called a chord diagram. The colour coding of the connections is such that it has the colour of the breeder's province, and points to the current owner's province. The thicker the source, the larger the number of goats.

For all provinces, most goats stay within the province. Next are the neighbouring provinces. Although it is interesting to see that no goat was moved from Zuid-Holland to Noord-Holland. (At least of the current owners, thus live goats.) Movements across large distances are uncommon. And international movements are also rare (due to legislation, of course). Just a few goats have been moved from other countries back into The Netherlands.

The D3.js library allows interaction as well, but unfortunately I can't show that on this page, because this blog site does not allow such Javascript on the page. Therefore, I created a separate page where you can play around with it, and you can view the source code of that page to see how it actually works: http://mekkerwei.be/data/goatMovement2.html

Saturday, April 25, 2015

Staring at Goats III: Fertility

This is part 3 of a series of posts about goat breeding data.
[part 1] [ part 2]

Fertility

A few bucks actually account for a large number of offspring. Who are the champions? A simple cypher query shows the list of goats with the most children in descending order.
match (g:GEIT)-[]->(g2:GEIT)-[]->(jaar:JAAR) return g2.naam,jaar.jaar,count(g) order by count(g) descending limit 100

Wow, 98 children!

Shown in a graph, this looks like this:


The couple with the most grandchildren

Which couple has the most grandchildren? Whereas the previous query is actually easy using SQL in a relational database, this one is a bit trickier in a relational database.

In a relational database, you probably have a goat table somewhere, with goat IDs and names, plus foreign keys to their parents, who can be found in the same table. Like the example given here:
Now, if Martha also has a number of children, knowing how many grandchildren a specific couple, like Joke and Willem, has, is not a simple query in SQL.

Let's have a look at the graph equivalent.
The grandparents are Joke and Willem. They have a number of children; Martha is not their only child. So, the two children of Martha, Swieber and Saar, are only two of the total number of grandchildren of Joke and Willem.

Let's look at the top ten couples and how many children they put forth. In cypher:
match (m:GEIT)<-[:MOEDER]-(g:GEIT)-[:VADER]->(v:GEIT) return m.naam,v.naam,count(g) order by count(g) descending limit 10

Borus is a goat from the early eighties, when there were not so many of the breed yet, so goats had to give birth to more offspring than is currently necessary. Naamloos means "no name".

Now we extend this to include grandchildren. In cypher:
match (m:GEIT)<-[:MOEDER]-(g:GEIT)-[:VADER]->(v:GEIT),(g:GEIT)<-[:MOEDER|:VADER]-(l:GEIT) return m.naam,v.naam,count(l) order by count(l) descending limit 10

The record couple thus has 105 grandchildren. And what a simple and fast query this is!

Interestingly, the third couple produced 98 grandchildren, all due to their only child Humberto, which is our champ buck from above. (And indeed, Humberto has no brothers nor sisters.)

Breeder productivity

Two simple questions: 
  1. Who are the most successful breeders?
  2. Who are the most productive breeders?
There is no clear definition of a successful breeder, but let's define it as the breeder of those goats who together produced the most children. In cypher:
match (g:GEIT)-[]->(g2:GEIT)-[:FOKKER]->(stal:STAL) return stal.naam,count(g) order by count(g) descending limit 100
(Note that this query is not trivial in SQL for a relational database as it contains some joins.)


The most successful breeder has no name. That's because these are unknown breeders, and are not necessarily identical breeders, so we have to ignore the first line. The winner has produced goats with in total 1201 children.

The second question is answered by examining at which breeder the most goats have been born. In cypher:
match (stal:STAL)<-[:FOKKER]-(g:GEIT) return stal.naam,count(g) order by count(g) descending limit 100

Again, the first row should be ignored, as these are the unknown breeders. So, the winner has 434 goats born on their farm, over all the years.

Conclusion

The graph model is a very easy model to quickly examine a data set. And some queries are actually much easier in cypher than in SQL, which means you can analyse the data set in quite a unique way, which brings you some new insights into the data.

Thursday, April 16, 2015

Staring at Goats II: Gender distribution

This is part 2 of a series of posts about goat breeding data.
[part 1]

In a previous post ("Staring at Goats") I showed how you can build and query a graph database of goats, using the Neo4J graph database and the cypher query language. It is now time to explore the data in more detail.

The extended model

We will now extend this model with gender data and the year in which the goats are born. Here is the extended model. (Lesson 2 in the Dutch language: Geslacht = Gender, Jaar = Year; see the previous post for lesson 1.)

We used the same technique loading these data from csv file as described in the previous post.

Just to check that the model works, let's find my own goats again and see with what nodes they are linked. We now see the gender and year added.



We can now ask questions like how many bucks and goats are born in a specific year? The cypher query for this is as follows. In this example, we choose 2015 as the year.
match (geit:GESLACHT)<-[:GESLACHT]-(g:GEIT)-[:JAAR]->(jaar:JAAR {jaar:15}) return geit.geslacht,count(g)
The outcome of this query is 178 bucks and 156 goats.

We can do this for each individual year, but we can grab this result for all years in one go using this cypher query:
match (:GESLACHT{geslacht:"Geit"})<-[:GESLACHT]-(geit:GEIT)-[:JAAR]->(jaar:JAAR)<-[:JAAR]-(bok:GEIT)-[:GESLACHT]->(:GESLACHT{geslacht:"Bok"}) return jaar.jaar,count(distinct(bok)),count(distinct(geit)) order by jaar.jaar descending
We now get a list with counted bucks and goats:

Cool! There is an interesting difference in the number of bucks and goats being born. Where we expect a 50/50 distribution, there are clearly more goats than bucks being born.

Hmm. This is probably a systematic effect: probably not all bucks are entered into the database! This might influence statistical data that is gender-dependent.

Gender ratios

Let's plot the percentages bucks and goats since 1975. For visualisation let's use D3.js (see https://github.com/mbostock/d3/wiki)

Our first plot simply shows the percentage of bucks versus goats being born, per year, since 1975.


Interestingly, in 2009, the percentage of bucks entered into the database went up from around 30% to about 50%. Since 2009, the gender statistics make sense. Before that, bucks may have ended up at the butcher's before ever being recorded. In the 1970's we talk about small numbers, so there is a larger variation in distribution.

If we plot the total number of goats born per year, we see how there is a steady rise from small numbers in 1975, to record 1200 goats in 2009. From then on, a decreasing trend in the number of bucks and goats being born is observed. That's less good news for the total population.


Read part 3: fertility

Thursday, April 09, 2015

Staring at Goats

Introduction

As a hobby breeder of goats I thought it might be interesting to do some data research on the database of the specific goat breed. In my case the goat breed is the rare Dutch Landrace goat, in Dutch the Nederlandse Landgeit. (http://nl.wikipedia.org/wiki/Landgeit)

The current population is descendent from 8 remaining goats in 1971. So there is a fairly well documented history of the current breed. All goats are registered when they are born, and subsequent important events in their life are also registered.

Example data that is kept with each goat includes name, registration number, date of birth, date of decease, breeder, owner, mother goat, father goat, colour, etc.

In contrast with humans, breeding of animals is not a tree structure, but a graph. So I decided to see what we can learn from the goat breeding data, if we model it as a graph.

The graph model

Let's start with a simple model. A goat has a father and a mother, is bred by a breeder, and owned by an owner. A goat has a certain colour.

Now let's understand a few Dutch words. Goat = Geit; Stable = Stal; Colour = Kleur; Father = Vader; Mother = Moeder; Owner = Eigenaar; Breeder = Fokker

Now you can fully understand the little graph model.

Of course, the nodes and relationships have properties as well, such as the registration number and name of the goat or stable. Therefore, we need to pick a graph database supporting this. It is not a surprise that I use the Neo4J graph database to support the model.

Neo4J

Installing Neo4J is straightforward, and loading it with the data is now simpler than ever with an excellent import tool that can handle csv files. (http://neo4j.com/blog/importing-data-neo4j-via-csv/) Exporting the goat database yields a csv file. Example snippet:


There are about 15,000 lines in the csv file. That's all goats since the early seventies until 31/03/2015, when I took the export.

I simply use the out-of-the-box Neo4J cypher console to load, query, and view the data.

Within 4 minutes, all goats are loaded into Neo4J using a simple Cypher query:
USING PERIODIC COMMIT 500LOAD CSV WITH HEADERS FROM "file:///landgeiten.csv" AS csvLine FIELDTERMINATOR ';'MERGE (geit:GEIT { nr: csvLine.RegistrationNumber, naam: csvLine.Name, geboren: csvLine.Born, overleden: csvLine.Deceased});
Once we have all goats, we do the same with the stables and colours. Then we have to define all relationships. As an example, for all father and mother relationships, this takes about 20 minutes with the following cypher query:
USING PERIODIC COMMIT 500LOAD CSV WITH HEADERS FROM "file:///landgeiten.csv" AS csvLine FIELDTERMINATOR ';'MATCH (geit:GEIT {nr: csvLine.RegistrationNumber }), (moeder:GEIT {nr: csvLine.MotherRegistrationNumber}), (vader:GEIT {nr: csvLine.FatherRegistrationNumber}) CREATE (geit) - [:VADER] -> (vader), (geit) - [:MOEDER] -> (moeder) ;
Next we do the same with the other relationships in our model.

Now let's see if I can find my own goats and their relationships. For this I have to search my own stable:
MATCH (stal:STAL {naam: "De Mekkerwei"}) return stal
Then we can double-click on the node to expand the nearest neighbours. This way we can easily explore the graph nodes and relationships around our centre of focus. An example is shown here:


You can now see how relationships between goats show up in the graph, in accordance with our model.

Querying the graph

You can do some interesting things with the graph. Let's explore it using some cypher queries.

First let's find out who is the father of the father of the father of ... of my goats. Since in the early seventies there were only four bucks of this goat breed, it is interesting to trace back to the roots. For this we have a simple cypher query:
match (stal:STAL {naam: "De Mekkerwei"}) <- [:OWNER] - (geit:GEIT) - [:VADER*] -> (ouder:GEIT) return stal,geit,ouder
The result is interesting:


If we only trace fathers back we see that we have two branches that come together with Ezechiel (leftmost node in the graph.) Ezechiel lived in the early 1960s and has registration number 2. There is not much else known, but a picture exists:


Another interesting graph question to ask is how many goats have a father or mother that is also their grandfather or grandmother. With humans not a very relevant question, but with animal breeders this is certainly possible. (Normally unwanted, but hey, accidents happen, too!) The Cypher query for this:
match (g:GEIT)<-[]-(g2:GEIT), (g:GEIT)<-[]-(g1:GEIT)<-[]-(g2:GEIT) return g,g1,g2
This returns a graph with almost 400 goats involved, from which we can take a small part:

There are lots of other interesting cypher queries that one can do on datasets like these, but that will be the topic of a future post.

Conclusion

It is clear that a graph database can be useful to support animal breeders, especially when examining inbreeding and forefathers. It should help in selecting the right animals for breeding, and to know which are less suited for breeding due to close relationships.

Follow the goats

If you like this post, consider liking my goat breeding page as well: https://www.facebook.com/demekkerwei ! :-)

Read part 2: Gender distribution