Thursday, April 09, 2015

Staring at Goats

Introduction

As a hobby goat breeder, I thought it would be interesting to analyse the registration database of my breed: the rare Dutch Landrace goat, in Dutch the Nederlandse Landgeit (http://nl.wikipedia.org/wiki/Landgeit).

The current population descends from the 8 goats that remained in 1971, so the history of the breed is fairly well documented. All goats are registered at birth, and important later events in their lives are registered as well.

The data kept for each goat includes name, registration number, date of birth, date of death, breeder, owner, mother, father, colour, etc.

In contrast with a human family tree, animal breeding does not form a tree structure but a graph. So I decided to see what we can learn from the goat breeding data if we model it as a graph.

The graph model

Let's start with a simple model. A goat has a father and a mother, is bred by a breeder, and owned by an owner. A goat has a certain colour.

Now let's understand a few Dutch words. Goat = Geit; Stable = Stal; Colour = Kleur; Father = Vader; Mother = Moeder; Owner = Eigenaar; Breeder = Fokker

Now you can fully understand the little graph model.

Of course, the nodes and relationships have properties as well, such as the registration number and name of the goat or stable, so we need a graph database that supports properties. It will come as no surprise that I use the Neo4j graph database for this model.
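To make the idea of nodes and relationships with properties concrete, here is a minimal sketch in plain Python. It is not how Neo4j stores anything. The relationship type names FOKKER and EIGENAAR are assumptions taken from the Dutch vocabulary above, and all example goats and stables are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A graph node with a label (e.g. GEIT, STAL) and free-form properties."""
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed edge from one node to another."""
    start: Node
    rel_type: str
    end: Node


# Hypothetical example data, not taken from the real registry.
vader = Node("GEIT", {"nr": "100", "naam": "Example Buck"})
moeder = Node("GEIT", {"nr": "101", "naam": "Example Doe"})
geit = Node("GEIT", {"nr": "102", "naam": "Example Kid"})
stal = Node("STAL", {"naam": "Example Stable"})

# Relationship type names are assumptions based on the Dutch vocabulary.
relationships = [
    Relationship(geit, "VADER", vader),
    Relationship(geit, "MOEDER", moeder),
    Relationship(geit, "FOKKER", stal),
    Relationship(geit, "EIGENAAR", stal),
]


def neighbours(node, rel_type):
    """Return the nodes reached from `node` over relationships of `rel_type`."""
    return [r.end for r in relationships if r.start is node and r.rel_type == rel_type]
```

Calling neighbours(geit, "VADER") returns the father node, which is the kind of single hop that a graph database follows natively.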

Neo4j

Installing Neo4j is straightforward, and loading data is now simpler than ever thanks to an excellent import tool that handles CSV files (http://neo4j.com/blog/importing-data-neo4j-via-csv/). Exporting the goat database yields a CSV file. Example snippet:


The CSV file contains about 15,000 lines: all goats registered from the early seventies up to 31/03/2015, when I took the export.

I simply use the out-of-the-box Neo4j Cypher console to load, query, and view the data.

All goats are loaded into Neo4j within 4 minutes using a simple Cypher query:
// Create one GEIT node per CSV row, committing every 500 rows.
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///landgeiten.csv" AS csvLine FIELDTERMINATOR ';'
MERGE (geit:GEIT { nr: csvLine.RegistrationNumber, naam: csvLine.Name, geboren: csvLine.Born, overleden: csvLine.Deceased});
Once we have all goats, we do the same for the stables and colours. Then we have to define all relationships. As an example, creating all father and mother relationships takes about 20 minutes with the following Cypher query:
// Look up each goat and both of its parents by registration number, then link them.
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///landgeiten.csv" AS csvLine FIELDTERMINATOR ';'
MATCH (geit:GEIT {nr: csvLine.RegistrationNumber}), (moeder:GEIT {nr: csvLine.MotherRegistrationNumber}), (vader:GEIT {nr: csvLine.FatherRegistrationNumber})
CREATE (geit)-[:VADER]->(vader), (geit)-[:MOEDER]->(moeder);
Next we do the same with the other relationships in our model.
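The two-pass approach (first all nodes, then the relationships) can be sketched in plain Python as well. This is only an illustration under assumptions: the column names are the ones used in the Cypher queries above, but the sample rows are invented and the function names are hypothetical. Note that, like the MATCH ... CREATE query, the sketch creates no parent relationships for a row when either parent cannot be found.

```python
import csv
import io

# Hypothetical sample in the same semicolon-separated layout as the export;
# the rows themselves are invented for illustration.
SAMPLE = """RegistrationNumber;Name;Born;Deceased;MotherRegistrationNumber;FatherRegistrationNumber
1;Founder Doe;1970;1980;;
2;Founder Buck;1969;1979;;
3;Kid;1972;;1;2
4;Orphan;1973;;1;999
"""


def load_goats(text):
    """Pass 1: create one node (a dict of properties) per CSV row, keyed by nr."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter=";"))
    goats = {
        row["RegistrationNumber"]: {
            "naam": row["Name"],
            "geboren": row["Born"],
            "overleden": row["Deceased"],
        }
        for row in rows
    }
    return goats, rows


def link_parents(goats, rows):
    """Pass 2: add VADER/MOEDER edges only when goat, mother and father all exist."""
    edges = []
    for row in rows:
        nr = row["RegistrationNumber"]
        moeder = row["MotherRegistrationNumber"]
        vader = row["FatherRegistrationNumber"]
        # A missing or unknown parent means no edges for this row at all.
        if nr in goats and moeder in goats and vader in goats:
            edges.append((nr, "VADER", vader))
            edges.append((nr, "MOEDER", moeder))
    return edges


goats, rows = load_goats(SAMPLE)
edges = link_parents(goats, rows)
```

In this sample only goat 3 ends up with parent edges: the two founders have no registered parents, and goat 4 refers to a father that is not in the file.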

Now let's see if I can find my own goats and their relationships. For this I first look up my own stable:
MATCH (stal:STAL {naam: "De Mekkerwei"}) return stal
Then we can double-click on the node to expand the nearest neighbours. This way we can easily explore the graph nodes and relationships around our centre of focus. An example is shown here:


You can now see how relationships between goats show up in the graph, in accordance with our model.

Querying the graph

You can do some interesting things with the graph. Let's explore it using some Cypher queries.

First, let's find out who is the father of the father of the father of ... of my goats. Since there were only four bucks of this breed in the early seventies, it is interesting to trace back to the roots. A simple Cypher query does this:
match (stal:STAL {naam: "De Mekkerwei"}) <- [:OWNER] - (geit:GEIT) - [:VADER*] -> (ouder:GEIT) return stal,geit,ouder
The result is interesting:


If we trace back only the fathers, we see two branches that come together at Ezechiel (the leftmost node in the graph). Ezechiel lived in the early 1960s and has registration number 2. Not much else is known about him, but a picture exists:


Another interesting question to ask the graph is how many goats have a father or mother that is also their grandfather or grandmother. For humans this is not a very relevant question, but in animal breeding it certainly happens. (Normally unwanted, but hey, accidents happen, too!) The Cypher query for this:
match (g:GEIT)<-[]-(g2:GEIT), (g:GEIT)<-[]-(g1:GEIT)<-[]-(g2:GEIT) return g,g1,g2
This returns a graph with almost 400 goats involved, from which we can take a small part:

There are lots of other interesting Cypher queries that one can run on datasets like this, but that will be the topic of a future post.
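For readers who do not have Neo4j at hand, the logic behind the two queries in this section can be mimicked in plain Python. This is a rough sketch under assumptions, not a replacement for the graph database: the pedigree below is invented, and the names PARENTS, paternal_line and parent_is_also_grandparent are hypothetical.

```python
# Hypothetical pedigree: child -> (father, mother); None marks an unknown parent.
PARENTS = {
    "kid": ("buck_b", "doe_c"),
    "buck_b": ("buck_a", "doe_x"),
    "doe_c": ("buck_a", "doe_y"),
    "inbred": ("buck_a", "doe_c"),
    "buck_a": (None, None),
    "doe_x": (None, None),
    "doe_y": (None, None),
}


def paternal_line(goat):
    """Follow father links upward, similar to the [:VADER*] pattern in Cypher."""
    line = []
    father = PARENTS.get(goat, (None, None))[0]
    # Stop at an unknown father; the second check guards against cyclic bad data.
    while father is not None and father not in line:
        line.append(father)
        father = PARENTS.get(father, (None, None))[0]
    return line


def parent_is_also_grandparent(goat):
    """Return the ancestors that are both a parent and a grandparent of `goat`."""
    parents = {p for p in PARENTS.get(goat, ()) if p is not None}
    grandparents = {
        gp
        for p in parents
        for gp in PARENTS.get(p, ())
        if gp is not None
    }
    return parents & grandparents


inbred_goats = sorted(g for g in PARENTS if parent_is_also_grandparent(g))
```

Here paternal_line("kid") walks from buck_b up to buck_a, and the goat named "inbred" is flagged because buck_a is both its father and, through doe_c, its grandfather. On 15,000 goats the Python version would still work, but the graph database gives you the visual exploration for free.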

Conclusion

It is clear that a graph database can be useful to animal breeders, especially for examining inbreeding and ancestry. It should help in selecting the right animals for breeding, and in identifying which are less suited because they are too closely related.

Follow the goats

If you like this post, consider liking my goat breeding page as well: https://www.facebook.com/demekkerwei ! :-)

Read part 2: Gender distribution
