Saturday, April 25, 2015

Staring at Goats III: Fertility

This is part 3 of a series of posts about goat breeding data.
[part 1] [ part 2]

Fertility

A few bucks actually account for a large number of offspring. Who are the champions? A simple cypher query shows the list of goats with the most children in descending order.
match (g:GEIT)-[]->(g2:GEIT)-[]->(jaar:JAAR) return g2.naam,jaar.jaar,count(g) order by count(g) descending limit 100

Wow, 98 children!

Shown in a graph, this looks like this:


The couple with the most grandchildren

Which couple has the most grandchildren? Whereas the previous query is actually easy using SQL in a relational database, this one is a bit trickier in a relational database.

In a relational database, you probably have a goat table somewhere, with goat IDs and names, plus foreign keys to their parents, who can be found in the same table. Like the example given here:
Now, if Martha also has a number of children, knowing how many grandchildren a specific couple, like Joke and Willem, has, is not a simple query in SQL.

Let's have a look at the graph equivalent.
The grandparents are Joke and Willem. They have a number of children; Martha is not their only child. So, the two children of Martha, Swieber and Saar, are only two of the total number of grandchildren of Joke and Willem.

Let's look at the top ten couples and how many children they put forth. In cypher:
match (m:GEIT)<-[:MOEDER]-(g:GEIT)-[:VADER]->(v:GEIT) return m.naam,v.naam,count(g) order by count(g) descending limit 10

Borus is a goat from the early eighties, when there were not so many of the breed yet, so goats had to give birth to more offspring than is currently necessary. Naamloos means "no name".

Now we extend this to include grandchildren. In cypher:
match (m:GEIT)<-[:MOEDER]-(g:GEIT)-[:VADER]->(v:GEIT),(g:GEIT)<-[:MOEDER|:VADER]-(l:GEIT) return m.naam,v.naam,count(l) order by count(l) descending limit 10

The record couple thus has 105 grandchildren. And what a simple and fast query this is!

Interestingly, the third couple produced 98 grandchildren, all due to their only child Humberto, which is our champ buck from above. (And indeed, Humberto has no brothers nor sisters.)

Breeder productivity

Two simple questions: 
  1. Who are the most successful breeders?
  2. Who are the most productive breeders?
There is no clear definition of a successful breeder, but let's define it as the breeder of those goats who together produced the most children. In cypher:
match (g:GEIT)-[]->(g2:GEIT)-[:FOKKER]->(stal:STAL) return stal.naam,count(g) order by count(g) descending limit 100
(Note that this query is not trivial in SQL for a relational database as it contains some joins.)


The most successful breeder has no name. That's because these are unknown breeders, and are not necessarily identical breeders, so we have to ignore the first line. The winner has produced goats with in total 1201 children.

The second question is answered by examining at which breeder the most goats have been born. In cypher:
match (stal:STAL)<-[:FOKKER]-(g:GEIT) return stal.naam,count(g) order by count(g) descending limit 100

Again, the first row should be ignored, as these are the unknown breeders. So, the winner has 434 goats born on their farm, over all the years.

Conclusion

The graph model is a very easy model to quickly examine a data set. And some queries are actually much easier in cypher than in SQL, which means you can analyse the data set in quite a unique way, which brings you some new insights into the data.

No comments:

Post a Comment