Tuesday, April 28, 2015

Staring at Goats IV: Movement

This is part 4 of a series of posts about goat breeding data.
[part 1: introduction] [part 2: Gender] [part 3: Fertility]

Movement

Goats are born at a breeder's site, but often move to another owner. Both breeders and owners are identified as a "stable". So we can examine how goats move between breeders and owners.

Recall from part 1 our graph model (Geit=goat, Stal= stable, Eigenaar=owner, Fokker=breeder):
For each goat, we thus know the breeder and the owner through relationships between goats and stables.

The database (or export thereof) does not keep track of intermediate owners, only current owners. So we have information about the breeder, and the current owner only.

There are four special "owners" with a special identifier:
NL001: Sold to an unregistered owner
NL002: Sold to a trader
NL003: Sold to the butcher
NL004: Dead


A simple cypher query shows how many goats are owned by one of the special owner IDs:
match (g:GEIT)-[:OWNER]->(stal:STAL{nr:"NL001"}) return count(g)
And similarly for the codes NL002, NL003, and NL004 This results in:
NL001: 2152 goats sold to an unregistered owner
NL002: 952 goats sold to a trader 
NL003: 1776 goats sold to the butcher (1232 bucks...)
NL004: 2929 goats dead

We exclude the special NL codes for further analysis, because they dominate the results. After all, eventually a goat dies and will get one of the NL codes.
So let's see which ten breeders and owners exchange most goats:
match (s1:STAL)<-[:FOKKER]-(g:GEIT)-[:OWNER]->(s2:STAL) where s2.naam <> s1.naam and NOT s2.nr IN ["NL001","NL002","NL003","NL004","NL005"] and s1.nr<>"ONB1" and s2.nr<>"ONB1" return s1.nr,s1.naam,count(g),s2.nr,s2.naam order by count(g) descending limit 10

So, breeder with number NB093 has 22 goats delivered to current owner LI013.

So far, no rocket science. Let's now see how goats move about between the Dutch provinces. Stable identifiers start with two letters which indicate the province: DR=Drente, FL=Flevoland, FR=Friesland, GE=Gelderland, GR=Groningen, LI=Limburg, NB=Noord-Brabant, NH=Noord-Holland, OV=Overijsel, UT=Utrecht, ZE=Zeeland, ZH=Zuid-Holland. Then there is IN which stands for international, i.e., outside The Netherlands.

We now get a complex cypher query, where we have to do some pattern matching using regular expressions and we have to group the results into provinces. Here it is:
match (s1:STAL)<-[:FOKKER]-(g:GEIT)-[:OWNER]->(s2:STAL) where s1.nr=~"[A-Z]{2}[0-9]{3}" and s2.nr=~"[A-Z]{2}[0-9]{3}" with s1, g, s2, SUBSTRING(s1.nr,0,2) AS p1, SUBSTRING(s2.nr,0,2) AS p2 return p1,p2,count(g) order by p1
Note how we use the regular expression in the where clause, and how we pass the results on to group them by province using the WITH keyword. The result is a table for which the first lines are:

So, 28 goats where bred in DR and their current owner is in OV, etc. We now put this into a matrix form:
The left column identifies the breeder's province, whereas the top row identifies the current owner's province. We deliberately only included goats in the cypher query where the breeder is not the same as the current owner. So the numbers identify goats that actually moved about.

We can turn this into a neat visual using the D3.js library.


This visual is called a chord diagram. The colour coding of the connections is such that it has the colour of the breeder's province, and points to the current owner's province. The thicker the source, the larger the number of goats.

For all provinces, most goats stay within the province. Next are the neighbouring provinces. Although it is interesting to see that no goat was moved from Zuid-Holland to Noord-Holland. (At least of the current owners, thus live goats.) Movements across large distances are uncommon. And international movements are also rare (due to legislation, of course). Just a few goats have been moved from other countries back into The Netherlands.

The D3.js library allows interaction as well, but unfortunately I can't show that on this page, because this blog site does not allow such Javascript on the page. Therefore, I created a separate page where you can play around with it, and you can view the source code of that page to see how it actually works: http://mekkerwei.be/data/goatMovement2.html

Saturday, April 25, 2015

Staring at Goats III: Fertility

This is part 3 of a series of posts about goat breeding data.
[part 1] [ part 2]

Fertility

A few bucks actually account for a large number of offspring. Who are the champions? A simple cypher query shows the list of goats with the most children in descending order.
match (g:GEIT)-[]->(g2:GEIT)-[]->(jaar:JAAR) return g2.naam,jaar.jaar,count(g) order by count(g) descending limit 100

Wow, 98 children!

Shown in a graph, this looks like this:


The couple with the most grandchildren

Which couple has the most grandchildren? Whereas the previous query is actually easy using SQL in a relational database, this one is a bit trickier in a relational database.

In a relational database, you probably have a goat table somewhere, with goat IDs and names, plus foreign keys to their parents, who can be found in the same table. Like the example given here:
Now, if Martha also has a number of children, knowing how many grandchildren a specific couple, like Joke and Willem, has, is not a simple query in SQL.

Let's have a look at the graph equivalent.
The grandparents are Joke and Willem. They have a number of children; Martha is not their only child. So, the two children of Martha, Swieber and Saar, are only two of the total number of grandchildren of Joke and Willem.

Let's look at the top ten couples and how many children they put forth. In cypher:
match (m:GEIT)<-[:MOEDER]-(g:GEIT)-[:VADER]->(v:GEIT) return m.naam,v.naam,count(g) order by count(g) descending limit 10

Borus is a goat from the early eighties, when there were not so many of the breed yet, so goats had to give birth to more offspring than is currently necessary. Naamloos means "no name".

Now we extend this to include grandchildren. In cypher:
match (m:GEIT)<-[:MOEDER]-(g:GEIT)-[:VADER]->(v:GEIT),(g:GEIT)<-[:MOEDER|:VADER]-(l:GEIT) return m.naam,v.naam,count(l) order by count(l) descending limit 10

The record couple thus has 105 grandchildren. And what a simple and fast query this is!

Interestingly, the third couple produced 98 grandchildren, all due to their only child Humberto, which is our champ buck from above. (And indeed, Humberto has no brothers nor sisters.)

Breeder productivity

Two simple questions: 
  1. Who are the most successful breeders?
  2. Who are the most productive breeders?
There is no clear definition of a successful breeder, but let's define it as the breeder of those goats who together produced the most children. In cypher:
match (g:GEIT)-[]->(g2:GEIT)-[:FOKKER]->(stal:STAL) return stal.naam,count(g) order by count(g) descending limit 100
(Note that this query is not trivial in SQL for a relational database as it contains some joins.)


The most successful breeder has no name. That's because these are unknown breeders, and are not necessarily identical breeders, so we have to ignore the first line. The winner has produced goats with in total 1201 children.

The second question is answered by examining at which breeder the most goats have been born. In cypher:
match (stal:STAL)<-[:FOKKER]-(g:GEIT) return stal.naam,count(g) order by count(g) descending limit 100

Again, the first row should be ignored, as these are the unknown breeders. So, the winner has 434 goats born on their farm, over all the years.

Conclusion

The graph model is a very easy model to quickly examine a data set. And some queries are actually much easier in cypher than in SQL, which means you can analyse the data set in quite a unique way, which brings you some new insights into the data.

Thursday, April 16, 2015

Staring at Goats II: Gender distribution

This is part 2 of a series of posts about goat breeding data.
[part 1]

In a previous post ("Staring at Goats") I showed how you can build and query a graph database of goats, using the Neo4J graph database and the cypher query language. It is now time to explore the data in more detail.

The extended model

We will now extend this model with gender data and the year in which the goats are born. Here is the extended model. (Lesson 2 in the Dutch language: Geslacht = Gender, Jaar = Year; see the previous post for lesson 1.)

We used the same technique loading these data from csv file as described in the previous post.

Just to check that the model works, let's find my own goats again and see with what nodes they are linked. We now see the gender and year added.



We can now ask questions like how many bucks and goats are born in a specific year? The cypher query for this is as follows. In this example, we choose 2015 as the year.
match (geit:GESLACHT)<-[:GESLACHT]-(g:GEIT)-[:JAAR]->(jaar:JAAR {jaar:15}) return geit.geslacht,count(g)
The outcome of this query is 178 bucks and 156 goats.

We can do this for each individual year, but we can grab this result for all years in one go using this cypher query:
match (:GESLACHT{geslacht:"Geit"})<-[:GESLACHT]-(geit:GEIT)-[:JAAR]->(jaar:JAAR)<-[:JAAR]-(bok:GEIT)-[:GESLACHT]->(:GESLACHT{geslacht:"Bok"}) return jaar.jaar,count(distinct(bok)),count(distinct(geit)) order by jaar.jaar descending
We now get a list with counted bucks and goats:

Cool! There is an interesting difference in the number of bucks and goats being born. Where we expect a 50/50 distribution, there are clearly more goats than bucks being born.

Hmm. This is probably a systematic effect: probably not all bucks are entered into the database! This might influence statistical data that is gender-dependent.

Gender ratios

Let's plot the percentages bucks and goats since 1975. For visualisation let's use D3.js (see https://github.com/mbostock/d3/wiki)

Our first plot simply shows the percentage of bucks versus goats being born, per year, since 1975.


Interestingly, in 2009, the percentage of bucks entered into the database went up from around 30% to about 50%. Since 2009, the gender statistics make sense. Before that, bucks may have ended up at the butcher's before ever being recorded. In the 1970's we talk about small numbers, so there is a larger variation in distribution.

If we plot the total number of goats born per year, we see how there is a steady rise from small numbers in 1975, to record 1200 goats in 2009. From then on, a decreasing trend in the number of bucks and goats being born is observed. That's less good news for the total population.


Read part 3: fertility

Thursday, April 09, 2015

Staring at Goats

Introduction

As a hobby breeder of goats I thought it might be interesting to do some data research on the database of the specific goat breed. In my case the goat breed is the rare Dutch Landrace goat, in Dutch the Nederlandse Landgeit. (http://nl.wikipedia.org/wiki/Landgeit)

The current population is descendent from 8 remaining goats in 1971. So there is a fairly well documented history of the current breed. All goats are registered when they are born, and subsequent important events in their life are also registered.

Example data that is kept with each goat includes name, registration number, date of birth, date of decease, breeder, owner, mother goat, father goat, colour, etc.

In contrast with humans, breeding of animals is not a tree structure, but a graph. So I decided to see what we can learn from the goat breeding data, if we model it as a graph.

The graph model

Let's start with a simple model. A goat has a father and a mother, is bred by a breeder, and owned by an owner. A goat has a certain colour.

Now let's understand a few Dutch words. Goat = Geit; Stable = Stal; Colour = Kleur; Father = Vader; Mother = Moeder; Owner = Eigenaar; Breeder = Fokker

Now you can fully understand the little graph model.

Of course, the nodes and relationships have properties as well, such as the registration number and name of the goat or stable. Therefore, we need to pick a graph database supporting this. It is not a surprise that I use the Neo4J graph database to support the model.

Neo4J

Installing Neo4J is straightforward, and loading it with the data is now simpler than ever with an excellent import tool that can handle csv files. (http://neo4j.com/blog/importing-data-neo4j-via-csv/) Exporting the goat database yields a csv file. Example snippet:


There are about 15,000 lines in the csv file. That's all goats since the early seventies until 31/03/2015, when I took the export.

I simply use the out-of-the-box Neo4J cypher console to load, query, and view the data.

Within 4 minutes, all goats are loaded into Neo4J using a simple Cypher query:
USING PERIODIC COMMIT 500LOAD CSV WITH HEADERS FROM "file:///landgeiten.csv" AS csvLine FIELDTERMINATOR ';'MERGE (geit:GEIT { nr: csvLine.RegistrationNumber, naam: csvLine.Name, geboren: csvLine.Born, overleden: csvLine.Deceased});
Once we have all goats, we do the same with the stables and colours. Then we have to define all relationships. As an example, for all father and mother relationships, this takes about 20 minutes with the following cypher query:
USING PERIODIC COMMIT 500LOAD CSV WITH HEADERS FROM "file:///landgeiten.csv" AS csvLine FIELDTERMINATOR ';'MATCH (geit:GEIT {nr: csvLine.RegistrationNumber }), (moeder:GEIT {nr: csvLine.MotherRegistrationNumber}), (vader:GEIT {nr: csvLine.FatherRegistrationNumber}) CREATE (geit) - [:VADER] -> (vader), (geit) - [:MOEDER] -> (moeder) ;
Next we do the same with the other relationships in our model.

Now let's see if I can find my own goats and their relationships. For this I have to search my own stable:
MATCH (stal:STAL {naam: "De Mekkerwei"}) return stal
Then we can double-click on the node to expand the nearest neighbours. This way we can easily explore the graph nodes and relationships around our centre of focus. An example is shown here:


You can now see how relationships between goats show up in the graph, in accordance with our model.

Querying the graph

You can do some interesting things with the graph. Let's explore it using some cypher queries.

First let's find out who is the father of the father of the father of ... of my goats. Since in the early seventies there were only four bucks of this goat breed, it is interesting to trace back to the roots. For this we have a simple cypher query:
match (stal:STAL {naam: "De Mekkerwei"}) <- [:OWNER] - (geit:GEIT) - [:VADER*] -> (ouder:GEIT) return stal,geit,ouder
The result is interesting:


If we only trace fathers back we see that we have two branches that come together with Ezechiel (leftmost node in the graph.) Ezechiel lived in the early 1960s and has registration number 2. There is not much else known, but a picture exists:


Another interesting graph question to ask is how many goats have a father or mother that is also their grandfather or grandmother. With humans not a very relevant question, but with animal breeders this is certainly possible. (Normally unwanted, but hey, accidents happen, too!) The Cypher query for this:
match (g:GEIT)<-[]-(g2:GEIT), (g:GEIT)<-[]-(g1:GEIT)<-[]-(g2:GEIT) return g,g1,g2
This returns a graph with almost 400 goats involved, from which we can take a small part:

There are lots of other interesting cypher queries that one can do on datasets like these, but that will be the topic of a future post.

Conclusion

It is clear that a graph database can be useful to support animal breeders, especially when examining inbreeding and forefathers. It should help in selecting the right animals for breeding, and to know which are less suited for breeding due to close relationships.

Follow the goats

If you like this post, consider liking my goat breeding page as well: https://www.facebook.com/demekkerwei ! :-)

Read part 2: Gender distribution

Thursday, February 20, 2014

How much is 14 000 000 000 €?

Facebook bought a messaging app for nearly 14 billion euros. That's a 14 with nine zeros. That's a lot of money. Facebook's revenue for 2013 was about 8 billion euros. Also a lot of money. Whatsapp's revenu for 2013 was only about 15 million euros.

Why do you want to pay a thousand times more than its yearly revenue for a company?

Probably because money is not worth anything anymore.

For 14 billion euros you can buy some countries, too. If the gross domestic product (GDP) is a value of a country, for the same amount of money you can get Botswana and Monaco together.

Botswana and Monaco at least have real value. Real land, real resources, real things.

How real is Whatsapp?

In the 17th century, tulip bulbs were rated higher than Amsterdam houses. Until someone asked himself how can it be that some little thing with hardly any intrinsic value can be worth more than a house with a large intrinsic value. Tulip mania, the world's first economic bubble was born.

Even though Whatsapp may at some point reach revenues of billions, it still has hardly any intrinsic value. The only conclusion must be that this is a new economic bubble.

If, tomorrow, a majority of Facebook users stop using it, the value of Facebook plunges.

The disturbing thing is that these kind of transactions are backed by large financial institutions that may very well run into trouble when these kind of transactions fail. And these financial institutions, we have seen over the past years, can have a devastating effect an the economy. How can it be that these disproportional transactions are possible?

Money isn't anything worth anymore. Tomorrow, all this money may be gone again.

Botswana and Monaco on the other hand will remain.

Sunday, February 16, 2014

Time measurement in speed skating

On Saturday 15 February 2014, during the Sochi Olympic games, Zbigniew Brodka of Poland won the 1500 meter speed skating race with 0.003 s ahead of Dutch skater Koen Verweij.

That's three thousandths of a second. Both skaters finished in the same time, and then the thousandths will decide.

Have we seen this before? Remember the Olympics from 1980 in Lake Placid? Swedish skier Thomas Wassberg won the 15 km cross country with 0.01 s over Finnish skier Juha Mieto: http://www.youtube.com/watch?v=M9mYj4FKDtg

After Lake Placid, it was decided to no longer measure times in hundredths of a second, because the start of the race cannot be measured with such accuracy.

Will the same happen now? Why has someone decided that time can be measured with an accuracy of one thousandths of a second in speed skating?

Of course, technically we can measure very accurately, but does it make sense? In speed skating on the 1500 meter distance, one rider starts in the inner lane, and one in the outer lane. The distance between the two riders is about 3 metres. The starter uses a starting pistol, which makes a sound. The riders start when they hear the sound, and the time starts running electronically. The starter is located at the outside of the track, closest to he rider in the outer lane.

The distance between the two riders means the inner lane hears the start sound about 0.01 seconds later than the outer lane. That's because sounds travels at a speed of about 300 metres per second, which is about 3 metres in 0.01 second.

in other words, the skater in the inner lane has a disadvantage of about 0.01 seconds. That's the intrinsic uncertainty of time measurement in speed skating. It does not make any sense to try to measure time more accurately, even though this is perfectly possible technically.

Also, if a rider starts too quickly, he risks a false start. Whether a start is false or not is the sole judgement of the starter, who visually judges the start. Can he judge accurately down to a thousandths of a second? I figure a hundredth of a second is already pretty tough to judge.

Interestingly, Brodka - who was declared the sole winner - started in the outer lane, whereas Verweij started in the inner lane.

Technology is clearly overused in this case. There is too much belief in technology, and normal physics of the real world where the speed of sound is not infinitely large is forgotten.

It is a pitty that riders like Mieto in 1980 and Verweij in 2014 are not rewarded with the gold medal, as the time differences measured are well within the intrinsic uncertainty of time measurement.

Here is the official time measurement explanation, where physics is totally forgotten: http://www.omegawatches.com/news/international-news/international-news-detail/2232.

And here a good explaining picture (in Dutch): https://twitter.com/annejan88/status/434777328955441152/photo/1

Thursday, August 01, 2013

Meteorite test set

In a previous post I wrote about impact analysis of modifying a web service contract. In this post I will provide you with a small test set, and guide you through the process of impact analysis.

First you need to download the sample archive file. You can unpack it and open it in your favourite XML and WSDL editor to examine the details. I will describe it here briefly.

There are two SOAP web services, the nothing service and the something service. The nothing service only has a single operation, doNothing(), whereas the something service has two operations, getSomething() and searchSomething().

Both services require a custom SOAP header element with some token inside. Otherwise the service won't work.

The web service contracts defined in the WSDLs follow the document literal pattern, and the message bodies for the input and output messages of each operation are defined in separate XML files. These XML files and their namespaces are imported in the WSDLs.

The nothing service does basically nothing. That is, it will simply return a fixed string, no matter what string you input.

The something service on the other hand can search for something, and it can get something. Something is defined in its own XML schema file, which is imported in the message XML schema file.

In case you search for something, you will receive a list of somethings, each with an ID and a name. If you want more details of something, you need to get the something directly via its ID, and you get some more details. A detailed something is an extension of a basic something in the XML schema.

Suppose the nothing service and the something service are successfully serving multiple clients, but a change in configuration requires that the header must be adapted and a unique ID must be added to it, and there is a new optional detail in something that we would like to return in the getSomething() operation. What is the impact of these two small changes?

To find out, we upload the archive to Meteorite. Logon, select the archive, and upload it. After uploading, you will see a collapsed data tree. This is the tree of files and the WSDL and XML schema nodes.

First we will examine the impact of modifying the header. It's the "MeteoriteHeaderType" node in .../header.xsd which we want to modify with an extra element. Open the tree to find it, and then click on the control button on the right to perform an impact analysis.

We are now back at the collapsed data tree. The impact analysis is done, and we now have four views with which we can examine the impact. The data tree is the first of the four views.

Data Tree View.
This is the default view because it is so close to your input hierarchy. Expand all nodes and scroll a bit down. In red is the node which is the subject of the impact analysis. In blue all nodes that somehow are impacted by a change. Scroll down further and you will see that both services are impacted, but the impact is limited to the headers only. Because both input and output header are impacted, this change is not backwards compatible, and thus all clients of the web service are impacted.

Impact Report View.
This view only lists the WSDL files with impact. It ignores all XML schema files. So it provides a high-level overview of services that are impacted. However, header impact is not shown. Impact is limited to the bodies of the SOAP messages. (This is by design, as header fields are handled differently from body fields in most XML binding frameworks.)

Export the Impact to Text
This view shows you a text file with all the paths from the impacted node upward to the web service contract. For example, one of the six different paths is this one:
"(MeteoriteHeaderType)<--[INHERITS_FROM]--(meteoriteHeader)<--[USES_ELEMENT]--(header)<--[CONTAINS]--(headerResponse)<--[CONTAINS]--(schemas/com/archimiddle/meteorite/sampleServices/wsdl/somethingService.wsdl)<--[CONTAINS]--(http://meteorite.archimiddle.com/wsdl/somethingService)<--[IS_PARENT]--(Project - meteorite-test.rar)"
The notation is based on the Cypher query language for graph databases. Nodes are between brackets and relationships are between square brackets and form a directed arrow. Note that the file names as well as the namespaces are also considered nodes in the graph. After all, they are relevant through the import and schemaLocation elements and attributes.

Export the Impact to Image
This view shows you the impact graph as an image. Open it and you will something like this:
(Click on the image to enlarge)
Here we see the six paths again.

Next we want to add an optional detail to something. What is the impact of that?

Go back to "View the Data Tree." (Not the one from the impact menu, the other one) We have now cleared the previous impact analysis, and we can start with a new one.

In the search box upper right, start typing SomethingDetailsType and pick it once it shows. Click on the search button and you have your node on the screen. Now click the impact analysis button on the right of it.

Go to the impact report view and we see that only the something service is impacted. The nothing service is not affected by this change.

Export to image and you will get this impact graph:

We see that only the getSomething() operation is affected, not the searchSomething() operation. This is so because the search will only return basic something, not the detailed something.

This concludes the short tutorial for Meteorite. You can modify the schemas and experiment with it yourself. It is something surprising what impact a small and seemingly innocent change in a schema can have!

Now try to analyze the impact of a change without using the Meteorite tool. Did you find all nodes that are impacted? How long did it take you?

Meteorite uses a graph database (Neo4J) to traverse the impact graph and is therefore very fast. This is just a simple example of the power of graph databases in performing impact analysis in complex structures.