API Documentation: Crazy Idea?

In putting together documentation for a current API project, I browsed around for an off-the-shelf solution to easily generate some beautiful documents. Aesthetics are my first priority. Portability to popular API gateways like Mashape or Apigee is a second priority. The third consideration is the quality of tooling to pipe the portable API data into the beautiful templates.

Aesthetics

The Stripe docs are some of my favourites. Unfortunately, Stripe has not yet released the tools it uses to generate those beautiful docs. That’s when I first came across Slate from Robert Lord. Though Stripe was my first choice, the Twitter and Uber docs are both inspirational. The Mandrill docs are attractive. And the default output template for ApiDoc has great potential.

In the second tier, the look and feel of the Facebook docs falls a bit short. Although I prefer the Mailgun service to Mandrill, their docs are clear but unexciting. Twilio is another fantastic service, but the design of their docs is second tier.

Portability

The leading standard for portability is Swagger. Originally, I wasn’t thrilled by the prospect of defining the API in YAML/JSON, but I’ve come around to that approach. However, the default template for Swagger UI made me a bit queasy!

Even though Swagger was the only standard compatible with Apigee and Mashape, that default theme meant I had to explore further. API Blueprint from the team at Apiary was the next possibility. I love the simple appeal of using Markdown to write documentation. API Blueprint looked to have all the essentials figured out, and some solid tooling for building and using those templates, including conversion to Swagger. In particular, the documentation generation via Aglio looked very promising. After sorting out some Node issues, it works well out of the box. However, it fell foul of twisted templating: its Jade templates are unfortunately too Frankenstein for my taste!

Although Slate is a winning design template, its static Markdown does not immediately offer any portability. However, API Blueprint proves it can be done with enough effort and some awkward gymnastics.

That left RAML. On the standards side, I couldn’t see any immediate advantages versus Swagger. On the documentation output, designer input seems lacking.

A Crazy Idea

It is clear that the Apiary team have worked Herculean hours to build the API Blueprint ecosystem. The immediate appeal of Markdown vs. YAML is huge. However…

Using Markdown as a semantic language is a bad idea. Using Markdown as a data-modelling language is also a bad idea. For all but the simplest schemas, Markdown just won’t do. For light schemas, YAML makes sense. For a bit more power, JSON is great. For highly structured schemas, XML is the best option we have.

What about using HTML? Since ng- attributes started to infuse our HTML, it’s been clear how we could map all manner of meaning, structure, and functionality onto our otherwise unharmed HTML. I don’t want to coerce Markdown to be something it isn’t. But I do like having static files which render nicely in a browser. Why not just decorate vanilla HTML with some semantic attributes? Then I could write my API docs in HTML and extract a Swagger or RAML schema with a few querySelectors.
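To make the idea concrete, here is a rough sketch (not a proposed standard; the x-api-* attribute names are invented for illustration) of pulling a Swagger-ish structure out of decorated HTML with a few querySelectors:

JAVASCRIPT

  // Scrape hypothetical x-api-* attributes from static HTML into a
  // Swagger-ish structure. Attribute names are illustrative only.
  const paths = {};

  document.querySelectorAll('[x-api-path]').forEach(section => {
    const path = section.getAttribute('x-api-path');      // e.g. "/users"
    const method = section.getAttribute('x-api-method');  // e.g. "get"
    const summaryEl = section.querySelector('[x-api-summary]');

    const parameters = [...section.querySelectorAll('[x-api-param]')].map(el => ({
      name: el.getAttribute('x-api-param'),
      description: el.textContent.trim()
    }));

    paths[path] = paths[path] || {};
    paths[path][method] = {
      summary: summaryEl ? summaryEl.textContent.trim() : '',
      parameters: parameters
    };
  });

  console.log(JSON.stringify({ swagger: '2.0', paths: paths }, null, 2));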

There is no rocket science in this idea. I’m just surprised that, in the landscape of API documentation approaches, there isn’t a standard for API-related HTML attributes.

Postscript

The API project from which this post evolved is itself still a work in progress. More details to follow. We have implemented a simple x-api attribute schema in the HTML. However, for all but the smallest API, it is much more compact and sane to keep the API information in JavaScript. Then we can easily render the HTML with Gulp through some very lean Mustache templates.
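As a sketch of that setup (illustrative only, not the actual build; the field and attribute names are invented), the spec-in-JavaScript plus lean Mustache template combination might look like:

JAVASCRIPT

  // The API spec lives in a plain JavaScript object and is rendered into
  // static, attribute-decorated HTML through a small Mustache template.
  const Mustache = require('mustache');

  const api = {
    endpoints: [
      { method: 'GET', path: '/users', summary: 'List users' },
      { method: 'POST', path: '/users', summary: 'Create a user' }
    ]
  };

  const template = `
  {{#endpoints}}
  <section x-api-path="{{path}}" x-api-method="{{method}}">
    <h3>{{method}} {{path}}</h3>
    <p x-api-summary>{{summary}}</p>
  </section>
  {{/endpoints}}`;

  console.log(Mustache.render(template, api));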

Notes:

  1. Mashape now owns Gelato. Although Gelato supports API-Blueprint, Mashape does not. I expect Mashape will integrate Gelato in due course.
  2. I had a few small gripes with Slate. One is that it uses Middleman. Middleman is great, but ideally I want to serve a static site directly from Nginx or Apache. It was a quick job to strip out Middleman and run all the static file generation through Gulp. It took a little more work to clean out the custom javascript in favour of standard modules to pull in with Bower.
  3. Mulesoft has garnered support from big enterprise software providers, and possibly has a better specification.
  4. Where would we be without @gruber?! Of course Markdown is structured, it’s just not meant for complex, nested structures.
  5. Although Angular was not the first to use custom attributes, it certainly popularized the approach. HTML (or at least its XHTML incarnation, as a subset of XML) has always had the possibility. Though a rather horrible spec with some even worse implementations, iXBRL demonstrates how far the approach can be used / abused.
  6. Neither Stripe nor the default Swagger UI outputs use any custom attributes. The Swagger Editor and RAML API Editor only use Angular attributes.
  7. As we want to factor out common bits of the API spec, we need more than just JSON.

10 Tips for Business Founders

I recently experimented with LinkedIn Pulse, where I highlighted some points covered in greater depth by Founders Path.

The intended audience is non-technical, particularly non-technical founders. Although I may experiment with Quora and Medium, I expect to keep posting here on software, datasets and other bits.

Calories of a different kind

Although feeding a population is the lynchpin of civilisation, it’s nice to have calories to feed our machines as well! Just as it is difficult to find how many food calories are produced by each country, data for non-food calories can also be a challenge. However, a few key sources provide a great start:

First, I used the proven reserve figures for coal, oil, natural gas, and uranium. Then I used current consumption levels for solar, wind, hydro, geothermal and other renewables, multiplied by estimated years of supply. Where available, figures in TWh are used directly. Elsewhere, estimated conversion factors are used.
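To make the conversion concrete, here is an illustrative sketch (the reserve figure is invented; the 2-8 kWh / kg coal range is taken from the notes below):

JAVASCRIPT

  // An illustrative sketch of the conversion, not the actual dataset or code.
  // The 2-8 kWh/kg range for coal comes from the notes below; the reserve
  // figure here is made up.
  const coalReserveTonnes = 100e9;                     // hypothetical proved reserves
  const kWhPerKgCoal = (2 + 8) / 2;                    // midpoint of the assumed range

  const kWh = coalReserveTonnes * 1000 * kWhPerKgCoal; // tonnes -> kg -> kWh
  const tWh = kWh / 1e9;                               // 1 TWh = 1e9 kWh

  console.log(tWh.toLocaleString() + ' TWh');          // 500,000 TWh for this made-up reserve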

Putting all of that together, the top 10 countries endowed with the most energy are:

Country         TWh         Top 2 Sources
United States   1,339,772   Coal (89%), gas
Russia          1,128,329   Coal (70%), gas
China           673,882     Coal (85%), hydro
Australia       477,389     Coal (80%), uranium
Venezuela       384,224     Oil (90%), gas
Iran            382,632     Gas (52%), oil
Saudi Arabia    357,447     Oil (87%), gas
India           330,010     Coal (92%), hydro
Canada          285,560     Oil (70%), hydro
Kazakhstan      244,868     Coal (69%), oil

As with agricultural energy, other energy reserves also favour large countries. The availability of coal is worth noting, despite it being politically out of favour. This time, the per capita analysis reveals only a slightly different picture:

Country        kW years / person   Top 2 Sources
Greenland      16,495              Uranium (100%)
Qatar          8,841               Gas (83%), oil
Kuwait         3,759               Oil (92%), gas
Australia      2,278               Coal (80%), uranium
Turkmenistan   2,190               Gas (99%), oil
UAE            1,858               Oil (76%), gas
Kazakhstan     1,589               Coal (69%), oil
Venezuela      1,428               Oil (90%), gas
Saudi Arabia   1,294               Oil (87%), gas
Libya          1,180               Oil (87%), gas

In the above analysis, Greenland is perhaps the only surprise, with its very small population and its 221,200 tonnes of uranium, which would be financially and environmentally expensive to extract.
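As a rough sanity check of that figure, using the 221,200 tonnes above, the Red Book’s actual production rate of 37.7 MWh per kg from the notes, and an assumed population of roughly 57,000:

JAVASCRIPT

  // Back-of-the-envelope check of the Greenland figure. The uranium tonnage and
  // the 37.7 MWh/kg rate come from this post; the population is an assumption.
  const uraniumKg = 221200 * 1000;                 // 221,200 tonnes -> kg
  const mWhPerKg = 37.7;                           // actual 2012 production rate (Red Book)
  const population = 57000;                        // approximate population of Greenland

  const kWh = uraniumKg * mWhPerKg * 1000;         // MWh -> kWh
  const kWYearsPerPerson = kWh / (24 * 365) / population;

  console.log(Math.round(kWYearsPerPerson));       // ~16,700, in line with the table above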

With few real insights amongst the countries well-endowed with energy resources, what about the countries with the smallest energy reserves? Here, we just look at the 10 most populous countries with less than 2 kW years / capita of energy reserves:

Country       Population (millions)
Bangladesh    161
Philippines   101
Ethiopia      99
Kenya         46
Uganda        39
Afghanistan   33
Nepal         29
Madagascar    24
Taiwan        23
Cameroon      23

Unsurprisingly, all of these countries, apart from Taiwan, face developmental challenges. Many, however, should have opportunities to harness their renewable energy resources in the future, and I hope to create a better picture of renewable potential in subsequent posts. For now, based upon this analysis, the countries currently best capturing their renewable energy sources are:

Country       Renewable kW years / person   Top 2 Sources
Iceland       309                           Hydro (74%), geothermal
Norway        122                           Hydro (23%), wind
Andorra       97                            Hydro (100%)
Bhutan        77                            Hydro (99%)
Paraguay      53                            Hydro (100%)
Canada        51                            Hydro (5%), wind
Montenegro    43                            Hydro (100%)
Dominica      42                            Geothermal (93%), hydro
Sweden        40                            Hydro (67%), wind
New Zealand   35                            Hydro (23%), geothermal

Amongst these countries, Iceland, Paraguay, and Canada also featured in our top producers of food calories per capita, and most of these countries enjoy high standards of living.

Next Steps

The treatment here of renewable energy reserves is based upon actual installed capacity, rather than potential capacity. I hope to crunch through some of the NCDC / NCEI data as the next step in better assessing renewable energy reserves.

Notes:

  1. Particularly “renewable” energy sources, e.g., solar, wind, hydro, sustainable biomass, etc.
  2. Coal data is from BP and is the total proved reserves of anthracite, sub-bituminous, bituminous, and lignite, i.e., no distinction is made in quality despite the different thermal properties of each.
  3. Oil and gas data is also from BP. Although the principal consumption of oil is to power petrol and diesel engines, it can be used to power electrical generators as well, which is the implicit assumption of this analysis.
  4. Uranium reserve levels are from the Red Book, and include all deposits with extraction costs up to USD 260 / kg U.
  5. I have used consumption figures rather than production capacity where available, as capacity levels are often unreliable, particularly for environmental energy sources.
  6. Wind, solar, hydro and geothermal consumption levels are from BP where available. Where BP has not provided the information, CIA estimates for the installed electricity production capacity of hydro-electric and other renewable sources are used.
  7. 40 years has been used for the purposes of this analysis, though depending upon the particular renewable resource, it should be available for many generations.
  8. Terawatt hours
  9. Assumptions as follows:
    • A barrel of oil weighs 136.4kg, and generates between 4 and 13 kWh / kg.
    • A cubic metre of natural gas weighs 0.85kg, and generates between 4 and 8 kWh / kg.
    • A kg of coal generates between 2 and 8 kWh / kg.
    • The figures available for the energy released by the fission of a kilogram of uranium are theoretically between 139 MWh and 168 MWh (assuming 0.7% U-235 per kg of raw uranium). However, for this analysis we use the actual energy production rate for 2012 of 37.7 MWh from p. 77 of the Red Book.
  10. Newer, cleaner, more efficient coal fuelled power plants may not be given proper consideration because of the ecological impact of their older, dirtier predecessors. The EIA has compiled some fascinating comparative statistics, in particular for different coal technologies.
  11. Calculated as kWh divided by (24 * 365), then divided by the 2015 UN population projections where available, or the CIA population estimate otherwise.
  12. There are 27 in total with less than 2 kW years / capita
  13. In millions, as per UN where available, or CIA otherwise.
  14. As do all of the remaining 17 countries, apart from Hong Kong, Singapore and possibly Lebanon.
  15. In particular, using data from the NCDC.
  16. Top 2 renewable sources. The percentage indicated is in relation to the total identified energy resources of the country, rather than total electricity consumption. There are many countries which have renewable resources, but also import fossil fuels or electricity directly to meet their domestic needs. Of particular note, Montenegro and Dominica both produce 20-25% of their electricity from fossil-fuels which are assumed to be imported, and Andorra imports c. 60% of its electricity directly from Spain. Although coal reserves are the major reserves held by New Zealand, their actual production is c. 55% hydro.

Counting Calories

In putting together one of many educational “web toys” for my long-suffering children, I wanted to give them a picture of how well different countries feed their populations. At the time, I was unable to find annual calorie production data. However, the FAO does maintain a number of great datasets. I started with three datasets in particular:

Having imported the production tonnage for each of the various products, I undertook the rather tedious task of estimating (a) yields of edible material from reported weights, and (b) breakdown of protein, fat, and carbohydrate. Then using the 4-4-9 method of calorie estimation, we can build bottom-up calorie production figures. The top 10 calorie producers aren’t terribly surprising:

Country     Trillion kcal   Top 3 Sources
China       2,649           Rice, wheat, maize
Indonesia   1,801           Palm oil, rice, coconuts
India       1,600           Rice, wheat, sugar cane
USA         1,388           Soybeans, maize, wheat
Brazil      1,087           Soybeans, sugar cane, maize
Malaysia    1,076           Palm oil, palm kernels, rice
Russia      441             Wheat, barley, sugar beet
Thailand    401             Rice, palm oil, cassava
Argentina   378             Soybeans, maize, barley
Nigeria     348             Cassava, palm oil, yams
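As a concrete illustration of the 4-4-9 arithmetic behind those totals (the composition and tonnage figures below are invented, not taken from the FAO data):

JAVASCRIPT

  // 4 kcal per gram of protein, 4 per gram of carbohydrate, 9 per gram of fat.
  // The rice composition and tonnage here are rough, illustrative numbers.
  function kcalPer100g(proteinG, carbG, fatG) {
    return 4 * proteinG + 4 * carbG + 9 * fatG;
  }

  const ricePer100g = kcalPer100g(7, 75, 1);                  // ~337 kcal per 100 g of milled rice
  const tonnesProduced = 1e6;                                 // hypothetical edible tonnage

  const totalKcal = ricePer100g * 10 * 1000 * tonnesProduced; // per 100 g -> per kg -> per tonne
  console.log((totalKcal / 1e12).toFixed(2) + ' trillion kcal'); // 3.37 trillion kcal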

In general, large countries with large populations produce a lot of calories! However, in terms of potential quality of life, calories per capita is a bit more interesting:

Country           Daily kcal / person   Top 3 Sources
Malaysia          97,182                Palm oil, palm kernels, rice
Iceland           32,373                Capelin, cod, herring
Paraguay          25,683                Soybeans, cassava, maize
Uruguay           24,051                Soybeans, rice, wheat
Argentina         23,855                Soybeans, maize, barley
Denmark           23,638                Barley, wheat, pork
Canada            23,240                Wheat, rapeseed, barley
Australia         19,188                Wheat, barley, rapeseed
Indonesia         19,153                Palm oil, rice, coconuts
Solomon Islands   19,120                Palm oil, coconuts, sweet potatoes

That change produces a bit of a different picture, particularly if we ignore the palm oil skew. Now the Nordics and the south of South America evidence their relative calorie wealth.

Next Steps

Equipped with this food energy data, we can start to incorporate some of the other datasets, to form a more detailed picture of possible self-sufficiency and potential quality of life at the national level.

Notes:

  1. Or more accurately, how well they are capable of feeding their populations, assuming food was distributed appropriately.
  2. There are currently multiple sites run by the FAO. Faostat3 is the latest, but still in development, and does not include fishing statistics, which were available on an older site. The data used here is for 2013, which looks to be the last complete year of data from the FAO.
  3. I have not yet incorporated fish production figures from the small but growing aquaculture fish farms.
  4. Those working assumptions can be found here. The estimates are very crude, but hopefully in the right ‘ballpark’. If you see any grave errors, please let me know via email: josh.brayman at marketstack.com
  5. See, for example, these FAO notes
  6. Per annum
  7. As fish are reported by the FAO as individual species, in general fish appear here as top sources only for small island nations.
  8. Total calorie values for Indonesia and Malaysia in particular are skewed heavily by their massive palm oil production. Although palm oil is not generally consumed on its own, it is packed with calories, and I assume they could produce alternative calories with the same acreage.
  9. Ibid.
  10. Using UN population estimates for 2015.
  11. This video extols the tasty and interesting merits of cassava as a calorie source
  12. See #8 above
  13. Particularly for larger countries, national aggregates hide stark regional differences.

Data Exploration

Having reached the end of the technical mistake list with the last post, I hope to continue to write, AND avoid the pitfall of noise.

I’m interested in datapoints which bring us closer to understanding human phenomena – past, present, and future. Claus Moser expressed the sentiment that statistics “cast light on people’s lives”. However, as Twain popularized, statistics are often just “damned lies”.

Despite a healthy skepticism for statistics in general, a few of the data sources which have piqued my interest include:

The next few posts will explore some of the partial-truths about our civilization hinted at by these datasets, separately, and mashed up!

Notes:

  1. Although not all of those mistakes have been covered in detail, all the interesting bits (and some dull ones) should be covered!
  2. Source: Economist, 12 September 2015, p. 92. Unfortunately, I cannot seem to find the original quote.
  3. See Wikipedia for some discussion of the actual origin of the phrase “lies, damned lies, and statistics”
  4. There are currently multiple sites run by the FAO. Faostat3 is the latest, but still in development, and does not include fishing statistics, which were available on an older site.
  5. Like Faostat3, this endpoint is still in development. Historic data does not look to be complete, and not all countries report their exports with the same completeness.
  6. Much of the World Bank dataset is based upon Unctad
  7. Discovering Maddison is a joy. Few other economists or historians come close to his perspective.
  8. Profit motive may bias some of the figures, but having cross-checked a small sample with other sources, the data looks to have high integrity. The Energy Information Administration also provides a wealth of fascinating datapoints, which may again be subject to bias, and is largely limited to the US.
  9. Real and imagined!

It’s the data, stupid

First time out with MarketStack, we did not plan our data persistence based upon even our base-case data access needs. That was obviously a mistake!

Given the clunky architecture we originally had for MarketStack data access, I finally got around to doing some informal benchmarking of various databases and access methods for similar use-cases. For the databases, I just used what I happened to already have on my local machine, namely:

I thought about adding some others, but couldn’t be bothered.

Source Data

The data for the experiments was c. 15,000 news items scraped from the web over the last few months. The stats for migrating the data to each of the four databases are as follows and can also be found here:

                      MariaDB     PostgreSQL   Neo4j       Mongo
Total Items
  Number              1,589,492   1,589,492    1,588,908   1,588,908
  Space on Disk (mb)  418         169          1,741       1,667
  Average Space (b)   276         111          1,149       1,100
Content Items
  Number              15,025      15,025       15,025      15,025
  Space on Disk (mb)  283         17           581         556
  Average Space (b)   19,722      1,186        40,517      38,798
Stem Items
  Number              1,574,467   1,574,467    1,573,883   1,573,883
  Space on Disk (mb)  136         152          1,160       1,111
  Average Space (b)   90          101          773         740
Write Times (ms)
  Total               4,496,505   3,681,421    7,632,621   4,343,437
  Average / Item      2.8         2.3          4.8         2.7
  Average / kb        10.5        21.3         4.3         2.5
  Stdev               16%         22%          25%         14%

Unsurprisingly, the SQL databases exhibited better optimisation of disk space, with Postgres standing out in particular. Postgres also offered the best average write speed. Neo4j took up the most disk space (albeit not by much) and had the slowest writes. Write speed was fairly consistent for each of the four databases.

First Pass Results

Having migrated all the data successfully, the first order of business was to repeat the naïveté of the earliest MarketStack days and just search the databases directly using regex and boolean expressions:

Search Times (ms)
                     MariaDB   PostgreSQL   Neo4j     Mongo
Top 10 Stems         11,085    49,894       140,412   22,844
china AND germany    22,306    33,341       228,652   41,623
iraq AND NOT islam   9,101     7,191        220,142   17,703
iran OR nuclear      23,943    8,584        213,550   25,858

Again, no real surprises. All of the searches performed poorly, with Neo4j an order of magnitude worse than most of the others. However, that was running with a close-to-worst-case query of the form:

CYPHER

  MATCH (c:Content)-[r]-(s:Stem)
  WITH DISTINCT c.title AS t, count(DISTINCT s.stem) AS count,
       str(collect(s.stem)) AS stems
  WHERE stems =~ '(?i).*iran.*'
     OR stems =~ '(?i).*nuclear.*'
  RETURN t, count, stems
		

For the SQL databases, the query wasn’t much better:

SQL

  SELECT title FROM st_content 
  WHERE (
  	  LOWER(stems) LIKE '%iran%'
  	  OR LOWER(stems) LIKE '%nuclear%'
  )
		

And for mongo:

MONGO

  db.content.find(
      {'data.nlp.stems': {'$in': ['iran', 'nuclear']}},
      {'_id': 0, 'meta.title': 1}
    )
		

ID Indexing

Given the first pass was informative but didn’t work very well, the next step was to see how the out-of-the-box indexing worked for each database. As part of that testing, I used two basic sets of queries - one to generate N random-ish ids (using a random offset), and one to retrieve the content associated with those ids:

SQL

/* Run n times with varying random integers 
 between 1 and 15,000 for the offset */
SELECT _id FROM st_content LIMIT 1 OFFSET :offset

/* Using the results from the query above, run the following: */
SELECT title FROM st_content WHERE _id IN (:_ids)
		

CYPHER

/* Random id */ 
MATCH (n:Content) RETURN id(n) AS i SKIP {_random} LIMIT 1

/* Content retrieval */
MATCH n WHERE id(n) IN {_ids} RETURN n.title AS t
		

MONGO

// repeat n-times for the ids  
db.content.find(
    {},
    {'_id': 1}
  ).limit(1).skip(_random)

// retrieve the content
db.content.find(
    {'_id': {'$in': _ids}},
    {'_id': 0, 'meta.title': 1}
  )
		

The results spoke for themselves, with all databases performing well:

Execution Times (ms)
                  MariaDB   PostgreSQL   Neo4j   Mongo
ID generation     8.9       3.9          35.2    11.3
Retrieval by ID   10.7      3.7          44.1    11.6

Stem indexing

With those promising id results, it was time to use the indexing to repeat our searches. Although it would require more steps and multiple return trips to disk, the performance could hardly be worse than our first naïve attempt! The steps generally were:

  • Get the content ids from the stems for each search term
  • Perform the boolean logic in the application by manipulating the resulting id arrays (sketched after this list)
  • Retrieve the content with the resulting set of content ids
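The set operations themselves are straightforward; here is the same logic sketched in JavaScript for illustration (the application code in these benchmarks was actually Ruby):

JAVASCRIPT

  // Boolean search implemented as set operations on the id arrays returned by
  // the stem queries. The id values are invented for illustration.
  const chinaIds = new Set([1, 2, 3, 5, 8]);       // ids returned for 'china'
  const germanyIds = new Set([2, 5, 13]);          // ids returned for 'germany'

  const and = [...chinaIds].filter(id => germanyIds.has(id));      // AND -> intersection
  const or = [...new Set([...chinaIds, ...germanyIds])];           // OR -> union
  const andNot = [...chinaIds].filter(id => !germanyIds.has(id));  // AND NOT -> difference

  // The resulting ids then feed the familiar retrieval query,
  // e.g. SELECT title FROM st_content WHERE _id IN (...)
  console.log(and, or, andNot);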

This time, the queries took the following general form:

SQL

/* Get the content ids according to the logical operators */
SELECT content_id FROM st_stem WHERE LOWER(stem) IN ('china');
SELECT content_id FROM st_stem WHERE LOWER(stem) IN ('germany');

/* After doing the set operations, run the familiar query */
SELECT title FROM st_content WHERE _id IN (:_ids)
		

CYPHER

MATCH (s:Stem)-[r:STEM_IN]-(c:Content) WHERE s.stem IN ['china'] RETURN id(c) AS i;
MATCH (s:Stem)-[r:STEM_IN]-(c:Content) WHERE s.stem IN ['germany'] RETURN id(c) AS i;

/* Content retrieval */
MATCH n WHERE id(n) IN {_ids} RETURN n.title AS t
		

MONGO

// created an index on meta.stem and avoided using '$in'
db.stems.find(
    {'meta.stem': stem},
    {'_id': 0, 'data.content.id': 1}
  )

// retrieve the content
db.content.find(
    {'_id': {'$in': _ids}},
    {'_id': 0, 'meta.title': 1}
  )
	

Using that, the results were:

Search Times (ms)
                     MariaDB   PostgreSQL   Neo4j   Mongo
china AND germany    12        57           5,915   4,474
iraq AND NOT islam   18        46           6,610   6,254
iran OR nuclear      6         21           3,724   1,898

Surprises

I was surprised by the results in a number of ways:

  • Just how bad the results of the worst-case search were, particularly for Neo
  • Just how good the write performance and disk usage were for PostgreSQL
  • Just how good PostgreSQL was with indexed values
  • Just how good MariaDB was without indices, particularly with the benefit of its caching
  • Just how quick Ruby was in doing the set operations

Further Investigation

Areas for further investigation to account for the differences in results include:

  • Performance of the various ruby drivers
  • Differences in the ruby application code
  • Caching behaviours

Conclusions

Subject to the further investigations above, the headline figures suggest that, despite SQL having been uncool for some time now and my own aversion to schemas (particularly for prototyping), SQL offers a performance edge worth considering, at least for data of this size. Maybe the old MarketStack approach wasn’t that far off after all?!

Notes:

  1. In a fit of madness, I am contemplating a reboot of sorts for MarketStack.
  2. We used mostly normalised data persisted first in MySQL and then in MariaDB. Our core tables were each 1-5 million rows. Any structured queries were impossibly slow, and nested SELECT queries crashed the database in the MySQL days. Effectively duplicating indexing tables in Redis brought performance back to a passable level, though our architecture was still sh1tfully slow. Thanks again to Wayne Moore for the Redis recommendation. Once we had the indices from Redis, we used JDBC / Hibernate to access the data through our thousands of lines of nasty, SOA-done-horribly-wrong Java code.
  3. The stats for this blog are not meant to constitute a rigorous analysis!
  4. My long-lived but somewhat knackered Acer Aspire 4810T with 3GB of RAM running Ubuntu 14.04.1 LTS x86-64
  5. MariaDB version 10.0.4 with version 0.3.15 of the mysql2 gem.
  6. PostgreSQL version 9.3.3 with version 0.17.1 of the pg gem.
  7. Neo4j version 2.1.7 with version 4.0.5 of the neo4j-core gem. Version 2.2 apparently offers some significant performance gains, but laziness prevented upgrading.
  8. Mongo version 2.6.1 with version 1.10.0 of the mongo gem.
  9. In particular, I made a start with Cassandra, but decided it was too old and too Java. I also looked at Hive, but again, all the associated Java infrastructure put me off.
  10. The items were persisted in another Mongo instance.
  11. The disk space for the Neo4j content items is estimated as the total disk space pro-rated by the same relation between content items and stem items observed in Mongo.
  12. The items had some light NLP pre-processing to extract tokens and their stems.
  13. Not a true standard deviation. As can be seen in the raw data, the 15k content items were migrated in 15 uneven batches. This “standard deviation” is calculated as the standard deviation of the mean write time for each of those 15 batches, divided by the overall mean time.
  14. Mean time across 3-4 runs.
  15. Top 10 most frequently occurring stems across all the content.
  16. Figures are mean execution times per id. Three trials were run for each of 15 and 50 ids. Although comfortably fast, the SQL databases performed least consistently.
  17. After a first poor attempt, creating indices on the stem for Neo4j, Mongo and Postgres (100x improvement) made a big difference for all. The difference for Neo was massive. MariaDB didn’t seem to need it.
  18. Namely, set addition for OR, intersects for AND, and subtraction for AND NOT
  19. Mean search times across 3-4 runs. It is worth noting that MariaDB effectively caches recent search results, meaning subsequent runs are < 10ms each.
  20. 7 results
  21. 152 results
  22. 232 results
  23. MongoDB is my first choice for prototyping at the moment.

Debugging iOS Safari on Linux

Largely unrelated to MarketStack mistakes, I recently had to debug some HTML audio and video issues on the iPad, and came across iOS Webkit Debug Proxy which allowed me to debug the iPad on my Linux laptop. My configuration was as follows:

  • Ubuntu 14.04.01 LTS (x86_64)
  • Chrome 40.0.x
  • iOS 7.0.4

Thanks to Pål Ruud for his gist, which, along with the project instructions, got me started. However, there were a few changes / pieces I was missing, which I thought I’d note here, at least for my own future reference!

On the installation:

ios_debug_install.sh

  #!/bin/bash
  curdir=$PWD
  homedir=$HOME
  mkdir ~/ios-proxy && cd $_
  git clone https://github.com/google/ios-webkit-debug-proxy
  sudo apt-get install \
  autoconf automake \
  libusb-dev libusb-1.0-0-dev \
  usbmuxd \
  libimobiledevice-dev
  wget http://www.libimobiledevice.org/downloads/libplist-1.11.tar.bz2
  tar -xvf libplist-1.11.tar.bz2
  cd libplist-1.11/
  ./configure --prefix=${homedir}/ios-proxy/ && make && make install
  cd ../ios-webkit-debug-proxy
  ./autogen.sh
  LDFLAGS="-L${homedir}/ios-proxy/lib/" CFLAGS="-I${homedir}/ios-proxy/include/" ./configure --prefix=${homedir}/ios-proxy
  make && make install
		

Once that has installed correctly, enable the Web Inspector on the iPad: Settings -> Safari -> Advanced -> Web Inspector.

Then connect the iPad to a USB on the Linux device, and allow the connection on the popup on the iPad.

Now launch Chrome on Ubuntu, but be sure to use the remote debugging flag: google-chrome-stable --remote-debugging-port=9222 &

Then fire up the webkit proxy:

safari-proxy

  #!/bin/bash
  LD_PRELOAD=${HOME}/ios-proxy/lib/libplist.so ${HOME}/ios-proxy/bin/ios_webkit_debug_proxy		
		

After which you should see the iOS device connected in the shell output:

  Listing devices on :9221
  Connected :9223 to iPad (53134bbf5bcef6fcd4165c3419487)
		

One can then navigate to http://localhost:9221/json to confirm that the device is attached, though it should just mirror the shell output from running safari-proxy. More interestingly, http://localhost:9223/json should list the sites open on iOS, assuming the device is awake and Safari is open. The key piece of information from http://localhost:9223/json is the page number of the tab you want to debug. I had no success with the URLs listed there, but was able to connect to the tab by opening: http://localhost:9222/devtools/devtools.html?ws=localhost:9223/devtools/page/{PAGE_NUMBER}.

I did experience the proxy and Safari sporadically crashing, with a segfault shown in the shell. Also, the JavaScript console shown in Chrome seems to be one-directional: I could see output from iOS, but could not see any evidence that code typed into the console was executed by Safari. However, navigating elements in Chrome does highlight them in Safari.

All in all, a very useful toolkit!

Notes:

  1. We bought an iPad for MarketStack. That might have been a mistake. Unlike most of the world, and an increasingly high proportion of developers, I don’t actually like Apple products.
  2. I could not get the iOS proxy to work on a 35.x version Chrome.
  3. I was running it on 9222 which is in the range of ports used by the proxy server, but it can probably be run on any free port.

Logging Responsibly

We used log4j with Spring. That was actually okay.

The following guidelines did work well for us:

  • Logging IS essential: without it, we never would have been able to figure out all of our other mistakes! In development, I have no idea how some folks work without some decent logging. In production, how else will you figure out why stuff is slow or broken?
  • Details are important: without good detail, the log file can verge on useless (a minimal sketch of a suitably detailed log line follows this list). At a minimum, log entries should include:
    1. timestamp, down to the millisecond, as performance drags add up a few ms at a time
    2. origin, by which I mean the class and method generating the log statement. Without that information, it is very difficult to track down problems
    3. thread id if running a multithreaded application. Again, without this information, tracking down issues is nigh on impossible.
    4. request id if your application is running many concurrent requests.
    5. meaningful messages instead of the n00b’s “the code is here!” style of message
  • Use files: What could be worse than logging to the console / System.out on a server?
  • Manage the files: Use daily rolling logs, and archive or delete the old stuff as appropriate. It can be all too easy to fill a hard drive with ancient log files!
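As a minimal illustration of that level of detail (a sketch, not our log4j configuration), a log line might be assembled along these lines:

JAVASCRIPT

  // A minimal, illustrative log line carrying the fields listed above:
  // millisecond timestamp, origin, request id and a meaningful message.
  function logLine(level, origin, requestId, message) {
    const ts = new Date().toISOString();   // e.g. 2015-09-12T17:33:34.123Z
    return ts + ' [req:' + requestId + '] ' + level.toUpperCase() + ' ' + origin + ' - ' + message;
  }

  console.log(logLine('debug', 'UserService.getUserCount', 'a1b2c3',
                      'fetched 1,204 enabled users in 38ms'));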

Unsurprisingly, a few caveats from our experience as well:

  • Beware the dependencies: In the minefield of Java, if you plumb the depths of TRACE, and sometimes even DEBUG, you will see what prolific logging really looks like, as your dependencies (e.g., Spring) take over your log files and render them practically useless. If you do not want to hear from those dependencies (occasionally, you do), be sure to configure your logging accordingly.
  • Keep it brief: Do not log full response bodies from various web services, whether they be your own or otherwise. They are long, and in production you just don’t want to see that stuff.
  • Beware the loop: It should be obvious, but if you are running a loop with hundreds or thousands (or more!) iterations, try to avoid logging anything inside that loop.

Experimental Data:

In order to spice things up with a bit of data, I have compared the logging and computational performance of five languages across seven different setups. With each setup, I just calculated some Riemann sums for a Chi distribution. I used steps of 0.01 from zero to six (600 steps) just to get the code to do a small amount of work. All of the code is available on Bitbucket.
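For reference, a minimal JavaScript sketch of that workload (not the benchmarked code itself) might look like the following; the post does not specify the degrees of freedom, so k = 2 is assumed here, for which the chi PDF reduces to x * exp(-x^2 / 2):

JAVASCRIPT

  // Left Riemann sum over the chi PDF with k = 2 (an assumption), in 600 steps
  // of 0.01 from zero to six, as described above.
  const step = 0.01;
  let sum = 0;

  for (let i = 0; i < 600; i++) {
    const x = i * step;
    const pdf = x * Math.exp(-x * x / 2);   // chi PDF for k = 2
    sum += pdf * step;
    // logger.debug(x, sum);                // the logging toggled on and off in the experiment
  }

  console.log(sum);                          // ~1, since the PDF integrates to 1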

Processing Time (ms)
                    Logging Disabled   Logging Enabled   Multiple
JS (in Browser)     11                 752               67x
JS (Node)           7                  631               96x
PHP                 24                 1,480             62x
Ruby                104                3,883             37x
C#                  26                 6,599             252x
Java (Spring)       25                 1,856             75x
Java (Dropwizard)   32                 60                2x

Observations:

Just based upon the experimental process and results:
  • Node offers very fast performance, as well as very quick development
  • Sadly, though Ruby is an absolute pleasure to code, it is a bit of a performance laggard
  • Dropwizard is a very nice bit of kit from the Yammer folks
  • Spring Boot was so promising, but it brought back all the nightmares of temperamental Spring configurations, painfully slow build times, and very average performance.

Notes:

  1. Though Spring + log4j was far from optimal, as the results in the table suggest.
  2. Arithmetic mean of just 10 runs per data point.
  3. Processing time with logging enabled divided by processing time with logging disabled.
  4. JS in the browser was just logging to console, with the console window closed.
  5. Node application used log4js.
  6. PHP demo used log4php for this experiment, though there are some leaner, bespoke approaches.
  7. Ruby implementation used a customised version of the standard logger gem. See code for details of modifications.
  8. C# demo used log4net, though I’m not sure why the performance was so exceptionally poor!
  9. Java + Spring used spring-boot-starter-log4j. Not sure why the performance was so poor.
  10. The Java Dropwizard implementation uses slf4j plus some of their own wizardry. Lightning fast! I suspect they spin up a separate thread for the file I/O, but need to check.

Twisted Templating

For far too long, we used Freemarker, and that was a mistake.

Based upon the painful experience we had with Freemarker, I now have a checklist for templating languages:

  • Minimal: As close to HTML / Javascript as possible - it should be perfectly readable by a front-end developer without any knowledge of the back end language
  • Data Objects: Easy access to data objects with minimal jiggery pokery
  • Interpreted: Changes on the fly without any need to recompile
  • Setup: Should be trivial to switch the web application to and from the templating language
  • Documentation: Must be well documented either through primary documentation or an active developer community

Our Freemarker Hell

At first glance, Freemarker might pass most of the criteria above. A simple template from the MarketStack archives:

news-snippet.ftl

<!-- id parameters to be implemented in Java -->
<#macro mainnewswidgetpreview channel item>
	<div class = "news-item-container" id = "newspreview${item.id}">
		<div class = "bordertop">&nbsp;</div>
		<div class = "news-table" >
			<div class = "news-preview">
				<#assign newsItem = item>
				<br />
				<p><b>${newsItem.title}</b></p>
				<p><span class = "news-preview-source">(${channel}:  
					<#if ((newsItem.pubDate??) && (newsItem.pubDate?size > 0)) >
						${newsItem.pubDate?substring(0,newsItem.pubDate?last_index_of("2012")+4)})
					<#else>
						--
					</#if>			
				</span></p>
				<br />
				${newsItem.description}
				<br /><!-- gotta love some hard-coded styling -->
				<a href = "${newsItem.link}" target = "_blank">&lt;Full Story &gt;</a>
				<br />
			</div><!-- end news-preview -->
		</div><!-- end news-table -->	
	</div><!-- end news-item-container -->
</#macro>	

HOWEVER, behind that apparent simplicity sat the all-too-common Java nastiness. In this particular case, here is but one class required to actually get the templates to work:

package com.marketstack.ui.servlet;

import com.marketstack.ui.util.StackTemplateExceptionHandler;
import com.marketstack.util.StackLogger;

import java.io.IOException;
import java.util.HashMap;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import freemarker.template.Template;
import freemarker.template.TemplateModel;
import freemarker.template.Configuration;
import freemarker.template.SimpleHash;
import freemarker.ext.servlet.FreemarkerServlet;

import freemarker.ext.dom.NodeModel;

public class ControllerServlet extends FreemarkerServlet {
	private StackLogger logger = new StackLogger(this);
    private Configuration cfg;
    private String parent_init = "true";
    private TemplateModel model;
    private HashMap mymodel;
	protected Boolean printStackTrace, throwTemplateException;
    
    public void init() {
    	logger.log("debug", "ControllerServlet init() called");
		try{
    		super.init();
    		cfg = getConfiguration();
    	}catch (Exception e){
    		// not the best handling of the situation...
    		cfg = new Configuration();
    		parent_init = e.getMessage();
			logger.log("debug", "init() getConfiguration() exception "+e);
    	}
        // - Templates are stored in the WEB-INF/templates directory of the Web app.
        cfg.setServletContextForTemplateLoading(
                getServletContext(), "WEB-INF/templates");
		
		// required for incorporating Freemarker variables into Xpath
		try{
			freemarker.ext.dom.NodeModel.useJaxenXPathSupport();
		}catch (Exception e){
			/* 
			 * Jaxen classes not present, so will fall back to Xalan, but need to deal with that properly if FTLs are using Jaxen functionality!
			 */	
		}
    }
    
    /**
     * Called before the execution is passed to template.process().
     * This is a generic hook you might use in subclasses to perform a specific
     * action before the template is processed.
     *
     * @param request 
     * @param response
     * @param template	 	the template that will get executed
     * @param data 			the data that will be passed to the template. By default this will be
     *        an {@link AllHttpScopesHashModel} (which is a {@link freemarker.template.SimpleHash} subclass).
     *        Thus, you can add new variables to the data-model with the
     *        {@link freemarker.template.SimpleHash#put(String, Object)} subclass) method.
     * @return true to process the template, false to suppress template processing.
     */
    protected boolean preTemplateProcess(
        HttpServletRequest request,
        HttpServletResponse response,
        Template template,
        TemplateModel data)
        throws ServletException, IOException{

    	Object session = request.getSession();
    	((SimpleHash) data).put("stack_session", session);
    			
		try{
			StackTemplateExceptionHandler handler = new StackTemplateExceptionHandler();
			handler.setTrace(false);
			handler.setThrowException(true);
			template.getConfiguration().setTemplateExceptionHandler(handler);  
		}catch(Exception e){
			logger.log("debug","preTemplateProcess() encountered exception with init params and setTemplateExceptionHandler "+e);
		}
    	putVars(request, response, data);
    	return true;
    }

	/**
	 * Empty method which should be overridden by subclasses to put more specific data into
	 * the data model available to the templates
	 */
    protected void putVars(HttpServletRequest request,
    						HttpServletResponse response,
    						TemplateModel data){}

}//end class	

And that class required a subclass to actually do something useful!

Other Java minefields

Freemarker is not alone in its distinction as a poor templating language. However, as illustrated above, the Freemarker syntax itself borders on reasonable; the Java piece is its downfall. Velocity shares the same shortcoming.

Both Freemarker and Velocity were designed to address certain shortcomings in the all-too-common JSP (Java Server Pages). JSP is unquestionably powerful (particularly for hackers of Java web application servers!), but as a templating language it makes clean MVC separation difficult to achieve. Like ASP, most JSP contains substantial inline Java programming rather than simple front-end views.

Sane alternatives

There are some beautiful templating languages out there. The elegance of Slim stands out as an exceptional example. ERB is okay, but the JSP-style <% directives %> freak me out a bit. Twig from the folks at Sensio Labs really does prettify and simplify the usual PHP spaghetti. In fact, Twig almost looks like the Javascript goodness of Handlebars and its inspired forefather, Mustache.
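As an illustration of that simplicity (a sketch, not MarketStack code), the news snippet from the Freemarker example above might reduce to something like the following with mustache.js:

JAVASCRIPT

  // The earlier news snippet as a lean Mustache template, rendered with the
  // mustache.js library. The field names mirror the Freemarker example; the
  // data values are invented.
  const Mustache = require('mustache');

  const template = `
  <div class="news-item-container" id="newspreview{{id}}">
    <p><b>{{title}}</b></p>
    <p><span class="news-preview-source">({{channel}}: {{pubDate}})</span></p>
    {{description}}
    <a href="{{link}}" target="_blank">&lt;Full Story&gt;</a>
  </div>`;

  const html = Mustache.render(template, {
    id: 1,
    channel: 'Reuters',
    title: 'The greatest story ever told',
    pubDate: '25 December 2012',
    description: 'Santa Claus was photographed flying over the streets of Slough...',
    link: 'http://example.com/story/1'
  });

  console.log(html);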

Conclusions:

Java may have its place in the world, but front-end templating isn’t it!

Notes:

  1. None this time!

Mind Your Scaffolding

We used Roo. That was a mistake.

Although there are a number of Roo-specific points which made it unsuitable for us, the experience left me biased against “rapid development” tools (“RDTs”).

I had a tinker with a few RDTs just to compare them and see how Roo stacked up:

    Rapid Development Tools

  • Java: Roo

    As for Roo, the results of a relatively simple MVC scaffold template (a pizzashop in this case) speak for themselves:

    Roo pizzashop MVC

    The extent of the file propagation is clear — it’s ugly!

    Some alternatives to Roo (for Java / Scala / Groovy development) include Grails, Play, and Lift.

  • PHP: CakePHP

    CakePHP was fairly satisfying to use. The output for the tutorial ToDo webapp is certainly cleaner than Roo:

    CakePHP ToDo MVC

    After a bit of fiddling, CakePHP was remarkably tasty with its bake scaffolding command.

    Although there are countless CRM frameworks built in PHP, some of the more prominent RDTs include Symfony and Zend.

  • Python: Django

    Next to Rails, Django is probably the best known and perhaps most widely used RDT. At first glance, it offers a very lean output:

    Django requires remarkably few files

    However, two caveats:

    • some of those files contain what would, in many other languages, be spread across multiple files, e.g., multiple classes defined in a single file
    • much of Django’s rapid development power relies upon customising the admin functionality as a template
  • Ruby: Rails

    Everyone wants to be Rails. As an RDT, that envious position is probably well deserved. The file output for the blog tutorial is a bit heavy, but not terrible:

    Rails Blog — a scaffolding classic

    I can genuinely see why Rails has such a vibrant and growing community. For many use cases, it really can facilitate more rapid development.

  • Javascript

    Although Javascript is awash with front-end frameworks such as Backbone, Knockout, and AngularJS, server-side NodeJS frameworks are in their infancy. However, there appear to be some very promising starts from:

  • C# / .NET

    Although there are some .NET tools, my impression is that most RDT–like functionality is built into the .NET IDEs like Visual Studio. For Mono users who are developing outside of Windows, there are few RDT solutions.

RDTs in general

The key obstacles to an RDT actually improving my already low rates of personal productivity include:

  • File explosion: the tool generates an unwieldy number of files which require editing or maintenance
  • File formats: the tool uses a non-native dialect
  • No command line: the tool requires a high–overhead, full–blown IDE, as opposed to a nice, lean, unobtrusive, command line executable

Our Roo experience

Our Roo experience kinda sucked for many reasons, and to be fair to Roo, the majority of those reasons were our own misuse of it. However, factors contributing to the sucky experience included:

  • File explosion: utterly out of control.
  • AspectJ: hated it. Each file comes with its own lovely warning:
    // WARNING: DO NOT EDIT THIS FILE. THIS FILE IS MANAGED BY SPRING ROO.
  • IDE: even without going as far as using the full–blown STS, the file explosion and the AspectJ mean you can’t get away from using at least the Roo shell on an ongoing basis.
  • CRUD: through no fault of at all of Roo, the auto–generated CRUD interface did not work for us
  • Templating: the Spring MVC dialect of JSP (with Spring tags) was pure pain
  • Authentication: again through no fault of Roo, we had some initial teething problems with the Spring authentication

On the plus side,

  • We liked the rich object model
  • If you don’t look at the JSP side of it, the ui.ModelMap works like magic!

Conclusions:

  • Some RDTs, in some use cases, do save time
  • For a very quick throw–away demo or wireframe, the right RDT can be perfect
  • For a real application, RDTs or their design patterns are never a substitute for actually designing (and iterating) your own application architecture.

Notes:

  1. Quite often, RDTs are referred to as “scaffolding” tools, but many common RDTs do not offer true scaffolding functionality, such as that provided by rails generate. In particular, neither Django nor CakePHP appear to offer command line model generation. CakePHP can bake from an existing database, and Django can start from either a database or manually crafted code in models.py
  2. Though I did not run Grails directly, it looks to follow many of Roo’s patterns rather closely.
  3. Play looks to be motivated by Scala afficionados, and though I have not used it, from the documentation, it looks to have some bulky overhead.
  4. Lift is another framework from a team including long time developer David Pollak.
  5. In the quick demo I ran for CakePHP, I only used bake to extrapolate from an existing database schema created directly in SQL. I kinda liked that approach, but obviously it is a bit less of a one-line piece of magic than rails generate.
  6. Clearly Symfony is very ambitious and the product of an immense amount of work from the team at SensioLabs. It may just be that the documentation got the better of me, but I couldn’t see that Symfony actually offered any command-line RDT functionality. With a codebase of c. 50MB (with dependencies, but still massive for PHP), there must be some great features in there, but it was just far too heavy for me.
  7. Weighing in at c. 16MB with dependencies for the SkeletonApp, Zend is a bit more lightweight than Symfony. However, perhaps even more so than Symfony, Zend does not appear to offer any command-line RDT functionality, and is really just a set of libraries and a design pattern. Using all of that requires good ol’ fashioned manually created files and hand coding.
  8. As is often the case with Python
  9. Much like Symfony or Zend
  10. For example the Monorail project, which is unaffiliated with the (rather excellent) Mono project.
  11. For example, if the native language is Java, then the files should all be Java / JSP. Strange dialects are often employed in the front–end templating, and less often in config files, but can occur elsewhere in the MVC stack.
  12. I am probably being intolerant or neanderthal about it, but the whole idea of using aspect-oriented programming on top of object oriented programming looks like a recipe for disaster at the outset. The idea that underlying architecture (OO or otherwise) cannot be designed to address the various concerns, and therefore needs to be hacked with AOP just sounds like an excuse for not sorting out the underlying architecture. Roo was exactly such a hacked excuse for us. It was a bad idea.
  13. Some of our problem was related to our own unnecessarily complex data model. We ended up building a bulk data import tool largely outside of the Roo app, and then ultimately we ran some SQL scripts to hoover up the data directly into the database.
  14. I am a big fan of the Spring framework in general thanks to Clovis Chapman, but not so for their MVC. Then again, I don’t like JSP either.
  15. Though we have now gotten used to the usual model > dao > service design pattern.

Use SOA responsibly

SOA and I had a mutually abusive relationship.

I naively contorted the idea of service-oriented architecture, and now I still have the scars from that monster I created.(1)

In order to avoid creating a monster, there are two key questions to ask about a given piece of functionality:(2)

  1. Should it be accessible via HTTP, or can it be native?
  2. If it should be exposed as a RESTful interface, should it be in its own webapp, or appended to an existing webapp?

With the benefit of hindsight, I might approach it as follows:(3)

This post is concerned with the reasons to carefully consider the consequences of answering “Yes” to the first question above. Once that point is passed, many of the subsequent questions are driven by my thoughts on WAR Bloat.

REST vs. Native

A good RESTful service is a thing of great beauty. Exposing some functionality as a web service is all too easy with some Java and Spring annotations:

RestResource.java

package com.marketstack.rest;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.QueryParam;

import org.codehaus.jettison.json.JSONException;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Scope;
import org.springframework.stereotype.Component;


/**
 * This class exposes an HTTP service and on the face of it
 * looks relatively clean
 */
@Path("/rest")
@Component
@Scope("request")
public class RestResource{

  @Autowired
  private ServiceConnector serviceConnector;

  @GET
  @Path("/userCount")
  public String userCount(@QueryParam("foo") String foo) throws JSONException {
    return serviceConnector.getUserCount(foo);
  }//end method userCount

}//end class
			

So far, so good. But what about consuming the output of a web service? In the trivial example above, our service actually connects to another service. Now we start to require a fair amount of plumbing. For starters, you probably need a helper class to deal with the low level grubbiness of executing the request:(4)

CommunicationService.java

package com.marketstack.rest;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;
import org.apache.log4j.Logger;

/**
 * Class which actually talks to other HTTP resources
 */
public class CommunicationService {

private static final Logger LOGGER = Logger.getLogger( CommunicationService.class );

/**
* Method to post some JSON via a POST
*
* @param address      URL address
* @param parameters   Parameters for the URL address (optional)
* @param accept       "application/xml", "application/json", etc.
* @return             The response from the server
*/
public String makeJSONRequest( String address, String parameters, String accept) {
  String result = null;
  if( address.startsWith( "http://" ) || address.startsWith( "https://" ) ) {
    try {
      HttpPost httpPost = new HttpPost( address );
      if( accept != null ) httpPost.setHeader( "Accept", accept );
      httpPost.setHeader( "Content-Type", "application/json");
      if( parameters != null ) httpPost.setEntity(new StringEntity( parameters ));
      DefaultHttpClient httpClient = new DefaultHttpClient();
      LOGGER.info( "This clumsy plumbing is about to try to connect to " + address );
      HttpResponse httpResponse = httpClient.execute( httpPost );
      HttpEntity entity = httpResponse.getEntity();
      String jsonResponseString = EntityUtils.toString( entity );
      EntityUtils.consume( entity );
      result = jsonResponseString;
    } catch( Exception e ) {
      LOGGER.error("Sorry dude. Connection gymnastics did not go according to plan.");
    }//end try / catch
  }else {
    LOGGER.error( "The URL is a mess you fool" );
  }//end if / else
  return result;
}//end method makeJSONRequest

}//end class			
			

Given that the CommunicationService is fairly low level, you might abstract a bit and proliferate additional classes to connect with specific services. For example:

ServiceConnector.java

package com.marketstack.connectors;

import com.marketstack.rest.UserCountResponseObject;
import com.marketstack.rest.CommunicationService;
import org.apache.log4j.Logger;
import org.springframework.beans.factory.annotation.Value;


/**
 * A class which connects to another HTTP service
 * via the CommunicationService
 */
public class ServiceConnector {
  private static final Logger LOGGER = Logger.getLogger( ServiceConnector.class );
  private static final String ERROR_VALUE = "SOME USELESS ERROR STRING VALUE";
  private CommunicationService comService = new CommunicationService();

  @Value("${stack.connectionPath}")
  private String connectionPath;

  /**
   * @param foo
   *      a random extraneous parameter
   * @return 
   *      the total number of enabled users
   */
  public String getUserCount(String foo) {
    String result;
    try {
      UserCountResponseObject response = new UserCountResponseObject( comService.makeJSONRequest(connectionPath, foo, "application/json"));
      result = response.isOkay() ? response.getCount() + "" : ERROR_VALUE;
      LOGGER.debug( "Result from connector = " + result );
    } catch( Exception e ) {
      LOGGER.error("ServiceConnector bombed out");
      result = ERROR_VALUE;
    }// end try / catch
    return result;
  }//end method getUserCount()
}//end class			
			

Yes, with some nice helper classes, it’s easy to be cavalier and create RESTful methods with reckless abandon, but there is still a lot going on under the hood. The logs provide a hint of the overhead of using HTTP instead of native methods:

Service.log

17:33:34  SingleClientConnManager - Get connection for route HttpRoute[{}->http://localhost:8080]
17:33:34  DefaultClientConnectionOperator - Connecting to localhost/127.0.0.1:8080
17:33:34  RequestAddCookies - CookieSpec selected: best-match
17:33:34  DefaultHttpClient - Attempt 1 to execute request
17:33:34  DefaultClientConnection - Sending request: POST /somePath HTTP/1.1
17:33:34  wire - >> "POST /somePath HTTP/1.1[\r][\n]"
17:33:34  wire - >> "Content-Type: application/json[\r][\n]"
17:33:34  wire - >> "Content-Length: 249[\r][\n]"
17:33:34  wire - >> "Host: localhost:8080[\r][\n]"
17:33:34  wire - >> "Connection: Keep-Alive[\r][\n]"
17:33:34  wire - >> "User-Agent: Apache-HttpClient/4.1 (java 1.5)[\r][\n]"
17:33:34  wire - >> "[\r][\n]"
17:33:34  headers - >> POST /somePath HTTP/1.1
17:33:34  headers - >> Content-Type: application/json
17:33:34  headers - >> Content-Length: 249
17:33:34  headers - >> Host: localhost:8080
17:33:34  headers - >> Connection: Keep-Alive
17:33:34  headers - >> User-Agent: Apache-HttpClient/4.1 (java 1.5)
17:33:34  wire - >> "{"foo":"bar"}"
17:33:37  wire - << "HTTP/1.1 200 OK[\r][\n]"
17:33:37  wire - << "Server: Apache-Coyote/1.1[\r][\n]"
17:33:37  wire - << "Content-Type: text/plain;charset=ISO-8859-1[\r][\n]"
17:33:37  wire - << "Content-Length: 39[\r][\n]"
17:33:37  wire - << "Date: Mon, 18 Feb 2013 17:33:37 GMT[\r][\n]"
17:33:37  wire - << "[\r][\n]"
17:33:37  DefaultClientConnection - Receiving response: HTTP/1.1 200 OK
17:33:37  headers - << HTTP/1.1 200 OK
17:33:37  headers - << Server: Apache-Coyote/1.1
17:33:37  headers - << Content-Type: text/plain;charset=ISO-8859-1
17:33:37  headers - << Content-Length: 39
17:33:37  headers - << Date: Mon, 18 Feb 2013 17:33:37 GMT
17:33:37  DefaultHttpClient - Connection can be kept alive indefinitely
17:33:37  wire - << "{"status":"200","executionTimeMs":2466}"
17:33:37  SingleClientConnManager - Releasing connection org.apache.http.impl.conn.SingleClientConnManager$ConnAdapter@5a05c9a8			

In addition to all of the above, I have not detailed what might be in something like the UserCountResponseObject, i.e., the overhead associated with converting the JSON of HttpRequests and HttpResponses to more user-friendly POJOs. I hope to cover that in a subsequent post.

Conclusions:

SOA sounds great in principle. SOA can be great in practice. However, it isn’t idiot proof. Be very disciplined about:

  • When you use it
  • How you implement it

In short, use SOA responsibly.

Notes:

  1. We ended up with service-oriented spaghetti, rather than a service-oriented architecture. That spaghetti was the result of several factors including (but not limited to):
    • first and foremost, lack of experience
    • an attempt to let everyone go away and code in their own corner
    • lack of detailed, explicit coding conventions
    • initial lack of DVCS
    • lack of clarity about hardware configuration
  2. Assuming you already have one abusive relationship named “Java”.
  3. Thanks to the folks over at ProcessingJS, as well as Seb Lee-Delisle and KhanAcademy for bringing it to my attention.
  4. In our infinite wisdom, we had several mutually incompatible flavours of this type of class littered throughout our code.

Don’t Fight the JSON

I confess: even in 2010, I thought XML was cool.

In 2010, I also thought Javascript was a silly spaghetti language to be avoided except where unavoidable. Therefore JSON, as the spawn of that ridiculous language, was necessarily unfit for use in development apart from the rarest of cases.

I was wrong.

XML is not cool. JSON obviously wins that contest.

Overhead

The first reason for avoiding XML is coding overhead. Just consider these two bits of data:

XML

<news>
  <story>
    <id>1</id>
    <date>25 December 2012</date>
    <title>The greatest story ever told</title>
    <tags>
      <tag>greatest</tag>
    </tags>
    <description>
	  In the very earliest hours of this morning, 
	  the real Santa Claus was photographed flying on his sleigh 
	  pulled by reindeer over the streets of Slough...
    </description>
  </story>
  <story>
    <id>2</id>
    <date>26 December 2012</date>
    <title>The saddest story ever told</title>
    <tags>
	  <tag>saddest</tag>
    </tags>
    <description>
	  It was today revealed by Scotland Yard, that following 
	  investigations of the greatest story ever told yesterday regarding 
	  Santa Claus, that the photographs were, in fact, forgeries.  
	  Officer Blake issued the following...
    </description>
  </story>
</news>		

JSON

{"news": 
  [
	{"id": 1,
	"date": "25 December 2012",
	"title": "The greatest story ever told",
	"tags": ["greatest"],
	"description": "In the very earliest hours of this morning, 
			the real Santa Claus was photographed flying 
			on his sleigh pulled by reindeer over the streets of Slough..."},
	{"id": 2,
	"date": "26 December 2012",
	"tags": ["saddest"],
	"title": "The saddest story ever told",
	"description": "It was today revealed by Scotland Yard, that following 
			investigations of the greatest story ever told yesterday regarding 
			Santa Claus, that the photographs were, in fact, forgeries.  Officer 
			Blake issued the following..."}
  ]
}		
		

Now I admit that even looking at those two bits of data, I do have a pang of irrational nostalgia for the XML!

However, practicalities intervene.

XML

Just to do something like analyse the descriptions of the stories, I would need most of the plumbing from the SAX piece of the cocktaildemo.(1)

Mercifully, I could probably reuse the GenericParser. I’d then have to modify the OrderHandler to access the description of the story (instead of the drink name). With all of that plumbing in place, I could then simply write a method akin to the parseOrder() method of the OrderService which could be something like:

public String[] parseDescriptions(String xmlString){
  parser.parse(new ByteArrayInputStream(xmlString.getBytes()));
  return handler.getNewsDescription();
}

and somewhere have a method which consumes that output and actually does something!

import org.apache.log4j.Logger;

import foo.bar.NewsService;

public class NewsAnalyser{

  private static final Logger log = Logger.getLogger(NewsAnalyser.class);

  private NewsService service = new NewsService();

  public double getNewsRelevance(String xmlString, Integer subjectId){
    double relevance = 0;
    String[] finallyGotTheStoryDescription = service.parseDescriptions(xmlString);
    try{
      //some fancy stuff
    }catch(InevitableException e){
      log.error("Bugger me. Got too fancy.  InevitableException: ", e);
      relevance = -1;
    }
    return relevance;
  }//end method
}//end class
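
And just to underline how much plumbing hides behind that innocuous parseDescriptions() call, here is a rough sketch of what the modified handler alone might look like (the NewsHandler name is hypothetical, standing in for the adapted OrderHandler; the GenericParser and the rest of the service wiring still sit on top of this):

import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

//Hypothetical stand-in for the cocktaildemo OrderHandler:
//collects the text of every <description> element it encounters
public class NewsHandler extends DefaultHandler{

  private List<String> descriptions = new ArrayList<String>();
  private StringBuilder currentText = new StringBuilder();
  private boolean inDescription = false;

  @Override
  public void startElement(String uri, String localName, String qName, Attributes attributes){
    if("description".equals(qName)){
      inDescription = true;
      currentText.setLength(0);
    }
  }//end method

  @Override
  public void characters(char[] ch, int start, int length){
    if(inDescription){
      currentText.append(ch, start, length);
    }
  }//end method

  @Override
  public void endElement(String uri, String localName, String qName){
    if("description".equals(qName)){
      descriptions.add(currentText.toString().trim());
      inDescription = false;
    }
  }//end method

  public String[] getNewsDescription(){
    return descriptions.toArray(new String[descriptions.size()]);
  }//end method
}//end class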

NOW, let’s deal with it in JSON. First off, we do not need any of the 3 plumbing classes we tediously started with for the XML processing. We just go straight to:

import org.apache.log4j.Logger;

import org.codehaus.jettison.json.JSONArray;
import org.codehaus.jettison.json.JSONException;
import org.codehaus.jettison.json.JSONObject;

public class NewsAnalyser{

  private static final Logger log = Logger.getLogger(NewsAnalyser.class);

  public double getNewsRelevance(String jsonString, Integer subjectId){
    double relevance = 0;
    try{
      JSONObject newsJson = new JSONObject(jsonString);
      JSONArray newsStories = newsJson.getJSONArray("news");
      for(int i = 0; i < newsStories.length(); i++){
        String curDescription = newsStories.getJSONObject(i).getString("description");
        try{
          //some fancy stuff
        }catch(InevitableException e){
          log.error("Too fancy. InevitableException: ", e);
        }//end inner try / catch
      }//end for
    }catch(JSONException e){
      log.error("Whoa. JSONException: ", e);
      relevance = -1;
    }//end outer try/catch
    return relevance;
  }//end method
}//end class

Yes, that’s verbose as usual for Java. But at least it’s a sane amount of code!

One point to note: for Java, although the XML processing has no elegance whatsoever, all the classes imported for it come bundled as standard with Java SE. For the JSON processing, however, we had to use a third-party library, in this case Jettison.(2) As for processing JSON vs. XML in other languages:

Language | XML | JSON
Java | SAX, DOM natively(3) since J2EE 1.3 (Sep 2001)(4) | No native processing as of Java EE 6(5)
PHP | Native processing since PHP 4 (May 2000)(6) | Native processing since 5.2.0 (Nov 2006)
Ruby | Native processing via REXML since at least 1.8.6 (Mar 2007) | Processing via the json gem since c. Aug 2009
Python | SAX and DOM natively since v.2.0 (Oct 2000) | Native since v.2.6 (Oct 2008)
Node / Javascript | Decidedly not native, but there are a number of community packages | The fount of all JSON
C# / .NET | Native processing since at least 1.1 (Apr 2003) | Limited native support since 3.5 (Apr 2007); some community alternatives(7)

Lingua Franca

Everything should be serialised as JSON.

The second reason for avoiding XML is that JSON is de facto (and rightly so) the lingua franca to intermediate the multilingual landscape of today’s software development.

In part because of the coding overheads associated with XML, JSON should be used in an SOA. Even if XML processing happens to be easier than JSON in your language, you don’t know a priori what language your clients speak. It is safe to assume that consumers of your API will either be pushing that content to a Javascript front end, or will themselves find JSON easier to process.
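
Putting that into practice on the producing side costs very little. A minimal sketch, again with Jettison (the StatusResponder class and its field names are purely illustrative):

import org.codehaus.jettison.json.JSONException;
import org.codehaus.jettison.json.JSONObject;

public class StatusResponder{

  //Serialise a service result as a JSON body, whatever language the caller speaks
  public String buildResponseBody(int statusCode, long executionTimeMs) throws JSONException{
    JSONObject body = new JSONObject();
    body.put("status", String.valueOf(statusCode));
    body.put("executionTimeMs", Long.valueOf(executionTimeMs));
    return body.toString();  //e.g., {"status":"200","executionTimeMs":2466}
  }//end method
}//end class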

Conclusions:

Much the same as last time:
  • Avoid pain
  • Be smart
  • Use JSON

Notes:

  1. Care of some vintage code from before that fateful day in April 2009
  2. See Stackoverflow for a lively debate about Java JSON libraries. We liked the idea of GSON but it didn’t work out for a number of reasons. A few other JSON libraries had crept into our code as well, but Jettison worked well for us in the end.
  3. “Natively” used here to refer to functionality which is part of the standard libraries which come with the latest OOTB version of the language.
  4. StAX supported natively since Java EE 5 (May 2006)
  5. JSON support scheduled for Java EE 7
  6. PHP 5 supports both tree and event-based XML processing. The SimpleXML tree-based processing offers comparable simplicity to the elegance of json_encode() and json_decode().
  7. I’ve used Newtonsoft with relative ease on another project.

A Healthy Diet

First off, a few assumptions:
  • we used Java
  • we used Spring rather than the Spring-like functionality in Java EE 6
  • we used Tomcat rather than JBoss or Glassfish
  • we built with Maven

Those decisions were all controversial, but they are the facts.

The key point is that bloated *.war files contributed to MarketStack’s demise. The bloat was a contributing cause of our product failure in its own right but, more importantly, it was a symptom of some far more pernicious failings.

Data:

Here is some raw data in tabular form:

Build | *.war Build Size (MB) | Build Time (secs)(6) | *.java LOC(7)
Today(1) | 72.3 | 201 | 26,496
Pre-uima(2) | 57.7 | 107 | 26,283
Pre-tika(3) | 27.3 | 72 | 21,128
Cocktail(4) | 14.3 | 21 | 779
Vendor(5) | 129.0 | 789(8) | 521,936

Plotting those data points (excluding the Vendor data points, which are just too far off the chart):

Observations

I’ve separated the tolerable zone from the warning zone with power functions, chosen simply because they happen to be reasonably good fits for the data rather than because they have some intrinsic connection to the variables. As the data highlight, build size and build time are functions not only of lines of code, but of your dependencies. Sometimes you really need a dependency; Tika was a good example for us. However, having that extra 20-30MB lumbering around all the time is (in hindsight) bad design. Our use of Tika is a perfect case for accepting another moving piece in the SOA, particularly when that block of code was unlikely to ever be modified other than to upgrade to the latest open-source version.
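
To make that concrete: rather than bundling the Tika jars into the WAR, the attachment processing could sit behind its own small service and be called like any other HTTP dependency. A minimal sketch using the Apache HttpClient 4.1 already in our stack (the tika-service host and /extract endpoint are hypothetical):

import java.io.File;
import java.io.IOException;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.FileEntity;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class AttachmentTextExtractor{

  private HttpClient client = new DefaultHttpClient();

  //POST the raw attachment to a hypothetical standalone tika service
  //and hand back the extracted plain text
  public String extractText(File attachment) throws IOException{
    HttpPost post = new HttpPost("http://tika-service:8080/extract");
    post.setEntity(new FileEntity(attachment, "application/octet-stream"));
    HttpResponse response = client.execute(post);
    return EntityUtils.toString(response.getEntity());
  }//end method
}//end class

The cost, of course, is exactly the kind of HTTP overhead catalogued in the SOA logs above; that is the trade-off to weigh.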

Forms of torture

  • Java is painful. Extended build times cause Java to violate the 8th Amendment.
  • Repeated deployment with Tomcat and blowing up your heap is painful. Bloated WAR files prolong that process and force more frequent restarts.
  • Deploying to production servers when you happen to be away from a civilised internet connection violates the European Convention by itself. Excessively large WAR files transform the experience into something like that scene from Taken.

Conclusions:

If you have to use Java:
  • Avoid pain
  • Be smart
  • Use good basic ingredients
  • Keep your WARs on a healthy diet

Notes:

  1. Today: The code base as it stands today for the core web application
  2. Pre-uima: The code base immediately prior to introducing a Scala-based implementation of Apache-uima-like text processing
  3. Pre-tika: The code base immediately prior to incorporating the whizzy-bang attachment processing capability of Apache-tika
  4. Cocktail: The cocktaildemo as a very minimalist reference point for a Spring-based web application which largely mirrors our overall architecture
  5. Vendor: In one of our earliest follies, we purchased a license for some third party software which we thought we could modify for our purposes. Suffice it to say, that vendor’s code exhibits some exemplary anti-patterns.
  6. All build times were calculated on a trusted but somewhat knackered Acer Aspire 4810T with 3GB of RAM running Windows 7
  7. As per CodeAnalyzer for *.java files only, excluding comments and whitespace. Particularly for the Vendor software, looking at *.java files only understates the size of the codebase.
  8. Vendor build time above is with Ant rather than Maven

Java Mistakes 1.0

There were a number of reasons we developed the core of MarketStack in Java.  There are also a number of reasons I would avoid Java where possible.  Leaving that larger debate to one side, some lessons we learned when using Java:

Do not:

1. have production WAR files > ~50MB
2. resist JSON
3. let SOA get out of hand
4. use Roo
5. use Freemarker
6. use multiple libraries for the same thing (e.g., JSON processing)
7. rely upon Hibernate to be efficient

Do:

1. be careful about persistence
2. use Spring responsibly
3. use log4j
4. use SOA
5. have coding conventions
6. use Maven to its fullest potential

I hope to expand upon each of the points above in a bit more detail.

To Blog or Not to Blog?

The answer for the moment is “Yes.”

However, it is a hesitant “Yes.”  I hesitate because I don’t want to do it unless I can say something new.  I don’t want to reiterate the same things said by countless others.  I don’t want to contribute to the often indiscriminate noise of the blogosphere.

Let’s see if I can avoid those pitfalls.