No, they don't. The dataset is only ~500 images. It would be interesting to collect a larger dataset to see if they can capture language as well.
Fascinating! Especially given the difficulty of the domain. It seems the network learned the global form and some local graphemic structures but didn't learn how to connect graphemes into words. How much data do you think the model needs to make the produced text intelligible?