Can errors in specimen or observation point data be detected by comparison with verbal localities?
I wish that I could use localities mentioned in text from EOL distribution chapters, and localities and taxon names from BHL pages, to obtain the following data: verbal localities for species occurrences, indexed by a gazetteer to coordinate localities, in order to answer the following biological research questions:
Can errors in specimen or observation point data be detected by comparison with verbal localities? What are the patterns of different error types (sign errors, transposition of coordinates, transposition of digits...)?
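The error types listed above could be tested by "undoing" each one and checking whether the corrected point then agrees with the coordinate that a gazetteer gives for the verbal locality. The following is only a minimal sketch of that idea, not an established method. All function names are hypothetical, the 50 km tolerance is an arbitrary assumption, and the reference coordinate is assumed to come from a gazetteer lookup that has already been done. Combined errors (for example a sign error plus a swap) and digit swaps across the decimal point are not covered.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def digit_transpositions(value, decimals=2):
    """Return the values obtained by swapping each pair of adjacent digits."""
    text = f"{abs(value):.{decimals}f}"
    sign = -1.0 if value < 0 else 1.0
    out = []
    for i in range(len(text) - 1):
        a, b = text[i], text[i + 1]
        # Only swap two neighbouring digits; pairs that include the decimal point are skipped.
        if a.isdigit() and b.isdigit() and a != b:
            out.append(sign * float(text[:i] + b + a + text[i + 2:]))
    return out

def candidate_corrections(lat, lon):
    """Yield (error_type, lat, lon) for each single-error hypothesis."""
    yield "sign_lat", -lat, lon
    yield "sign_lon", lat, -lon
    yield "sign_both", -lat, -lon
    yield "swapped", lon, lat
    for fixed in digit_transpositions(lat):
        yield "digit_transposition_lat", fixed, lon
    for fixed in digit_transpositions(lon):
        yield "digit_transposition_lon", lat, fixed

def classify_error(lat, lon, ref_lat, ref_lon, tolerance_km=50.0):
    """Label a record by the first hypothesis that brings it within tolerance
    of the gazetteer coordinate (ref_lat, ref_lon) of its verbal locality."""
    if haversine_km(lat, lon, ref_lat, ref_lon) <= tolerance_km:
        return "consistent"
    for name, clat, clon in candidate_corrections(lat, lon):
        # Discard hypotheses that produce impossible coordinates.
        if abs(clat) > 90 or abs(clon) > 180:
            continue
        if haversine_km(clat, clon, ref_lat, ref_lon) <= tolerance_km:
            return name
    return "unexplained"
```

For a verbal locality that the gazetteer places at roughly (-1.29, 36.82), a record stored as (36.82, -1.29) would be labelled "swapped" and one stored as (1.29, 36.82) would be labelled "sign_lat". Tallying these labels over a test dataset would give a first picture of the pattern of error types. Localities near the equator or the prime meridian need care, because a sign error there may fall inside the tolerance and go undetected.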
One major problem to solve here is finding the present-day equivalents of old toponyms. Some online databases could help (e.g. http://www.geonames.org/). The main issue is whether such databases are scientifically reliable and whether they can be cited.
From the author: In addition to the EOL and BHL data, a test dataset of points would be needed. Any large museum collection, an aggregator such as GBIF or OBIS, or a direct observation reporting platform would do. To assess the accuracy of the error detection, it would need to be possible to refer to the source of the data upstream of the transcription errors (specimen labels, original reporters of the occurrence). In addition, a good gazetteer would be needed to provide the basis for text searches for the verbal localities. The test dataset should be selected based on the availability of such a gazetteer for the region of interest. OCR quality could prove to be a challenge for BHL text, so near matches should possibly be considered.
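Near matching of OCR text against gazetteer names could be prototyped with approximate string comparison. The sketch below uses `difflib.get_close_matches` from the Python standard library. The function name `match_toponym`, the tiny gazetteer list, and the 0.8 similarity cutoff are assumptions made for illustration only; a real gazetteer would also need handling of multi-word names, alternate spellings, and historical toponyms.

```python
import difflib

def match_toponym(ocr_token, gazetteer_names, cutoff=0.8, max_hits=3):
    """Return gazetteer names that closely resemble a possibly misread OCR token."""
    # Compare case-insensitively, but return the gazetteer's original spelling.
    lookup = {name.casefold(): name for name in gazetteer_names}
    hits = difflib.get_close_matches(
        ocr_token.casefold(), list(lookup), n=max_hits, cutoff=cutoff
    )
    return [lookup[h] for h in hits]

# Hypothetical gazetteer entries for illustration.
gazetteer = ["Nairobi", "Mombasa", "Kisumu"]
candidates = match_toponym("Nairobl", gazetteer)  # "i" misread as "l"
```

A lower cutoff would recover more badly damaged names at the cost of more false matches, so the threshold would need tuning against text whose correct reading is known.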