
Camtology is a joint venture between the companies imense and iLexIR and academics from the Universities of Cambridge and Birmingham. It is led by Mike Hobson, a cosmologist at the Cavendish Laboratory. Part of its funding comes from the PIPSS programme, which has funded projects with both imense and iLexIR, with Andy Parker as the PI.

Camtology uses the grid to provide massive computing power for training its search algorithms. The PIPSS projects with imense have demonstrated the commercial potential of grid computing. The company uses its proprietary software to search the content of images, based on classifiers. These classifiers were trained on a large corpus of images, and provided sufficient proof of concept for the company to raise its first venture capital. The second project has scaled up the processing to a much larger corpus of images.

The project with iLexIR aims to build intelligent search services for science, based on context-dependent searches within publications. The project is in collaboration with imense, which means that we can search documents for both text and images. In addition, we have started discussing a collaboration with the CERN library to make the technology available for CDS, Spires and arXiv.

One important tool in text processing is a database of N-grams. These are sub-sequences of N consecutive elements taken from a text, and their statistical properties are used to understand the text. The quality of the N-gram database is crucial to the success of the method. The largest N-gram corpus is currently held by Google.
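
As an illustration of what an N-gram database contains, the following is a minimal sketch that counts word-level N-grams in a short string. It assumes that the elements are whitespace-separated, lower-cased words; the function names and the sample sentence are hypothetical and are not taken from the company's pipeline.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous n-token sub-sequence as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(text, n):
    """Count the word-level N-grams in a text.

    Assumption: tokens are lower-cased, whitespace-separated words.
    A real pipeline would need a more careful tokeniser.
    """
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n))


# Hypothetical sample text, used only to show the shape of the data.
counts = count_ngrams("the grid runs the jobs and the grid stores the data", 2)
print(counts.most_common(3))
```

The counts are the raw material for the statistical properties mentioned above. Because counters from separate texts can simply be added together, a corpus of this kind can be built up piece by piece.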

We would like to process free text from the web. Ideally, to do this, we would have an N-gram corpus of all the free text available on the web (in English). We are testing the company's text processing pipeline, and it looks feasible to use the grid to gather an N-gram corpus that would rival or exceed Google's. In order to achieve this, we will submit jobs to the grid which will follow links to free text sources and process them through the analysis pipeline to derive the N-gram data. The data will then be shipped to the Cambridge SE for analysis. Each job will inspect enough links to run for a few hours, in order to make efficient use of the CPU. The production runs will be managed using Ganga, by Karl Harrison and Mark Slater at Birmingham. The VO has very few members, and only performs managed production runs on the grid. Final analysis is performed on Camtology resources at Cambridge.
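
The sizing of jobs described above can be sketched as follows. This is only an illustration: it assumes a fixed average processing time per link, and the function name, the example URLs and the timing figures are all hypothetical. It does not use the Ganga API and does not describe how the production runs are actually configured.

```python
def batch_links(links, seconds_per_link, target_hours=3.0):
    """Group links into jobs sized to run for roughly target_hours each.

    Assumption: every link takes about seconds_per_link to fetch and
    process. In practice this figure would have to be measured.
    """
    per_job = max(1, int(target_hours * 3600 / seconds_per_link))
    return [links[i:i + per_job] for i in range(0, len(links), per_job)]


# Hypothetical input: ten links, 30 minutes each, two-hour jobs.
links = [f"http://example.org/doc{i}" for i in range(10)]
jobs = batch_links(links, seconds_per_link=1800, target_hours=2.0)
for number, job in enumerate(jobs):
    print(number, len(job))
```

Each batch would correspond to one grid job, whose N-gram output is shipped back for merging.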

Page maintained by Andy Parker.