|
Every word from 4488 magazines!
The magazine scans came from a multitude of sources, possibly even including random TOSEC downloads of my own. My own collection, which focuses on Norwegian magazines, is included as well.
Every word that could be detected by running OCR on each PDF was extracted and indexed, with duplicate words on the same page removed. The database lets you search by magazine name, source (PDF filename), word on page, and page number.
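To illustrate the per-page indexing described above, here is a minimal sketch in Python. It is not the actual tool used to build the database: the function name `index_page`, the row layout and the lowercasing of words are all assumptions made for the example.

```python
import re

def index_page(magazine, source, page_number, ocr_text):
    """Return one row per distinct word found on a single page.

    Hypothetical helper: the row layout (magazine, source, page, word)
    mirrors the searchable fields, but the real schema is not known.
    """
    seen = set()
    rows = []
    # Lowercasing before comparing is an assumption, so that "Game" and
    # "game" count as the same word on a page.
    for word in re.findall(r"\w+", ocr_text.lower()):
        if word not in seen:  # drop duplicates within the same page
            seen.add(word)
            rows.append((magazine, source, page_number, word))
    return rows

rows = index_page("Example Mag", "example_01.pdf", 3, "The game the GAME rocks")
words = [row[3] for row in rows]
```

Here `words` holds each word once, in the order it first appeared on the page.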
Note that the database comes in two variations, both searchable and viewable: single words on a page, and 4-word sentence-like phrases taken as-is from the page. The word or 4-word phrase returned by a search hit appears on the page indicated, so if you have the magazine yourself (either on physical paper or as a PDF), you can naturally read the rest there :-)
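As a rough idea of what a 4-word entry could look like, the sketch below cuts the OCR text of a page into consecutive groups of four words, kept in their original order. This is only a guess at the approach: whether the real database uses non-overlapping groups like this or some other scheme is an assumption, and `four_word_phrases` is a made-up name.

```python
def four_word_phrases(ocr_text, size=4):
    """Split page text into consecutive, non-overlapping groups of words.

    Hypothetical helper: non-overlapping grouping is an assumption, not
    a description of how the actual database was built.
    """
    words = ocr_text.split()
    # The last group may be shorter if the word count is not a multiple of size.
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

phrases = four_word_phrases("press fire to start the game now")
```

The words are kept as-is (no lowercasing, no duplicate removal), so a phrase reads the way it does on the page.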
Because PDF scan quality, OCR detection success, and the actual physical paper, artwork and font design all vary, OCR does not always give out every word a human eye would see. OCR is what it is and does the job as best it can. Words may come out garbled, malformed or even missing, so my tip is to enable PARTIAL SEARCH in the search engine to get a wider hit ratio. That will naturally produce a much larger set of search hits. You have been warned :-)
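To show why partial search helps with garbled OCR output, here is a small sketch using an in-memory SQLite table. The real search engine and its schema are not documented here, so the table `words`, its columns, the sample rows and the `search` function are all assumptions for illustration only.

```python
import sqlite3

# Hypothetical schema and sample data, not the real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE words (magazine TEXT, source TEXT, page INTEGER, word TEXT)"
)
conn.executemany(
    "INSERT INTO words VALUES (?, ?, ?, ?)",
    [
        ("Example Mag", "example_01.pdf", 12, "commodore"),
        ("Example Mag", "example_01.pdf", 40, "comrnodore"),  # OCR read "m" as "rn"
        ("Example Mag", "example_02.pdf", 7, "amiga"),
    ],
)

def search(conn, term, partial=False):
    """Return (magazine, source, page, word) rows matching the term."""
    if partial:
        # Substring match: wider net, more hits.
        condition, param = "word LIKE ?", f"%{term}%"
    else:
        # Exact match: misses garbled spellings.
        condition, param = "word = ?", term
    sql = f"SELECT magazine, source, page, word FROM words WHERE {condition} ORDER BY source, page"
    return conn.execute(sql, (param,)).fetchall()

exact_hits = search(conn, "commodore")
partial_hits = search(conn, "odore", partial=True)
```

The exact search only finds the cleanly recognised word, while the partial search on a fragment also picks up the garbled spelling, at the cost of returning more rows.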
Statistics - Single Word Database:
--------------------------------------------
Locations = 4488 (0.00mill)
Rows = 97931839 (97.93mill)
Total Words = 195912654 (195.91mill)
Statistics - 4-word (Sentence) Database:
--------------------------------------------
Locations = 4477 (0.00mill)
Rows = 38766775 (38.77mill)
Total Words = 193882737 (193.88mill)
|