Lingo4G or Lingo3G?
Use Lingo4G for collections of any size available in advance for indexing.
Lingo3G is for small and medium amounts of ad-hoc data.
|Primary use case|
Analysis of small and large sets of documents where the collection as a whole can be accessed for indexing.
Good examples of such collections are the PubMed medical articles or Wikipedia data dumps, which can be downloaded for off-line processing.
Clustering of small and medium collections of ad-hoc, previously unseen textual content.
A typical example of such content is search results from public search engines. While specific search results can be accessed on demand, the complete content of all documents in the collection is not available.
Stateful. Lingo4G first needs to create a persistent index of all your documents. Once the index is created, you can request Lingo4G to analyze the whole indexed collection or different subsets of it. Including previously unseen documents in analyses requires re-indexing.
The split into two phases allows Lingo4G to perform the expensive operations, such as tokenization of documents, only once during indexing. The index then makes it possible to analyze the whole collection or any subset of it in near-real-time.
Stateless. Lingo3G performs processing in one step: all documents provided on input are processed immediately and then disposed of.
The stateless paradigm makes it possible to process arbitrary and previously unseen documents. The cost is that the time-consuming operations, such as tokenization, will be repeated for each set of documents being processed.
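The trade-off between the two processing models can be sketched with a toy example. This is not the Lingo4G or Lingo3G API: `tokenize`, `ToyIndex` and `stateless_term_counts` are hypothetical names, and whitespace splitting merely stands in for the expensive per-document operations.

```python
from collections import Counter


def tokenize(text):
    # Stand-in for the expensive per-document work (hypothetical, not a Lingo API).
    return text.lower().split()


class ToyIndex:
    """Stateful model: every document is tokenized once, when the index is built."""

    def __init__(self, documents):
        # documents: dict mapping a document id to its raw text
        self.tokens = {doc_id: tokenize(text) for doc_id, text in documents.items()}

    def term_counts(self, doc_ids):
        # Analyzing any subset reuses the stored tokens; nothing is re-tokenized.
        counts = Counter()
        for doc_id in doc_ids:
            counts.update(self.tokens[doc_id])
        return counts


def stateless_term_counts(texts):
    # Stateless model: accepts arbitrary, previously unseen texts,
    # but repeats the tokenization on every call.
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts
```

In this sketch, adding a new document to `ToyIndex` means rebuilding it, while `stateless_term_counts` takes any text at any time and pays the tokenization cost on each request.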
|Maximum size of data analyzed at once|
Millions of documents. The practical limit for the total size of the analyzed collection is about 100 GB of text.
Thousands of documents. The practical limit for Lingo3G's in-memory processing model is about 10 MB of text.
Dependent on the specific analysis task. Lingo4G does not need to keep the text being analyzed in memory.
For example, the memory requirements for label clustering depend on the number of selected labels, not the number of documents in scope. Therefore, label clustering can be applied to gigabytes of text and hundreds of thousands of documents while keeping memory requirements limited.
Increasing with the number and size of documents. Lingo3G keeps the input documents and their internal representation in memory for the duration of processing.
To ensure high-speed real-time processing, Lingo3G's internal representation of documents takes several times the size of the raw input text. Therefore, Lingo3G is not recommended for clustering more than about 10 MB of text at once.
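A quick way to see whether a batch of documents stays within that guideline is to add up the size of the raw text before sending it for clustering. The sketch below is an illustration only: the helper names are hypothetical, the limit is the approximate 10 MB figure quoted above rather than a hard cap enforced by Lingo3G, and size is measured as UTF-8 bytes by assumption.

```python
# Approximate guideline from the text above, not a limit enforced by Lingo3G.
LINGO3G_TEXT_LIMIT_BYTES = 10 * 1024 * 1024


def total_text_size(texts):
    # Size of the raw text in bytes, assuming UTF-8 encoding.
    return sum(len(text.encode("utf-8")) for text in texts)


def fits_in_memory_guideline(texts, limit=LINGO3G_TEXT_LIMIT_BYTES):
    # True when the batch is at or below the suggested amount of text.
    return total_text_size(texts) <= limit
```

A batch that fails this check is a candidate for splitting into smaller requests or for indexing in Lingo4G instead.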
Higher-level problems, such as finding content-wise similar documents or k-nearest-neighbors classification, can be solved using the base result facets.
Lingo4G can also highlight occurrences of labels in the original text of documents.
Lingo3G results are conceptually similar to Lingo4G topic extraction. Conventional document clustering is not available in Lingo3G.
|APIs and integration|
Lingo4G comes with a standalone server application that exposes an HTTP/REST API. The server can run in any environment supporting Java 8 or later.
Questions and answers
I have a 50 GB collection of patent applications. I want to extract topics for up to 1000-document subsets of that collection. Which engine is better in this case?
Both Lingo4G and Lingo3G are suitable in this case.
Lingo4G will create an index for your collection, which will give you the ability not only to extract topics, but also to search your collection, perform conventional document clustering and complete more advanced text mining tasks, such as classification.
In the case of Lingo3G, the easiest way to process your collection will be to index it in, for example, Apache Solr or Elasticsearch. Then, depending on the architecture of your software, you would either fetch documents from your search engine and feed them to the Lingo3G Java/C# API for clustering, or have your search results clustered directly inside the search engine through Solr or Elasticsearch plugins.
I have a fast-growing collection of social conversations and I'd like to detect emerging topics in that data. Is Lingo4G or Lingo3G suitable for the task?
While neither Lingo4G nor Lingo3G has a specific feature for emerging topic discovery, they will likely be useful components of such systems.
Currently, Lingo4G does not support incremental indexing, so adding new conversations to the index will require re-indexing of the whole collection. This may not be a problem if it is acceptable for the new data to enter analysis on, for example, a daily basis.
If it is required for the new content to be considered immediately after it arrives, the only applicable engine is currently Lingo3G, possibly coupled with Apache Solr or Elasticsearch.
I have a collection of 1 million research papers and I'd like to extract topics for up to 200k-document subsets at once. Which engine can handle that task?
Only Lingo4G can handle topic extraction for such large subsets of your collection. Its on-disk index makes it possible to analyze hundreds of thousands of documents with a limited memory footprint.
Lingo3G with its in-memory processing model is not suitable for inputs exceeding 10 MB of text.
I'd like to cluster search results fetched from public search engines. Is Lingo4G suitable for this?
No, the only engine that fits here is Lingo3G.
Before clustering, Lingo4G needs to access the entire collection of documents in advance to create an index. In the case of public search engines, the underlying collection is usually not available as a whole, which rules out Lingo4G*.
Lingo3G, on the other hand, was designed to cluster data in one pass: you'd simply fetch the text of search results from the search engine, feed the text to Lingo3G and receive the search results clustered.

*) One could imagine creating a temporary throwaway index just for the search results received from the search engine, which would make Lingo4G applicable as well. Currently, however, Lingo4G does not specifically support creating temporary indices, which makes the workaround cumbersome. Also, Lingo3G is optimized to perform real-time clustering of small data sets, so clustering of search results in Lingo3G will be much faster.