Questions & answers
(For questions not covered here and in Lingo4G reference, please do get in touch.)
Licensing
How is Lingo4G licensed?
We require one Lingo4G license per physical or virtual server that runs Lingo4G binaries, regardless of the number of cores on the server, the number of users and the number of collections handled by the server.
For large-scale or non-typical deployment scenarios, such as OEM distribution, please get in touch.
What is the cost of a Lingo4G license?
The cost of a license depends on the edition. Please contact us for a quote.
Are there any subscription fees?
No. Both one-year and perpetual server licenses are available for a one-time licensing fee.
Can I get a trial license?
Absolutely! Please get in touch for a free evaluation package.
How many collections can I process on one server?
There are no restrictions on the number of Lingo4G instances running on one physical or virtual server. The only limit may be the capacity of the server, including RAM size, disk space and the number of CPUs.
Is the total number of documents in the index limited?
No. Regardless of your Lingo4G edition, there are no license-enforced limits on the total number of documents in the Lingo4G index.
What kind of limits can my Lingo4G license include?
Depending on your Lingo4G edition, your license file may include two limits:
Maximum total size of indexed documents, defined by the max-indexed-content-length attribute of your license file. The limit restricts the maximum total size of the text indexed for the purposes of Lingo4G analysis. Text stored in the index only for literal retrieval is not counted towards the limit.
Maximum number of documents analyzed in one request, defined by the max-documents-in-scope attribute of your license file. The limit defines the maximum size of the subset of your collection Lingo4G will analyze at a time. If the number of documents matching the subset definition query exceeds the limit, Lingo4G will ignore the lowest-scoring documents.
For example, the Medium Server edition will allow you to index all 240k (spanning about 3 GB) questions from SuperUser.com and will cluster or extract topics for at most 50k questions at a time. If you request processing of a subset larger than 50k questions, Lingo4G will analyze the top 50k questions matching the subset definition query.
The above limits are enforced for each Lingo4G instance separately.
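The effect of the max-documents-in-scope limit can be pictured with a minimal Python sketch. This is an illustration only, not Lingo4G code or its API: the function name select_scope, the (id, score) tuples and the sample scores are all hypothetical, and the only behavior taken from the answer above is that the lowest-scoring matches are dropped once the limit is exceeded.

```python
from typing import List, Tuple


def select_scope(matches: List[Tuple[str, float]], max_documents_in_scope: int) -> List[str]:
    """Return the ids of the documents that would remain in scope.

    `matches` holds hypothetical (document id, query score) pairs for all
    documents matching the subset definition query. When there are more
    matches than the limit allows, only the highest-scoring ones are kept.
    """
    # Rank matches from the highest to the lowest score.
    ranked = sorted(matches, key=lambda match: match[1], reverse=True)
    # Keep at most `max_documents_in_scope` documents.
    return [doc_id for doc_id, _score in ranked[:max_documents_in_scope]]


# Four matching documents, but the (made-up) limit is three.
matches = [("q1", 0.2), ("q2", 0.9), ("q3", 0.5), ("q4", 0.7)]
print(select_scope(matches, 3))  # ['q2', 'q4', 'q3']
```

With a limit of 50,000 and a query matching more documents than that, the same logic would leave the 50,000 highest-scoring documents for analysis.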
I have a Lingo3G license, will I receive Lingo4G as an upgrade?
No. Lingo3G and Lingo4G are two separate products we intend to offer and maintain independently. Lingo3G will remain an engine for real-time clustering of small and medium collections, while Lingo4G will address clustering of large data sets. Therefore, Lingo4G is not an upgrade to Lingo3G, but a complementary offering.
Having said that, if you would like to switch from Lingo3G to Lingo4G, we offer a license trade-in option and count the initial Lingo3G license purchase fee towards the Lingo4G license fee.
Evaluation
Can I get a trial license?
Absolutely! Please get in touch for a free evaluation package.
What limitations will my trial license have?
The only limitation of the evaluation license is the 2-month validity period. The license does not restrict the functionality of Lingo4G in any way. All features, tools and APIs of the Unlimited Server edition will be there for you to test.
My trial license expired. Can I have some more time?
Absolutely, please contact us for an extension.
Ordering
Can I pay by credit card?
Unfortunately not. We only accept bank / wire transfers (SWIFT, SEPA). Most payments are processed within 1-2 business days.
To aid the transition, we can issue a temporary license to cover the payment processing period. Additionally, we can issue the final license based on a statement confirming that the wire has been initiated.
Can I get a refund?
No. Instead of refunds, we offer free of charge long-term evaluation licenses, so that you can try our software with your own data and decide if it works for you.
Can I get a discount?
We offer discounts for purchases of larger numbers of licenses. Please get in touch for details. Alternatively, consider Carrot2, which is an open source text clustering engine.
Can I get a formal quote?
Absolutely. Please contact us and specify the products and number of licenses to quote.
Do you accept Purchase Orders?
Yes, feel free to submit a PO in your standard form.
Can I purchase training services?
Unfortunately not. We don't offer training services. Support for Carrot Search software components is based on product documentation, code examples, and direct help through e-mail and teleconferencing systems.
Technical issues
Can I add, update or remove documents from an existing Lingo4G index?
Yes. With Lingo4G version 1.7.0 or later, you can add, update and delete documents without re-indexing the whole collection.
Does Lingo4G come as an Elasticsearch or Solr plugin?
No. Lingo4G is standalone software that manages its own index. The index contains Lingo4G-specific data that is not typically present in indices created by enterprise search engines. Therefore, Lingo4G will have to run in parallel to an Elasticsearch or Solr instance you might already have. As a result, the same content will be stored separately in the two systems.
Is Lingo4G an end-user product?
No. Lingo4G was designed as a software component rather than a complete end-user application. The primary use case for Lingo4G is integration into larger software suites. Therefore, some programming experience will be required to get started and to use Lingo4G.
Having said that, Lingo4G comes with a GUI application, called Lingo4G Explorer, meant to allow rapid experiments and tuning of topic extraction and clustering. If Lingo4G Explorer meets the needs of your end users, feel free to use it. One thing to bear in mind is that our primary focus is on the underlying topic extraction and clustering algorithms. Developing and extending Lingo4G Explorer is a lower priority.
What are the system requirements for Lingo4G?
Lingo4G can run on any platform supporting Java 17 or later. While processing cannot currently be distributed to multiple machines, a high-end workstation with fast SSD storage should be capable of handling collections of several tens of gigabytes. For most data sets not exceeding gigabytes, any computer with 4 GB of memory and some disk space will be sufficient. We strongly recommend using SSD drives to store Lingo4G indices. Please see the Requirements section of the Lingo4G manual for more details.
What does near real-time processing mean?
Near real-time processing means that in most cases Lingo4G will be able to extract topics for a subset of documents in your index within seconds, regardless of the size of the subset. Clustering of tens or hundreds of thousands of documents may take several minutes.
To achieve near real-time performance during analysis, Lingo4G needs to index your collection first. Indexing is a one-time process where the initial text processing (tokenization, label extraction) is applied. On modern hardware, indexing speed is in the order of 200–500 MB per minute.
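As a rough back-of-the-envelope check, the quoted indexing speed translates into an expected indexing time as sketched below. The function name is hypothetical, 1 GB is assumed to be 1000 MB, and the 200–500 MB per minute range is simply taken from the paragraph above, so actual times on your hardware may differ.

```python
def indexing_time_minutes(collection_mb: float,
                          slow_mb_per_min: float = 200.0,
                          fast_mb_per_min: float = 500.0) -> tuple:
    """Return (best case, worst case) indexing time in minutes.

    The default speeds are the 200-500 MB per minute range quoted above.
    """
    best = collection_mb / fast_mb_per_min
    worst = collection_mb / slow_mb_per_min
    return best, worst


# A 50 GB collection, assuming 1 GB = 1000 MB.
best, worst = indexing_time_minutes(50_000)
print(f"{best:.0f} to {worst:.0f} minutes")  # 100 to 250 minutes
```

Under these assumptions, a 50 GB collection would take roughly two to four hours to index once.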
The performance of both indexing and analysis depends heavily on the technology used to store the Lingo4G index. To maximize performance and CPU utilization, use fast SSD drives.
What is the largest collection Lingo4G can handle?
On modern hardware with a high-core-count CPU and fast SSD storage, Lingo4G can handle collections reaching hundreds of gigabytes or a terabyte of text.
If you'd like to test Lingo4G on such a large data set, Lingo4G comes with built-in support for indexing patent grant and application documents available from the US Patent and Trademark Office. The collection is currently about 500 GB of text.
One important factor to consider is that currently Lingo4G does not offer distributed processing. This means that the maximum reasonable size of the project will be limited by the amount of RAM, disk space and processing power available on a single virtual or physical server.
Which languages does Lingo4G support?
Currently, Lingo4G can only process English text. If you'd like to apply Lingo4G to content written in a different language, please contact us.
Can I use the dotAtlas map visualization component in my application?
The dotAtlas map visualization component shipping with Lingo4G Explorer is currently pre-release software. It's been battle-tested for months by early adopters, but lacks a finalized API and documentation.
If you'd like to try integrating dotAtlas into your software, please let us know. We'll be happy to share the pre-release version along with code examples and initial guidance.
We will not charge any extra fees for the pre-release versions of dotAtlas. Once it enters the official product suite, the use of dotAtlas will require a license fee similar to the one that applies for Carrot Search FoamTree.
Use cases
What are the applications of Lingo4G?
The natural use case is exploration of large volumes of human-readable text, such as scientific papers, business or legal documents.
You can combine Lingo4G basic text processing operations to build, for example, the following functionalities:
Recommending tags for a new document based on the tags of the documents existing in the index.
Example-based document search: finding documents similar to the example seed document the user provides, based on keyword or semantic vector similarity.
2D map visualization of documents where semantically-similar documents occupy the same area of the map. Colors of the document marker on the map indicate the cluster to which the document belongs. Salient phrases extracted from documents describe the densely-populated areas of the map.
Finding documents with highly overlapping content, which may suggest the documents are duplicates or plagiarised copies. Lingo4G can perform such a process efficiently for millions of documents at once.
I'd like to cluster search results fetched from public search engines. Is Lingo4G suitable for this?
No, the only engine that fits here is Lingo3G.
Before clustering, Lingo4G needs to access the entire collection of documents in advance to create an index. In case of public search engines, the underlying collection is not usually available as a whole, which rules out Lingo4G*.
Lingo3G, on the other hand, was designed to cluster data in one pass — you'd simply fetch the text of search results from the search engine, feed the text to Lingo3G and receive the search results clustered.
*) One could imagine creating a temporary throwaway index just for the search results received from the search engine, which would make Lingo4G applicable as well. Currently, however, Lingo4G does not specifically support creating temporary indices, which makes the workaround cumbersome. Also, Lingo3G is optimized to perform real-time clustering of small data sets, so clustering of search results in Lingo3G will be much faster.
I have a fast-growing collection of social conversations and I'd like to detect emerging topics in that data. Is Lingo4G or Lingo3G suitable for the task?
While neither Lingo4G nor Lingo3G has a specific feature for emerging topic discovery, they will likely be useful components of such systems.
Currently, Lingo4G's fast incremental indexing option does not discover new phrases introduced by new documents. To allow Lingo4G to pick up new phrases from newly-added conversations, features would have to be periodically re-generated based on the whole collection. This may not be a problem if it is acceptable for the new phrases to enter analysis on, for example, a daily basis.
If it is required for the new content to be considered immediately after it arrives, the only applicable engine is currently Lingo3G, possibly coupled with Apache Solr or Elasticsearch.
I have a 50 GB collection of patent applications. I want to extract topics for up to 1000-document subsets of that collection. Which engine is better in this case?
Both Lingo4G and Lingo3G are suitable in this case.
Lingo4G will create an index for your collection, which will give you the ability not only to extract topics, but also to search your collection, perform conventional document clustering, produce 2D document maps and complete more advanced text mining tasks, such as classification.
In case of Lingo3G, the easiest way to process your collection would be to index it in, for example, Apache Solr or Elasticsearch. Then, depending on the architecture of your software, you would either fetch documents from your search engine and feed them to Lingo3G Java/C# API for clustering, or have your search results clustered directly inside the search engine through Solr or Elasticsearch plugins.
I have a collection of 1 million research papers and I'd like to extract topics for up to 200k-document subsets at once. Which engine can handle that task?
Only Lingo4G can handle topic extraction for such large subsets of your collection. Its on-disk index makes it possible to analyze hundreds of thousands of documents within a limited memory footprint.
Lingo3G with its in-memory processing model is not suitable for inputs exceeding 10 MB of text.
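The rule of thumb from the last few answers can be sketched as a small helper. This is only an illustration: the function and constant names are hypothetical, "10 MB" is assumed to mean 10,000,000 bytes of UTF-8 text, and the check ignores other factors discussed above, such as whether the whole collection is available for indexing in advance.

```python
from typing import Iterable, List

# "10 MB of text" from the answer above, interpreted (as an assumption)
# as 10,000,000 bytes of UTF-8 encoded text.
LINGO3G_MAX_INPUT_BYTES = 10 * 1000 * 1000


def candidate_engines(subset_texts: Iterable[str]) -> List[str]:
    """Suggest engines for a subset, based only on its total text size."""
    total_bytes = sum(len(text.encode("utf-8")) for text in subset_texts)
    if total_bytes > LINGO3G_MAX_INPUT_BYTES:
        # Too large for in-memory processing; only the on-disk index fits.
        return ["Lingo4G"]
    return ["Lingo4G", "Lingo3G"]


small_subset = ["a short abstract"] * 1000
large_subset = ["x" * 6_000_000, "y" * 6_000_000]
print(candidate_engines(small_subset))  # ['Lingo4G', 'Lingo3G']
print(candidate_engines(large_subset))  # ['Lingo4G']
```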