How to make Lucene match all words in query?

I am using Lucene to allow a user to search for words in a large number of documents. Lucene seems to default to returning all documents containing any of the words entered. Is it possible to change this behaviour? I know that '+' can be used to force a term to be included, but I would like to make that the default action. Ideally I would like functionality similar to Google's: '-' to exclude words and "abc xyz" to group words. Just to clarify, I also thought of inserting '+' into all spaces in

Lucene How to define a boost factor to each term in each document during indexing?

I want to insert another score factor in Lucene's similarity equation. The problem is that I can't just override the Similarity class, as it is unaware of the document and terms it is computing scores for. For example, in a document with the text below: The cat is in the top of the tree, and he is going to stay there. I have an algorithm of my own that assigns each of the terms in this document a score reflecting how important that term is to the document as a whole. A possible score

What are the limitations of boolean query in Lucene?

I have a requirement to find items in a Lucene index that meet two basic criteria: 1. match a specific string called a 'relation' 2. fall within a list of entitlement 'grant groups' An entitlement group defines a subset of items accessible by a member of that group and is much like an authorization role. All documents in the Lucene index have the 'relation' field and, for simplicity's sake, one or more 'grant-group' fields. So, for example, a user may search for 'foobar' and that user may be

Lucene Any way to merge two queries in solr?

In my project, we use Solr to index many different kinds of documents, for example Books and Persons, with some common fields (like the name) and some type-specific fields (like the category, or the group people belong to). We would like to run queries that can find both books and persons, with some filters applied for each document type. Something like: find all Books and Persons with "Jean" in the name and/or content but only Books from category "fiction" and "fantasy" and only Persons fro

Part-of-Speech Tagging with Lucene

I'm building an emotion recognition system for chat applications. In that the core part is finding the verb in the user entered text, which can be done with a part-of-speech tagger. Is it possible to build a part-of-speech tagger with Lucene? If not, what is a good open-source/libre software package or system I can use?

Lucene.net how to order and query KeyValuePair data type

I'm trying to find a solution for indexing and querying the following model: a Student has many Lessons, and for each Lesson there is one grade. However, Lesson-Grade is a key-value pair. My first question is: how should I index that key-value data in Lucene? (It's like a coordinate, maybe spatial?) The second one is: assume I have indexed data in Lucene. How can I query Students by Lesson name, but ordered by Grade? Student | Lesson | Grade ---------------------------- John | Math

Using Apache Lucene with Infinispan

Does using Infinispan with Lucene improve the performance of Lucene? There is a RAM Directory included in Lucene itself. Is Infinispan better than RAM Directory?

Lucene Analyzer

I have worked with Lucene for indexing documents and providing search among them, but only in English. Now I have a project in Kurdish, which uses some Arabic Unicode characters and several other characters (here is a table of Unicode characters used in Kurdish-Arabic script). My question is how to create an Analyzer for this language, or can I use the Arabic Analyzer for this purpose?

Lucene Lucens best way to do "starts-with" queries

I want to be able to do the following types of queries: The data to index consists of (let's say) music videos where only the title is interesting. I simply want to index these and then create queries for them such that, whatever word or words the user used in the query, the documents containing those words, in that order, at the beginning of the title will be returned first, followed (in no particular order) by documents containing at least one of the searched words in any position of the titl

Lucene Standard analyzer that doesn't tokenize some words/patterns

So, if suppose there is a line like this: > Mar 14 20:22:41 subdomain.mydomain.colo postfix/smtpd[16862]: NOQUEUE: > reject: RCPT from unknown[1.2.3.4]: 450 4.7.1 Client host rejected: > cannot find your reverse hostname, [5.6.7.8]; from=<erp@misms.net.in> > to=<a@domain1.com> proto=ESMTP helo=<a.domain.net> also > from=<> There are few problems with using standard tokenizer. If I have standard tokenizer, I can't search for from=<>. To do this

avoid indexing documents again Lucene

Each time I run my program in Eclipse, it indexes the documents again. However, I want to index them just once. Perhaps I could delete the index after each use, but I don't know how to go about doing that.

Lucene Boosting field prefix match in Elasticsearch

Is there a way to boost the scores of prefix field matches over term matches later in the field? Most of the Elasticsearch/Lucene documentation seems to focus on terms rather than fields. For example, when searching for femal*, I'd like to have Female rank higher than Microscopic examination of specimen from female. Is there a way to do this on the query side or would I need to do something like create a separate field consisting of the first word?

Elasticsearch Why is queryWeight included for some result scores, but not others, in the same query?

I'm executing a query_string query with one term on multiple fields, _all and tags.name, and trying to understand the scoring. Query: {"query":{"query_string":{"query":"animal","fields":["_all","tags.name"]}}}. Here are the documents returned by the query: Document 1 has an exact match on tags.name, but not on _all. Document 8 has an exact match on both tags.name and on _all. Document 8 should win, and it does, but I'm confused by how the scoring works out. It seems like Document 1 is gettin

Lucene: Difference Field vs. Zone

Hopefully someone can help me shed light on a question: What is the difference between a field and a zone in lucene? I recently read about zones in a book about Information retrieval (here). In that book a zone was described as following: "Zones are similar to fields, except the contents of a zone can be arbitrary free text." Can't a field also have arbitrary free text? On the lucene site (here) the following definition can be found: "Zones are a separate, non-fielded, word list w

Lucene Finding similar documents with Elasticsearch

I'm using ElasticSearch to develop a service that will store uploaded files or web pages as attachments (the file is one field in the document). This part works fine, as I can search these files using like_text as input. However, the second part of this service should compare the file that was just uploaded with the existing files in order to find duplicates or very similar files, so it doesn't recommend the same files or web pages to users. The problem is that I can't get expected results for documents that a

Lucene - Better ways to store text or index

Basically I am a C# developer, but in one of my projects I need to implement Lucene search. In short, it is a chat application and I need to find a specific word used by any of the users. I have now successfully integrated Lucene.Net into my project. My question is: what is the best way to store text or create the index? Is it better to have one text field (Lucene index) with 5000 words, or 500 fields (Lucene index) with 10 words each? Sorry if this is the wrong terminology but I really

How to use Lucene FastVectorHighlighter on multiple fields?

I've got a basic search working, and I'm highlighting using FastVectorHighlighter. When you ask the highlighter for a "best fragment" you have a few overloads of getBestFragment(s) to choose from, documented here. I'm now using the simplest one, like this: highlightedText = highlighter.getBestFragment(fieldQuery, searcher.getIndexReader(), scoreDoc.doc, "description", 100) So I'm highlighting the match from the "description" field. My query however searches another field, "notes". H

Lucene query in Liferay

I am looking for how to create a lucene query for the condition – className AND (“Apple-Orange” OR “Apple Banana” OR “Apple Shake”) I tried BooleanQuery specialityQuery = BooleanQueryFactoryUtil.create(searchContext); specialityQuery.setQueryConfig(searchContext.getQueryConfig()); specialityQuery.add(contextQuery, BooleanClauseOccur.MUST); BooleanQuery idFilter = BooleanQueryFactoryUtil.create(searchContext); for (String speciality : specialities) { TermQuery termQuery = TermQueryFacto

Lucene Empty bool clause causing zero results being returned in elasticsearch

I wrote an API in ruby that helps clients build the elasticsearch query dsl. The query that is built up contains an empty bool as shown below which is causing problems. With the bool being empty like that, it's causing 0 results to be returned. If I remove the bool I get the expected result. How can I turn this into a match_all without removing that bool? I need to leave the bool there until the next release where I can remove it. If I add a must in the bool by default with a match_all in there,

Lucene Read TermVector of a specific document

Is there a way to read the term vector of a document along with the positions of each term? During the creation of the index I am enabling the positions, freq etc FieldType fieldType = new FieldType(); fieldType.setStoreTermVectors(true); fieldType.setStoreTermVectorPositions(true); fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); fieldType.setStored(true); while reading the search index, I am getting the Termvector usi

Elasticsearch Find matches in text using Elasticsearch

I have an Elasticsearch index of words and word pairs, like: python ruby ruby on rails NLP Javascript Agoraphobia ... And an input text, like: Both Python and Ruby (or Ruby on Rails) could be used for NLP purposes. What I need is to find direct matches of entries from the index in the text. So output should look like: python ruby ruby on rails What is the way to compare the whole index against the text using Elasticsearch?

Elasticsearch Field collapsing on collection

Suppose I have a very simple index of blog posts and blog categories. One blog post belongs to one or more categories. I want to find the last 3 posts for each category. How can I do this? I've read about "Field collapsing" here https://www.elastic.co/guide/en/elasticsearch/guide/current/top-hits.html but the example refers to a scalar field, whereas I have a collection. A document could be: { "title" : "My post", "categories" : [{ "tech" => "Technology", "professional" => "Professional"] }, { "title" :

Lucene Sitecore 8.1 : sitecore_fxm_web_index - Root item could not be found

We are using Sitecore 8.1 with Lucene indexes and xDB disabled. We noticed that the CMS CA is quite slow. While looking at the logs, we noticed a number of errors, shown below: ManagedPoolThread #4 2015:12:18 10:17:05 ERROR [Index=sitecore_fxm_web_index, Crawler=SitecoreItemCrawler, Database=web] Root item could not be found: /sitecore/system/Marketing Control Panel/fxm/. ManagedPoolThread #15 2015:12:18 10:17:08 ERROR Exception Exception: System.Reflection.TargetInvocationException Message: Exce

Elasticsearch aws cloudsearch/lucene query street names

I uploaded a dataset of addresses to AWS cloudsearch and would need to be able to query the street names in a flexible way: dataset value: { street: "Michael-Bayerhammer-Strasse" } All of the following queries should result in a match: Michael-Gundringer-Strasse Michael-Gundringerstr. Michael-Gundringer-Str. Michael-GundringerStr. etc. I could not find a way to achieve this. Is there a way to do this with cloudsearch/lucene or any other tools? You can test it with my cloudsearch url: This

Elasticsearch Return only exact matches (substrings) in full text search (elasticsearch)

I have an index in elasticsearch with a 'title' field (analyzed string field). If I have the following documents indexed: {title: "Joe Dirt"} {title: "Meet Joe Black"} {title: "Tomorrow Never Dies"} and the search query is "I want to watch the movie Joe Dirt tomorrow" I want to find results where the full title matches as a substring of the search query. If I use a straight match query, all of these documents will be returned because they all match one of the words. I really just want to re

Lucene Cloudant: How to perform wildcard searches on text fields

I have a db in Cloudant that looks like this: count word 4 a 1 a boy 1 a boy goes. I want to run a query like this: word: *boy*. How do I do this in Cloudant? I tried the following, but it did not work: { "selector": { "word": "*boy*" }, "fields": [ ], "sort": [ { "_id": "asc" } ] }
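
As a rough illustration of one possible direction, the sketch below builds a Cloudant Query selector that uses the `$regex` condition operator instead of shell-style wildcards. The helper name `contains_selector` is hypothetical, the fragment is passed through as a regular expression without escaping, and the sketch assumes that scanning documents with a regex is acceptable for this data size, since `$regex` conditions generally cannot be served from an index.

```python
import json


def contains_selector(field: str, fragment: str) -> dict:
    """Build a Cloudant Query body matching documents whose `field` contains `fragment`.

    Hypothetical helper: `fragment` is used as a regular expression as-is,
    so callers should escape any regex metacharacters themselves.
    """
    return {
        "selector": {field: {"$regex": fragment}},
        "fields": [],
        "sort": [{"_id": "asc"}],
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent to the database's _find endpoint.
    print(json.dumps(contains_selector("word", "boy"), indent=2))
```

The only change relative to the selector in the question is that the `word` condition becomes an object with a `$regex` key rather than a literal string containing asterisks.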

Elasticsearch Optional fields in elasticsearch

Suppose only 10 out of 1000 documents have a field called limitedEdition, would it add some sort of overhead to the other 990 documents that don't have any values for that field limitedEdition? Would those documents end up having a null value/reference in the elasticsearch indexing, kind of like adding a nullable column in sql? {_id:1,category:[4],feature:[1,2]}, {_id:2,category:[5],feature:[3,5]}, {_id:3,category:[7],feature:[2,4]}, ..... {_id:10,category:[5],limitedEdition:1000} The indexa

Elasticsearch Map multiple words to single word in Lucene SynonymGraphFilter

I'm using lucene 6.4.0. When I map dns to domain name system, I can get the correct query. But when I try to map domain name system to dns, I can't get the correct query. I set parser.setSplitOnWhitespace(false). public class SynonymAnalyzer extends Analyzer { @Override protected TokenStreamComponents createComponents(String s, Reader reader) { SynonymMap synonymMap = null; SynonymMap.Builder builder=null; try { addTo(builder,new String[]{"dns"},new S

Elasticsearch Elastic search vs lucene

I know the difference between Elastic Search and Lucene as mentioned here: What is the difference between Lucene and Elasticsearch. Apart from the scalability, fault tolerance and distributed nature of Elastic Search, what is the core difference between the two? Does Elastic Search provide any search feature which is better than Lucene?

Elasticsearch Elasticsearch Lucene query syntax to add and subtract N minutes with time field?

I am working on a Grafana dashboard in which I pass the start time and end time from one dashboard to another using template variables. This is how I pass the values: var-startTime=2020-07-23T05:07:04Z&var-endTime=2020-07-23T05:11:31Z In the other dashboard, I get the variable values and pass them to a Lucene query like @timestamp:[$startTime TO $endTime] It's working fine. But here I want to get data from 15 minutes before the start time to 15 minutes after the end time. How could I add and subtract
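
The arithmetic being asked for can be sketched outside the query itself. The example below is a minimal sketch, assuming the timestamps can be adjusted before they reach the second dashboard; the function name `widen_range` is hypothetical and this is not a Grafana or Elasticsearch feature. It parses the two ISO timestamps from the question, shifts them by 15 minutes in each direction, and formats the same `@timestamp:[... TO ...]` range.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the question's URL parameters.
FMT = "%Y-%m-%dT%H:%M:%SZ"


def widen_range(start: str, end: str, minutes: int = 15) -> str:
    """Return a Lucene range clause padded by `minutes` on both sides."""
    pad = timedelta(minutes=minutes)
    lo = datetime.strptime(start, FMT) - pad
    hi = datetime.strptime(end, FMT) + pad
    return f"@timestamp:[{lo.strftime(FMT)} TO {hi.strftime(FMT)}]"


if __name__ == "__main__":
    print(widen_range("2020-07-23T05:07:04Z", "2020-07-23T05:11:31Z"))
    # prints "@timestamp:[2020-07-23T04:52:04Z TO 2020-07-23T05:26:31Z]"
```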

Elasticsearch Elasticsearch Multi-Match Query with AND operator for the tokens generated by Hyphenation_decompounder token filter

I used hyphenation_decompounder for the German language and followed the example in the documentation. So far so good, it works! The text kaffeetasse is tokenized into kaffee and tasse. The concern arose when I used a multi-match query for kaffeetasse to find documents where both kaffee AND tasse match. It seems that multi-match uses OR for the tokens generated by the hyphenation_decompounder filter instead of the given Operator("AND") in the multi-match query. Here is my Test-case Mapp

Lucene Search Error Stack

I am seeing the following error when trying to search using Lucene. (version 1.4.3). Any ideas as to why I could be seeing this and how to fix it? Caused by: java.io.IOException: read past EOF at org.apache.lucene.store.InputStream.refill(InputStream.java:154) at org.apache.lucene.store.InputStream.readByte(InputStream.java:43) at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83) at org.apache.lucene.index.FieldInfos.read(FieldInfos.java:195) at org.apache.

Lucene Query Syntax

I'm trying to use Lucene to query a domain that has the following structure Student 1-------* Attendance *---------1 Course The data in the domain is summarised below Course.name Attendance.mandatory Student.name ------------------------------------------------- cooking N Bob art Y Bob If I execute the query "courseName:cooking AND mandatory:Y" it returns Bob, because Bob is attending the cooking course, and Bob is also attendin

How often to call commit on an offline Solr/Lucene index?

I know there have been some semi-similar questions, but in this case, I am building an index which is offline until the build is complete. I am building two cores from scratch: one has about 300k records with a lot of citation information and large blocks of full text (this is the document index), and another core has about 6.6 million records with full text (this is the page index). Given this index is being built offline, the only real performance issue is speed of building. No one should

Lucene Solr: what are the default values for fields which do not have a default value explicitly set?

I'm working with Solr's schema.xml, and I know that I can use the 'default' attribute to specify a default value which is to be used if a value for a given field has not been provided. However, say that I choose not to set the 'default' attribute, which default value will Solr then fall back to? I would think that the field type which I've used for the given field would have a default value which would be used, but I have had no success finding any details about this. Alternatively, I'd think

maximum chars in Solr/lucene term for fuzzy match

I am trying to experiment with fuzzy matching in Solr. In the indexed first_name field of my document I entered "MYNEWORGANIZATION20SEP2011"; the actual text was "My New Organization 20-Sep-2011", but I removed spaces and other chars. If I search directly for the above word (without spaces) with the query "MYNEWORGANIZATION20SEP2011", Solr returns 1 result, the above document ID, perfect! But if I trim two chars from this string and provide "MYNEWORGANIZATION20SEP20~0.8" as the query, I am getting 0 resu

Lucene Sorting Results of a Multi-valued Faceted Search

Using bobo-browse MultiValueFacetHandler to gather multi-valued faceted results, how do I sort these facets by the top-scoring document in each facet? For example, if: Document d1 has facets f1 and f2 and score 3.5 Document d2 has facets f2 and f3 and score 4.7 Document d3 has facets f1 and f3 and score 0.9 Document d4 has facets f2 and f3 and score 2.2 Document d5 has facet f1 and score 3.4 Document d6 has facet f3 and score 5.4 I would expect these results in this order: f3, f2, f1 Tha
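
To make the expected ordering concrete, here is a small sketch of the ranking rule described above: each facet is scored by its best-scoring document, and facets are sorted by that score in descending order. It is plain Python over the example data from the question and does not use bobo-browse or Lucene; the function name `rank_facets` is hypothetical.

```python
def rank_facets(docs):
    """Order facets by the highest score among the documents that carry them.

    `docs` is an iterable of (facets, score) pairs.
    """
    best = {}
    for facets, score in docs:
        for facet in facets:
            # Keep the maximum score seen so far for this facet.
            best[facet] = max(best.get(facet, float("-inf")), score)
    return sorted(best, key=best.get, reverse=True)


if __name__ == "__main__":
    docs = [
        (["f1", "f2"], 3.5),  # d1
        (["f2", "f3"], 4.7),  # d2
        (["f1", "f3"], 0.9),  # d3
        (["f2", "f3"], 2.2),  # d4
        (["f1"], 3.4),        # d5
        (["f3"], 5.4),        # d6
    ]
    print(rank_facets(docs))  # prints ['f3', 'f2', 'f1']
```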

Lucene ElasticSearch issue with querying a multi-valued property

I'm running into an issue in that I have a document indexed with elasticsearch and when I query against a multi-valued field, it returns no results. Here is my search: curl -X GET "http://mncoboss13:9200/boss_model_reservations/_search?pretty=true" -d '{"query":{"match_all":{}},"filter":{"and":[{"terms":{"day_plan":["MO"]}}]},"size":100,"from":0}' Results in: { "took" : 2, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "h

Lucene Stemming + wildcarding: unexpected effects

I am editing a lucene .net implementation (2.3.2) at work to include stemming and automatic wildcarding (adding of * at the ends of words). I have found that exact words with wildcarding don't work. (so stack* works for stackoverflow, but stackoverflow* does not get a hit), and was wondering what causes this, and how it might be fixed. Thanks in advance. (Also thanks for not asking why I am implementing both automatic wildcarding and stemming.) I am about to make the query always prefix query

Lucene lemmatization

I'm indexing some English texts in a Java application with Lucene, and I need to lemmatize them with Lucene 4_1_0. I've found stemming (PorterStemFilter and SnowballFilter), but it is not enough. After lemmatization I want to use a thesaurus for query expansion; does Lucene also include a thesaurus? If it is not possible I will use StanfordCoreNLP and WordNet instead. Do you think that lemmatization may influence search results when using the Lucene library? Thanks

Getting the Payload within a search result in Lucene 4.6.x

I already got the payload inserted correctly within a lucene index as such: addDoc(w, "Lucene|1 in|2 Lucene|3 Action", "193398817"); addDoc(w, "Lucene|1 for|2 Dummies", "55320055Z"); addDoc(w, "Managing Gigabytes", "55063554A"); addDoc(w, "The Art|2 of Computer Science Lucene|18", "9900333X"); the number after a word, is the Payload (simplified for what we'll need later on) I'm doing a simple QueryParser on "Lucene in" as a test. as expected, I'm getting 3 documents in the result. When I ge

Lucene Sorting on date field with Sitecore 7 ContentSearch

I'm trying to add a field sort to a date field in a ContentSearch query. I'm able to filter on the index field properly so I'm assuming the field is getting populated with values properly, however, results are not being sorted properly. Any thoughts? Here's the code I'm using to do query: public static IEnumerable<Episode> GetPastEpisodes(Show show, bool includeMostRecent = false, int count = 0) { IEnumerable<Episode> pastEpisodes; using (var context = _index.CreateSearchCon

Lucene index size is too big

I am attempting to build a Lucene index of about 5000 documents, and the index that is being created seems to be getting too large. I would like to know if there is a way to reduce the size of the index. I am using Lucene 4.10, and the documents I want to index are in various formats (.docx, .xlsx, .pdf, .rtf, .txt). The size of the directory containing the documents I am indexing is about 1 GB. After indexing 3000/5000 documents, the index size is already 10 GB. I haven't found any helpful informa

Lucene Is the default CQ5 Search Configuration incorrect?

I need to optimize the CQ5 Lucene indexing configuration for my application. I want to provide a custom search configuration, but I struggle to really understand the default configuration. (Source: https://helpx.adobe.com/experience-manager/kb/SearchIndexingConfig.html) First question: Are the "include"-tags used in the default configuration correct? For example: The default configuration uses the tag "include" to include the Property "jcr:content/jcr:lastModified" for the nt:file-Aggregate

Lucene elasticsearch results without _ internal fields

How can I have an Elasticsearch query return results without _ internal fields such as _index and _type? Reason: For several pages I use an AJAX call to get results rather than rendering the entire webpage on the server. But exposing the _index and _type internal fields for every document is not only redundant (bandwidth), it's also exposing the index and type names (security issue). Please help!

Lucene MusicBrainz API search provides different results from web page

I'm trying to work with MusicBrainz's API but I'm having some issues with the results of the search endpoint. Let's take an example, searching for Who's Who? - SIZE020 - Klack (Mix Two). Searching from their site leads to this page, with an almost correct first result (probably because the 100% correct information is not in the database at all). Using the API leads to different results, which are causing some issues. I made several different attempts with no success, even if I think I know enough of

Elasticsearch How much space does a field take in Elasticsearch index?

Is there an API or some other way to find out how much disk space a field is taking in an Elasticsearch index? We'd like to pare down on some fields but without knowing how much space a field takes, it's shooting in the dark. We use Elasticsearch, but I'm also OK with looking at a single Lucene index (=ES shard) for this information.

Is Escaping multiple characters in lucene possible?

I have a lot of Lucene queries that contain many characters with special meaning, like colons, slashes, quotation marks, etc. I am aware that it is possible to escape a single character by using '\', but is it possible to enclose a whole sentence in something so that it is matched exactly in a query, without any of the symbols being interpreted? Thanks.
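
One way to picture the per-character approach is a helper that backslash-escapes every special character in one pass, in the spirit of the static `QueryParser.escape` method in Lucene's Java query parser. The sketch below is a standalone Python illustration rather than Lucene code; the function name `escape_lucene` is hypothetical, and the character set is an assumption based on the classic query syntax, so it should be checked against the parser version actually in use.

```python
# Characters treated as special by the classic Lucene query syntax
# (assumed set; verify against the query parser version in use).
LUCENE_SPECIAL = set('\\+-!():^[]"{}~*?|&/')


def escape_lucene(text: str) -> str:
    """Prefix every special character in `text` with a backslash."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL else ch for ch in text)


if __name__ == "__main__":
    # Colons and slashes come out escaped; ordinary characters are untouched.
    print(escape_lucene("path:/var/log (old)"))
```

Escaping the whole string up front keeps the calling code independent of which symbols happen to appear in a given query.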
