SBERT isn't trained on this type of archaic English and you can see it failing. It needs to be fine tuned or you should use a modern Bible translation.
A clear example is the query: "homosexuality"
This returns:
> James 2:3 - And ye have respect to him that weareth the gay clothing, and say unto him, Sit thou here in a good place; and say to the poor, Stand thou there, or sit here under my footstool
It's clearly seeing "gay" but is unaware that the meaning has changed.
A classic issue when applying an ML model out of domain.
Hey, creator of the Bible Semantic Search app here. I 100% agree with you. I hacked together this prototype for the purposes of learning the Pinecone API rather than fine-tuning a language model, but I'm still pleasantly surprised by the quality of the search results despite all the shortcomings you mentioned. The results aren't great, but they're decent. I'm surprised how well SBERT works out of the box considering I haven't done any fine-tuning whatsoever, and like you said, it doesn't fully understand the KJV's archaic writing style. Switching to a version that uses more modern English like the NIV is trivial so maybe I'll do that.
I recommend the WEB as a public domain translation. The likes of the NIV and ESV will be troublesome for licensing reasons. The WEB isn’t my favourite translation (I grew up with the RSV and prefer the RSV or ESV), and it’s a bit clumsier than the ASV from which it mostly springs (with changes both missing and extraneous), but it’s public domain.
Go with the ESV or NASB as these are word for word translations similar to the KJV. There's also the NKJV that uses modern English. Can use modern translations on the backend and still display the KJV verse.
Do those licenses apply to using them on the backend, though? Assuming that the site displays the classic verse, and not the copyrighted verse, as GP prescribes.
What makes the Bible unique in this context is that there is a huge amount of resources that all reference the same basic structure.
One of the things that makes the KJV useful for scholars is that, since it's been the "standard" for hundreds of years, a great deal of other work references its structure. People use it because of this. It's less to do with it being a fabulously accurate translation or whatever. It's just that newer Bibles don't have this wealth of history and documentation referencing them, and many of them are pretty expensive to license, while the KJV is public domain.
For example "Strong's Exhaustive Concordance". If you get a version of the KJV with "Strong's Numbers" you can cross reference words and phrases in their original languages (Greek/Hebrew/etc). This way students can understand some of the original meanings that go into difficult or disputed passages.
Also there is a large number of commentaries of all sorts of different types that reference specific passages.
Besides that, the KJV is mostly valued in Protestant branches of Christianity. Other Christian traditions, such as the various Eastern Orthodox churches, have different numbers of books or arrange things in different orders. There have been various attempts by past scholars to arrange things in more chronological order, too.
This makes the Bible fairly unique when it comes to literature. Each verse of text can have dozens of different "back links" and "references".
So if somebody searches for the subject "Homosexuality" it will get hits in various commentaries. Those commentaries all directly reference verses in the Bible.
So you could show the verse found and why it was selected. That way a reader would be shown "These authors think this verse has to do with homosexuality" and they could click through and find out the justification for this, different translations, what those translations are likely based on, what other Christian sects feel this verse means, and so on and so forth.
I don't know if it would be useful for you, but there is a "Sword Project" that collects and cross references different Bibles and bible resources as well as tools.
> Switching to a version that uses more modern English like the NIV is trivial so maybe I'll do that.
ESV or CSB, please...
How hard would it be to train it on the range of common modern translations? It would be an interesting stress test of the models to see how close searches in different translations are - they're theoretically all communicating the same thing, with different styles and emphasis, but I'd expect a lot of that to fall out in the semantic search (at least, if it were working properly).
You could grab a range of English translations ranging from "very literal" to "thought for thought" (The Message applies here, and I'm not even sure it's thought for thought), do various searches, and see what the overlap in results is.
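A rough sketch of how that overlap could be measured, assuming you already have ranked verse references for the same query from two translations. The function name `overlap_at_k` and both hit lists are made up for illustration; nothing here comes from the app itself.

```python
def overlap_at_k(results_a, results_b, k):
    """Jaccard overlap between the top-k verse references of two ranked lists."""
    top_a, top_b = set(results_a[:k]), set(results_b[:k])
    if not top_a and not top_b:
        return 1.0  # two empty result sets agree trivially
    return len(top_a & top_b) / len(top_a | top_b)


# Hypothetical ranked hits for one query against two translations.
kjv_hits = ["Gen 6:5", "Gen 8:21", "Jer 17:9", "Ps 94:11"]
web_hits = ["Gen 6:5", "Jer 17:9", "Rom 3:10", "Gen 8:21"]

print(f"overlap@3 = {overlap_at_k(kjv_hits, web_hits, 3):.2f}")
```

Comparing by reference rather than by text is what makes this work across translations, though it does assume both translations number their verses the same way.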
In any case, very neat project... concept. :/ It appears to have fallen over, all I get is "Please wait..." when I try to access it. Even without my usual web filters interfering. I think.
The range[1] you speak of could definitely be interesting to add. There can already be quite a difference between translations on this spectrum which could help this tool pick up even more possibilities for particular meanings (particularly on the extreme end of functional/paraphrasing like The Message you mentioned or The Passion Translation). It's quite fascinating looking at a verse across this spectrum and seeing what the different translators chose to focus on or "pull out" of the underlying Greek/Hebrew.
I haven't personally read much of ESV or CSB, just curious, why the preference for those?
Some modern translations might have copyright rules preventing this without express permission (which may or may not be difficult to acquire, obviously other sites/apps have done it). They all have different rules and limitations but I remember reading that the NET Bible was specifically built for more permissive copyright requirements in the Internet age.
https://ebible.org/t4t/ might work? It's available in e.g. https://andbible.github.io/ which bodes well for its licensing. It's also in conventional English (minimal jargon), which is probably a positive.
I just looked up NET because I was thinking the same thing, but it seems their free license is strictly for print versions and <1000 copies. Also if OP cares, it's noncommercial. Their site has an email address posted for licensing inquiries though so you could ask!
As someone else pointed out, the WEB translation is public domain so it might be a better option.
SBLGNT is a decent basis for literary analysis but it’s NT only. The non-trivial part of your work is to navigate the oddly byzantine Bible licensing considerations.
Couldn't they search in a modern translation and just refer back to the ancient one? Seems like the way bible verses are split up should remove the ability of translation to move anything around much.
Ah, that's a good idea. I was considering just fully replacing the KJV with the NIV, but for people who want the KJV, having a mapping between the two versions would work. Do the verses between the two versions map each other exactly? Like they cover the same exact thing, just written in a different style?
To do this properly, you have to map versifications, because some systems use different conventions. As the most significant and pervasive example, some systems treat the title of psalms as verse one (French generally does this), while others treat it as separate (verse zero, effectively; English generally does this), so if you just reuse numbers across such transitions you’ll get consistent off-by-one errors. There are quite a few other similar off-by-one errors, and occasionally more, throughout.
https://wiki.crosswire.org/Alternate_Versification is a decent starting point for looking into the topic, with SWORD’s canon_*.h files fairly tolerable for showing the number of verses in each chapter. Unfortunately, SWORD has never gone as far as doing proper mapping between versifications for some reason. They have some basic mapping somewhere or other, but I can’t remember offhand where or what it is, as I only briefly looked into it five years or so ago.
The most significant differences occur when you switch languages (there are quite a few differences if you switch from English to French or to many Indic languages), but there may be some differences between translations within a language too, e.g. SWORD’s NRSV versification has an extra verse in 3 John and Revelation 12 compared to its KJV versification.
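To make the psalm-title off-by-one concrete, here is a minimal sketch of converting verse numbers between the two conventions. The `TITLE_OFFSETS` table and both function names are hypothetical, and the two offsets shown are illustrative only; a real mapping would need a complete, verified versification table.

```python
# Illustrative only: how many numbered verses the psalm title takes up in a
# system that counts the title as part of the text. Verify before relying on it.
TITLE_OFFSETS = {3: 1, 51: 2}


def english_to_title_numbered(psalm, verse):
    """Map an English-style verse number to the title-counting convention."""
    return verse + TITLE_OFFSETS.get(psalm, 0)


def title_numbered_to_english(psalm, verse):
    """Map back; verses inside the title become 0 ("verse zero")."""
    offset = TITLE_OFFSETS.get(psalm, 0)
    if verse <= offset:
        return 0
    return verse - offset


print(english_to_title_numbered(51, 1))   # first verse after the title
print(title_numbered_to_english(51, 2))   # still inside the title
```

Psalms missing from the table are passed through unchanged, which is exactly the naive "reuse the numbers" behaviour, so any gap in the table quietly reintroduces the off-by-one.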
Just remember the NIV translation of The Bible is property of Harper Collins, who is also amusingly the publisher of many tabloids and The Satanic Bible.
I used to distribute a BibleGateway-ripped copy of NIV as a plugin for the open source GnomeSword project and managed to upset both Harper Collins and the open source scholar community. It was absurd and I refused to stop. I dared them to sue me for sharing the Bible as open source... the headlines would write themselves.
FWIW they never took the bait, but expect empty threat Cease and Desist letters from the Harper Collins legal team.
I thought NIV was Zondervan, bought (as you intimate) by News International (Rupert Murdoch). What I hadn't realised is Harper Collins owns Zondervan and Murdoch controls that.
This is pretty good! I entered, "Every human thought is only evil," and got, among others, the verse I hoped for, Genesis 6:5: "And God saw that the wickedness of man was great in the earth, and that every imagination of the thoughts of his heart was only evil continually." "David cried" got me some unexpected results for "the Son of David", but also the verse about David and Jonathan I expected, and a couple others that were a very good fit.
urllib3.exceptions.ProtocolError: This app has encountered an error. The original error message is redacted to prevent data leaks. Full error details have been recorded in the logs (if you're on Streamlit Cloud, click on 'Manage app' in the lower right of your app).
Traceback:
File "/home/appuser/venv/lib/python3.8/site-packages/streamlit/scriptrunner/script_runner.py", line 475, in _run_script
exec(code, module.__dict__)
File "/app/bible-semantic-search/app.py", line 50, in <module>
query_results = controller.query(
File "/app/bible-semantic-search/controller.py", line 127, in query
results = self.index.query(query_emb, top_k, namespace)
File "/app/bible-semantic-search/pinecone_index.py", line 63, in query
return self.index.query(
I'm having the same issue. For reference, my search phrase was a very simple two words: "Judge not". I'll check back later, curious how this looks and behaves overall.
> This is a Streamlit app I prototyped for performing semantic search on the King James Bible. It conducts full text search as well as semantic search, which is useful for surfacing passages that are similar in meaning to the query, even if the passages don't explicitly contain the query keyword(s). Suppose you wanted to bring up all verses that reference the infamous snake that tempted Eve. In a traditional keyword search system, searching for 'snake' wouldn't yield any results because the KJV uses the term 'serpent'. A semantic search system would take that 'snake' query and retrieve the relevant verses that contain 'serpent' as well as similar verses like ones about reptiles.
> Under the hood, I've generated vector embeddings of every verse in the Bible using SBERT (https://www.sbert.net/), and stored those embeddings in a vector database called Pinecone (https://www.pinecone.io). Every time you submit a query, it's converted to its vector representation using SBERT. That query vector is then sent to Pinecone, which performs an Approximate Nearest Neighbor (https://www.pinecone.io/learn/what-is-similarity-search/) search, retrieving the top n verses that are the most semantically similar to our query. The verses returned are ranked in order of most to least similar.
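The query flow described above can be sketched in plain Python. This is not the app's code: it swaps in an exact brute-force cosine-similarity scan for Pinecone's approximate search, and the 3-dimensional vectors are made up by hand to stand in for real SBERT embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_n(query_vec, index, n):
    """Return the n verse references most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), ref) for ref, vec in index.items()]
    scored.sort(reverse=True)
    return [ref for _, ref in scored[:n]]


# Made-up vectors; a real index would hold one SBERT embedding per verse.
index = {
    "Genesis 3:1": [0.9, 0.1, 0.0],
    "Genesis 6:5": [0.1, 0.9, 0.2],
    "James 2:3": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend this is the embedding of "snake"

print(top_n(query, index, 2))
```

A scan like this is linear in the number of verses per query. At roughly 31,000 verses that would still be manageable; an approximate index mostly pays off on much larger collections.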
(Full disclosure: I work for Pinecone, but I have no connection to this demo.)
In terms of performance, I'm not sure, although I'm willing to bet on Pinecone as they have their own proprietary indexing algorithm. I can speak in terms of developer experience though, as I did use Pinecone to build this app. One obvious but important difference is that with pgvector, you need to spin up and self host a postgres server on your own. Pinecone provides a batteries-included, fully managed experience. Also, it seems that a lot of important operations in pgvector are conducted with SQL statements. With Pinecone, you don't need to deal with SQL or with any ORMs.
In terms of this project, I would not have chosen pgvector. I don't want to deal with the PITA that comes with manually setting up and self hosting a vector database. When I'm building a demo or a prototype, I care about speed of development, which lets me more effectively explore the possibility space of whatever I'm building. I'm not a database admin, so when I deploy the project, I don't want to do database admin tasks. The ease of use of Pinecone's API lets me move fast, and it was very intuitive to learn. One downside of Pinecone is that although they have a generous free tier, their next highest tier is $50/month (and that's the low end of that tier). This is unfortunate for solo devs like myself who are likely to graduate from the free tier and would be willing to pay a bit more for a higher tier, but find the $50/month plan to be overkill. I think they're trying to target startups with that $50/month plan.
Same general difference: Pinecone is SaaS that's ready to go with a few API calls, while Milvus is an open-source tool that requires work to set up, scale, and maintain.
While serving a mission for my church, a fellow member bought me a copy of “Strong's Exhaustive Concordance of the Bible”. I found it really fascinating to flip through correlated terms and be able to draw conceptual lines between them. It was the first time I’d ever come across something like that. It strikes me as something that is relatively rare simply because of what the Bible is, in terms of popularity and commonality across the western world.
I've got a copy and still use it sometimes -- being able to immediately see nearby words (like "glad", "gladly", "gladness") is quite useful, which you'd miss just searching for "glad". Being able to see all uses on a hardcopy in a fixed position that doesn't scroll is helpful too, I find it harder to rationalise big lists when they're scrollable.
(I also use it to explain type checking to friends, given the number of times I've looked up a Hebrew word in the Greek dictionary, or vice versa).
Very cool! I could see myself really using this. I just wish it wasn't so dependent on SaaS - it tends to make things unreliable, and can disappear at any time.
A query "what is the purpose of life?" returns decent results with:
- Precision@2 = 50%
- Precision@5 = 20%
- Precision@10 = 30%
Relevant results:
2) Philippians 1:21 - For to me to live is Christ, and to die is gain.
6) Ecclesiastes 3:13 - And also that every man should eat and drink, and enjoy the good of all his labour, it is the gift of God.
10) Ecclesiastes 2:17 - Therefore I hated life; because the work that is wrought under the sun is grievous unto me: for all is vanity and vexation of spirit.
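For anyone checking the arithmetic, precision@k is the number of relevant results in the top k divided by k. A small sketch using the ranks of the three results judged relevant above (2, 6 and 10); the function name is just for illustration:

```python
def precision_at_k(relevant_ranks, k):
    """Fraction of the top-k results that were judged relevant."""
    hits = sum(1 for rank in relevant_ranks if rank <= k)
    return hits / k


relevant_ranks = {2, 6, 10}  # 1-based positions of the relevant results

for k in (2, 5, 10):
    print(f"Precision@{k} = {precision_at_k(relevant_ranks, k):.0%}")
```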
Would have loved this back in my religious days, or in the brief period of time I was interested in criticizing religion. These days I'm not really interested at all in the Bible, but this is still a pretty cool idea. A little bit of an archaic choice to search the KJV though. Maybe it was chosen due to copyright issues.
The NIV changed their interpretive philosophy around the turn of the century to favor gender neutral language. Most churches switched to ESV at that point.
Must be for Mormons. They only read the King James Version because otherwise the Book of Mormon would sound super weird seeing as how it’s the world’s most popular Bible fanfic. Source: technically a Mormon.
Not really. It's true that KJV is the default for English speakers, but less than half of the church's members are native English speakers. Even among those that speak English, many study the NIV and ESV, for example. It's not even that uncommon to hear non-KJV quotes in GenCon for that matter.
Hey today I learned, thanks! I catch maybe five minutes of General Conference, my wife’s more likely to watch it than I am. I was also very confused about what a tabletop gaming convention founded by Gary Gygax (of Dungeons and Dragons fame) had to do with Mormons at first hahah.
For example I expected Luke 9:13 as a result for the phrase "you feed them" or "you give them something to eat." The King James reads "give ye them to eat" but the app never found it.
Is this site hijacking my history or something? I can't get back to HN after opening it; all the history is spammed with their own domain. Poor quality site; this behavior should be discouraged.
Hmm. Searching “baby” returns nothing but references to Babylon. Not sure why this was my first and only search, less clear why it is doing partial text matching.
The app performs full text search as well as semantic search. The full text search results are presented first, so if you scroll down to the bottom of the page you should see the semantically related results under the header "Semantic Search Results"
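On the "baby" matching "Babylon" point: one plausible cause, assuming the full text search is a plain case-insensitive substring check (the thread doesn't confirm how the app actually does it), and how a word-boundary match would behave instead. Both function names are hypothetical.

```python
import re


def substring_match(query, text):
    """Naive match: true whenever the query appears anywhere in the text."""
    return query.lower() in text.lower()


def whole_word_match(query, text):
    """Match only when the query appears as a complete word."""
    pattern = r"\b" + re.escape(query) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


sample = "By the rivers of Babylon, there we sat down"

print(substring_match("baby", sample))    # matches inside "Babylon"
print(whole_word_match("baby", sample))   # no standalone word "baby"
```

Whole-word matching has its own trade-off: it would stop "glad" from finding "gladness", so which behaviour is right depends on what the search is for.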
I figured it was the quintessential version of the Bible ¯\_(ツ)_/¯. I'm not religious so cut me some slack haha. Looking back, I should've used a version using modern English.
KJV is still wildly popular and some sects even regard it as an inspired translation. Never heard that about anything else.
NKJV is more readable for people who grew up reading blog posts instead of old books and generally retains the poetry of the KJV, but it's under copyright. I think it would be allowed for a use like this though, because they explicitly allow (as do most translations) quoting up to a certain number of verses.
My understanding of the Eastern Orthodox Church is that they see the Septuagint as an inspired translation of the Old Testament (from Hebrew to Koine Greek).
Is just two hours on HN enough to kill a Streamlit app? Or is this running into limits of a certain plan level? Or is the problem with the code itself?
Isaiah 1:15 - And when ye spread forth your hands, I will hide mine eyes from you: yea, when ye make many prayers, I will not hear: your hands are full of blood.
AttributeError: This app has encountered an error. The original error message is redacted to prevent data leaks. Full error details have been recorded in the logs (if you're on Streamlit Cloud, click on 'Manage app' in the lower right of your app).
Traceback:
File "/home/appuser/venv/lib/python3.8/site-packages/streamlit/scriptrunner/script_runner.py", line 475, in _run_script
exec(code, module.__dict__)
File "/app/bible-semantic-search/app.py", line 11, in <module>
index = PineconeIndex("qa-index")
File "/app/bible-semantic-search/pinecone_index.py", line 16, in __init__
self.index = self.connect_to_index(index_name)
File "/app/bible-semantic-search/pinecone_index.py", line 30, in connect_to_index
index = pinecone.Index(index_name)
File "/home/appuser/venv/lib/python3.8/site-packages/pinecone/index.py", line 34, in __init__
openapi_client_config.api_key = openapi_client_config.api_key or {}