One thing rarely discussed with the rise of big data is how to do efficient querying, especially at scale.
I've had a ton of data science interviews which ask how to reimplement binary search from scratch (which I would never do on the job), but not anything about how to do efficient JOINs and query nesting.
Exactly. Optimising queries becomes a critical part of the job when you run complex JOINs on millions of records. Just getting the data can take a huge amount of time before you can even consider models.
I’ve been told by IT departments at numerous organisations that Hadoop will solve all of our team’s query inefficiencies.
That's also why we train new members of the team to write efficient queries and joins, and to spend time upfront structuring their problems.
I work as a biostatistician and I've been tasked recently with querying large databases using SQL, in addition to analysing the data. However, my programming background is very limited and thus I'm sure my queries are very inefficient.
Could you point me to some materials/texts about how to improve querying efficiency for SQL? If it's oriented for beginners then that would be ideal.
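For a beginner, one habit that tends to pay off is asking the database how it plans to run a JOIN, before and after indexing the join column. Below is a minimal sketch using Python's built-in `sqlite3` module. The `patients` and `visits` tables, the `idx_visits_patient` index, and the `join_plan` helper are all made up for illustration, and other engines (PostgreSQL, MySQL, SQL Server) have their own `EXPLAIN` syntax and may plan the same query differently.

```python
import sqlite3


def join_plan(with_index: bool) -> str:
    """Return SQLite's query plan for a simple JOIN on hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per patient, many visits per patient.
    conn.executescript(
        """
        CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE visits (
            visit_id   INTEGER PRIMARY KEY,
            patient_id INTEGER,
            cost       REAL
        );
        """
    )
    if with_index:
        # Index the column used in the JOIN condition.
        conn.execute("CREATE INDEX idx_visits_patient ON visits(patient_id)")

    # EXPLAIN QUERY PLAN describes the strategy without running the query.
    rows = conn.execute(
        """
        EXPLAIN QUERY PLAN
        SELECT p.name, v.cost
        FROM patients AS p
        JOIN visits AS v ON v.patient_id = p.id
        WHERE p.id = 42
        """
    ).fetchall()
    conn.close()
    # The human-readable description is the last column of each plan row.
    return " | ".join(row[3] for row in rows)


plan_without = join_plan(with_index=False)
plan_with = join_plan(with_index=True)
print("no index:  ", plan_without)
print("with index:", plan_with)
```

The exact wording of the plan varies between SQLite versions, so it is safer to look for whether the index name appears than to rely on a specific output string. On real data, timing the query before and after is the only reliable check that the index actually helps.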