Pulling a database into a nice set of classes is the new thing. It's called ORM -- object relational mapping. Microsoft's Entity Framework is an ORM for C#. Hibernate is one for Java. The idea is that databases are a pain to use, where-as classes are both easier and more flexible. So of course we let the system translate to and from yucky SQL and database tables. But now I'm wondering, wouldn't learning SQL be easier?
On the "regular programming is just better than SQL" idea: suppose we want to find something complicated, like "parties with an odd number of guests, where none are friends, or friends of friends". In a program, we can grind out each part, bolt on the limitations, and we'll get something ugly but working. Solving that in SQL involves a stroke of inspiration, I assume, and may not be possible (can SQL check even/odd?) As someone who took a database course a long time ago and once taught Access, I can say that's definitely true.
This first complaint is legitimate. Possibly the only one that will be: SELECT isn't
a great word, since it doesn't do the selecting. WHERE selects. SELECT merely tells
us what parts to display. SELECT id FROM people WHERE name<>"unknown";
selects people with names, and displays only their id's. Literally the first thing people see about SQL is the confusing SELECT keyword.
But the real problem isn't looks -- it's the backwards way SQL works. Let's start with Cats and the foods they like. This is what convinces most people that databases are clunky. Here's the way it should look, as a class:
class Cat { int ID; // someone somewhere will probably need this string name="Fluffy"; // the default name List<string> Foods; // <-- a lovely list of this cat's favorite foods } List<Cat> Cats; // create the actual list
Next is the database method. Notice how -- this is crazy -- cats don't say what they eat. We need another table for that:
create table Cats ( id int primary key, name varchar(20) default 'fluffy' /* databases need to limit the size of strings */ ); create table catEats ( food varchar(20), cat_id int /* this must match the id of some Cat */ );
Database rows must be a fixed length. There's no way for each Cat to include a
list of foods (well, sort of). We're forced to make a separate table.
To a programmer, catEats
is a frankenstein where every cat's personal
food lists are jammed together.
insert into cats (id, name) values (101, "Boots"); insert into cats (id, name) values (102, "Max"); ... insert into catEats (food, cat_id) values ("mice", 101); /* Boot likes to eat mice */ insert into catEats (food, cat_id) values ("milk", 101); /* ...and milk */ insert into catEats (food, cat_id) values ("corn", 102); /* Max is corn-fed */ ...
This is how our lovely individual food lists look in a database:
select * from catEats; +--------+--------+ | food | cat_id | +--------+--------+ | mice | 101 | // in a working database, | milk | 101 | // these wouldn't even be grouped by cat_id, | corn | 102 | // making it even harder to read | bread | 103 | | mice | 103 | | apples | 103 | +--------+--------+
As coders, we've had to use this as a hack sometimes. We know the problem:
cat_id
is a contextless number. We need to be told it's into the table
cats
and has to match id
. Bleh. But it turns out
database people know this, and created the Foreign Key concept:
// better definition of catEats table: create table catEats ( food varchar(20), cat_id int, /* this is mark-up for the cat_id field above */ /* tells us that cat_id is meant to match an "id" in the "cats" table: */ foreign key (cat_id) references cats(id) };
Even with the hints from Foreign Key, catEats
as a separate
table still feels wrong.
But the proof is how well it works. Let's find the foods eaten by the third cat:
select food from catEats where cat_id=103; /* assume 103 is the correct cat# */
Not too bad. But let's defeat SQL by asking for foods with 4 or more letters:
select distinct food from catEats where length(food)>3;
What? SQL has a string-length function? I should have known. And it can easily avoid
duplicate items with distinct
? Well, that makes sense, having a single word
for a common operation. But this next one is going to
be a real problem: "names of cats that eat mice". The code version is pretty fun:
// cats that eat mice: for(int c=0; c<Cats.length; c++) { if(Cats[i].Food.contains("mice")) print( Cats[i].name );
Meanwhile, the database version requires a dreaded JOIN. Cat names are in one table, while food names are in the other. We need to manually match them on the shared cat-ID. Here's a basic join to get rows like "Snowball, mice":
select name, food from cats join catEats on cats.id=catEats.cat_id; /* get row with matching cat_id's */
It's a freaking nested loop. We to check every cat against every food row for matching id's. Bleh. The end result doesn't even look good. We're trying to attach foods to their cats, but this is the best we can do:
// cats join catEats on matching id's: Fluffy mice Fluffy catnip Boots mice Boots cheese Boots phone cords ...
Terrible as that is, let's finish it by removing non-mice rows and duplicate cat names. This gets a list of cat names, like "Fluffy, Boots, Henry", but only the mice-eaters:
select distinct name from cats join catEats on cats.id=catEats.cat_id where food="mice"; /* <--- the new line */
The funny thing is, it doesn't work like we think it does. In this case, the WHERE goes 1st, getting rid of non-mice foods from the food list. The JOIN simply replaces a cat_id with the name -- it has preset links to avoid a loop through cats. SQL is a descriptive language -- you describe the result you want, and it figures out the best way of doing it. SQL looks a lot slower than it is -- if you try to read it top-to-bottom as a series of function calls. Which is apparently what I was always trying to do.
There's also another way to write "cats that eat mice" in SQL. That doesn't seem right. Code is flexible. Databases queries aren't -- they're like finding the exact words to cast a magic spell. But this is different and works:
// names of cats that eat mice, version #2: select distinct name from cats, catEats where food="mice" and id=cat_id; /* <-- what? an AND?
It turns out that select from cats, catEats
does a full,
every-pair JOIN (including cats with foods they don't eat). At the end we strip out
all but the mice-eaters. Also food="mice" and id=cat_id
? Who knew SQL had an AND?
Anytime we need to relate cat names to their foods, we'll need that JOIN. It turns out that SQL has functions, except better, which it calls VIEW's. We can write one for that join:
/* creates catOneFood as an alias for the join: */ create view catOneFood as /* <--- catOneFood is the alias */ select cats.id, name, food from cats join catEats on cats.id=catEats.cat_id;
It seems obvious we'd pre-make that view. So "names of cats that eat mice" is merely this:
/* shorter "casts who eat mice": */ select distinct name from catOneFood where food="mice";
As coders, we can still laugh at how a VIEW is a global variable. Silly SQL can't have local variables ... what, it can?!? Arrg. WITH is the same as VIEW, but only for one query. Below, we make a temporary table of long-name cats, then use it to pick out foods they eat:
/* foods eaten by long-name cats, solved by first making a local table: */ with longNameCates as ( /* longNameCats is an alias */ select name, id from cats where length(name)>5 ) /* now the real query: */ select food from longNameCats join catEats /* <-- using the alias */ where longNameCats.id=catEats.cat_id;
If that's not enough, there's even an alternate syntax, which looks pretty good. We can
compute a temporary table anywhere, as long as we use parens and add "as myAlias
" to the end:
/* yet another way to write long nat-name foods: */ select food from (select name, id from cats where length(name)>5) as t2 join catEats on t2.id=catEats.cat_id;
If I didn't know better, I'd say SQL was written in the general spirit of a real programming language.
That length function is making me so mad. As we know, function calls and flexible tests are things code can do, and SQL can't. Let's rummage through SQL's various tests to prove how limited its conditional expressions are.
It has AND and OR. I should have known that. It's also got something named IN, which is merely an "or" shortcut:
// foods fluffy and boots can eat: select distinct food from catOneFood where name="fluffy" or name="boots"; // IN is nothing -- just a shortcut for OR: " " " " where name IN ("fluffy", "boots");
SQL has the full complement of less-than and so on. Well, fine. after I saw
length(name)>5
I assumed length(name)<=3
was also
allowed.
Charmingly, not-equals is the old-fashioned <>. BETWEEN is another
shortcut for silly SQL-users who can't figure out amt>=5 AND and<=10
(which is
also legal sql). They have to use amt BETWEEN 5 and 10
.
But then we come to LIKE. It checks for regular expressions. Regular freaking expressions! I can barely do that in coding languages that aren't Perl. In SQL it's easy:
// (underscore is "any one thing" and % is "any number of anything") select name from cats where name like '_h%'; /* sheena chip zharf */ where name like '%Z'; /* ends with Z */
I don't even know why I'm trying to explain it. It's regular-expressions. They're just great. I was going to get mad that mySQL forces you to use a different syntax for more complex reg-exps, except that's what everyone else does. And the alternate syntax is just fine:
select name from cats where regexp_like(name, '[aeiou]$'; /* ends with a vowel */ where regexp_like(name, '^[a-z]*[a-z0-9]$'; /* only letters, may end in number */
Hmmm. To sum up, SQL has the boolean logic of a normal computer language, and easy-to-use regular expressions. SQL is still clearly inferior to being able to write a program, but not as inferior as I first thought.
It turns out databases' ridiculous consolidated-list tables are actually useful. Who knew! I mean, besides the inventor and everyone who's used them for the past 40 years. But beyond that, it's a complete surprise that SQL does some things better than I can do in code. Let's take the crude catEats table again, looking something like this:
cat_id food ------ ---- 101 mice 101 catnip 102 mice 102 cheese 102 phone cords ...
If SQL wasn't so terrible, this would be several small lists. Cat#101 would have food lisr (mice, catnip), cat 102 would have foods list (mice, cheese, phone cords), and so on. Or maybe each food would have a list of the cats which eat it. It turns out that SQL's odd format lets you do both, equally well.
GROUP BY lets you slice a sublist using any category you want, then lets you pull aggregate data out. At first it looks like another way to remove duplicates:
/* a useless group by, not showing its real purpose: */ select cat_id from catEats group by food; // mice // catnip // ... each food is listed once
So we found a longer way of writing: select distinct food from catEats;
.
But we're not seeing anything interesting because we didn't ask. The rows in a GROUP BY
are actually summaries of that group. In this case, various stats about every row with that food.
Below we ask for some:
select food, count(*), max(cat_id), sum(cat_id) from catEats group by food; // mice 2 102 203 // 2 cats eat mice, 102 is highest that does, sum is 203 // catnip 1 101 101 // a single cat (clearly cat #101) eats catnip // ...
You can specify the operation and the column it's on, giving lots of options:
max(wt)
would be the
heaviest food, and sum(price)
is the cost to buy 1 of each.
max(length(food))
is the longest food name each cat will eat:
/* get list of cat names and len of longest foods: */ select name, max(length(food)) as fLen /* column heading will be fLen */ from catOneFood group by name ) // fluffy 6 <-- fluffy might eat "rice" and "donuts" // max 4 <-- max might possible eat "ham", "corn", and "tin" // boots 6 <-- boots may eat "donuts" and "x"
Did you catch that nested function call? max(length(food))
?
SQL can nest function calls. I'm having more and more trouble hating it.
/* cat names and len of longest foods */ with catAndFoodLen as ( /* <-- catAndOneFood is an alias */ select name, max(length(food)) as fLen from catOneFood group by name ) /* use "fluffy 6" to filter for only fluffy's length 6 foods: */ select catOneFood.name, catOneFood.food from catOneFood join catAndFoodLen on catOneFood.name=catAndFoodLen.name and fLen=length(catOneFood.food);
But sql is still terrible. Get this: if you use group-by you have to replace WHERE with HAVING. Jerks. Here's "foods eaten by at least 2 cats":
select food count(cat_id) from catEats group by food /* sample row: "mice 4" */ having count(cat_id)>1; /* the same as a WHERE, removes some rows */
The thing is, there are actually 2 places we can filter rows: before the group-by, and after. Rows in the original table, or rows containing the computed summary. WHERE vs. HAVING. They're similar, but different. This uses both to get foods eaten by at least two 200-series cats:
/* foods eaten by at least 2 cats with id's of 200 or more: */ select food count(cat_id) from catEats where cat_id>=200 /* before we group by food, remove low-id cats */ group by food having count(cat_id)>1; /* remove low-count rows from the group-by summary */
The positions of WHERE and HAVING even help show us when they are applied.
The last thing I can complain about is the way count and max can be used outside
of group-by. select count(*) from cats;
gives the total amount of
cats. By comparison, select name, count(*) from cats group by name;
is a list with how many times each cat name occurs. That's confusing, but not really.
I already know I need to learn not to read sql front-to-back.
I may have found the real problem with sql: it makes too much darned sense. Computer languages are filled with bits of oddness -- shortcuts that are actually longcuts, patches and work-arounds. How are we expected to learn a language without a bunch of funny quirks?
Originally we created catEats
because we weren't able to give each
cat a personal food list. In our minds, it's not a real table. It's
a hack to store sub-tables.
Databases have a whole thing to describe this. When you have a single foreign key into another table, relational databases call that a many-to-one relationship, and draw it with a V-shaped arrow:
cats -------< catEats
We can imagine this as sublists. The V, and the direction it faces, represents how each cat has a V around all of its foods:
CATS CATEATS / 101 mice / 101 catnip cat #101 101 bread \ ... \ ... / 102 rice cat #102 102 shoes \ 102 mice
It's sooo pretentious, especially calling it many-to-one. Each food-item has an arrow to 1 cat. That's the "one". If we draw them as lines and think of them as 2-way, then each cat could have many arrows coming out of it to foods. That's the "many". Yeesh! Why not call it sublist ownership, which is what it is?
But suppose we have a list of states and every cat lives in one. Cats now have a state_id field. Technically that's another many-to-one arrow. Our new picture looks like this:
states -------< cats -------< catEats
/* there would be a master table for every state */ create table states ( id int, stateName varchar(16) ); insert into states (id, stateName) values (0,"catLand"); ... /* grow the cat table to have state_id's: */ alter table cats add state_id int; /* add extra to cats for the state_id */ update cats set state_id=1 where id=101; /* start adding state_id's to cats */
By my rules, this means cats is supposed to be split into sub-lists for each state. But that's not right.
Cats is the main list. The link to where a cat lives is only an attribute. So what does the
---<
many-to-one arrow actually mean? It could be two different things, so
how do you know?
Obviously, it means whichever we need it to mean, depending on what we're trying to
figure out right now. select states.id, count(*) from cats group by state_id;
treats cats
as if they were grouped by state. select name from cats where state_id=2;
is
like looking at state#2's personal cat-list.
What about "cats from states starting with a vowel?" Here's sql where we're looking through cats as one list, checking every state:
select name from cats join states on cats.state_id=states.id where regexp_like(states.stateName, ^[aeiou]);
On the other hand, we often imagine the WHERE as going first. Which means we're finding the vowel-states, then combining their cat sub-lists.
"How many different foods do the cats in each state eat?" is a little of each. It groups cats by state, then combines the foods for only those cats into a single list to get the count:
/* these 1st two lines replace the state_id with the stateName: */ select stateName, amt from states join /* this does the work: gets "state_id unique-food-count": */ (select cats.state_id, count(distinct food) as amt from cats join catEats on id=cat_id group by state_id) as t2 on states.id=t2.state_id;
So what does all of this mean?
There are lots of ways to arrange data. When we try to solve any given problem, we'll pick the way that makes it simplest. In a database, each particular query shapes the data into whatever form we like. Our only option for the actual storage is in a table, but that's fine. It reminds us that each time we have to chose to see it as a big list, or sublists, or whatever else.
In code it seems great that we can store data in the "best" way. But that's not so great. We still need to arrange it in various ways to solve various problems. It's optimized for the most common way, but it's going to be more of a pain for the problems where it isn't.
The example with the foods cats eat is clearly wrong. Listing the name of the food is silly. We should have a master food list. Instead of using the name of a food, use the food's id#.
/* master list of foods: */ create table foods ( id int primary key, name varchar(20) default 'chicken' vitA float /* amount of vitamin A */ );
In the modified catEats, "102 6" means cat#102 eats food#6. The new picture just a mess. catEats wants to be sliced up into cats, and also into foods. That makes no sense:
cats---< catEats >--- foods
To get a list with names of cats and foods, we'll need to join all three tables on the id's, which is a monstrosity:
select cats.name, foods.name from cats join catEats on cats.id=catEats.cat_id join foods on food.id=catEats.food_id; 3 | - 101 tuna | \ /fluffy => tuna, fluffy \food:3 | --/ cat #101 |
It's just plain silly. The catEats table doesn't have any real data. It only holds the 2 ends of an arrow. Oh, wait. That's an edge-list. It's a perfectly fine thing. Programmers love edge lists, and we store them exactly that way -- as pairs of end-points. The picture still doesn't make visual sense to me, but "table with 2 fat-end V's coming from it means an edge-list" is an easy rule to remember.
But the catEats design is still bad, since it's symmetric. It shouldn't be. I created it to give each cat a food-list, but the picture makes it look as if each food also has a cat list. Rats -- I just realized that's a good thing. It's another option. I can get tuna's cat-list for "cats that eat tuna":
/* look at tuna's list of the cats that eat it: */ select cats.name from cats join catEats on cats.id=catEats.cat_id where catEats.food_id=3; /* 3 is tuna */
Maybe my basic problem with sql is I suck at abstract thinking. If I saw
Foods[3].Cats[i]
in a program, I'd know we were looking at the cat sub-list
for that food. In sql that's still a thing, but it's not spelled out.
I can still prove SQL is terrible. What if each cat, next to the state_id, had a breed_id? We get this picture, where cats has two fat-end V's coming from it:
// each cat has 1 state and 1 breed: states -------< cats breeds ----------^ // same picture, with breeds moved: states ------- < cats > ------- breeds
It looks suspiciously as if cats is an edge-list linking states and breeds, which we know it isn't since we made it. Oh, gah! SQL is once again right and I'm wrong. The picture doesn't say cats is an edge-list. It's showing us that cats can be used as one. We could use it to find the number of states where each breed can be found ("tabby 46" means tabbies are found in most states):
select breedName, count(distinct cats.state_id) from breeds join cats on breeds.id=cats.breed_id group by breedName; /* <-- grouping by breed */
That's pretty cool, but relational databases still muck it up. They say states and breeds have a many-to-many relationship. That's just dumb. There's no relationship between states and breeds. You'd have to use the links in cats ... oh, wait, that's what we've been doing.
In fact, many-many is even a good, neutral way of saying it. It's letting us know we could group breeds by state, or the reverse, or could just think of each item having 0+ arrows to the other table.
I didn't mention this earlier, since it's so obvious -- programs can do math, IF's and other
logic that databases simply can't. We can look up to see that besides length(name)
,
SQL has ... what!! It's got a ton:
/* toLower and toUpper: */ select upper(name) from cats; /* FLUFFY */ /* padding, left and right: */ select lpad(name,10,"-") from cats /* ----fluffy */ /* string replace: */ select replace(name,'f','X') from cats; /* XluXXy */ /* it even works on strings? Arrg: */ select replace(name,'ff','X') from cats; /* fluXy */ /* ordering with strcmp: */ select * from cats where strcmp(name, "h")>=0;
But we still have it beat on math. SQL can't find names with an even number of letters. What! It can? Using the same math a program would?:
/* even-length cat names: */ select * from cat where length(name)%2=0; /* actual math: */ select * from cats as c1 join cats as c2 where c1.id-c2.id>2; ^^^^^^^^^^^^^ Are you kidding me?
Its even got IF's:
/* spreadsheet-style nested value-returning if's: */ select name, if(shipType=0,"boat", if(shipType=1,"ship","trawler")) from fleet; /* an IF to choose the WHERE test: */ select * from cats where if(length(name)>5, state_id=1, state_id=2);
I guess not. An ORM can convert our database into classes where each cat has a personal sublist of what it eats. That seems pretty cool. But SQL can also do that with an appropriate WHERE or GROUP BY. Sql can also easily treat catEats as 1 big list, or sublists on all sorts of criteria.
The way we arrange data in classes communicates the purpose. If there's no master list of all cats, instead a cat-list for each state, that tells us the program is mostly about the regional distribution of cats. But when do we need to be told that?
Putting our tables into classes gives us editor support. That's nice. SQL is just text strings -- the editor can't auto-complete a table name or flag WJERE. But SQL queries aren't that long, and your program can build them out of smaller strings. You could even write functions to help.
I think ASP.net is a key to understanding the ORM idea. It lets you write round-trip webpages using only C#. It translates it to and from javascript. An ASP.net programmer need know nothing about the web. As you'd guess, it's clunky, but there's an appeal to a having to learn nothing new. javascript and html are pretty easy, but apparently can be scary. SQL is worse -- it looks like COBOL, isn't case-sensitive, isn't imperative. If "make programmers learn another language" is a deal-breaker, an ORM looks pretty good.
---------------------------------------------------------------
Note 1: We could create fields food1, food2, and food3. But come-on.
As programmers, that's the same as trying to make a list using Food f1, f2, f3;
.
It's terrible in a DB for the same reasons it is in a program.
Comments. or email adminATtaxesforcatses.com