spacer.png, 0 kB
  June 20, 2013, 07:57:10

 
Welcome, Guest. Please login or register.
Did you miss your activation email?

Login with username, password and session length


Pages: [1] 2 3
  Print  
Author Topic: Some thoughts on possible ways to speed up set generation in the future  (Read 7405 times)
thorwak

Posts: 202


View Profile
« on: April 14, 2010, 13:40:30 »

I'll provide some info on my setup before I get to the point since it's kind of relevant for the figures presented.

Physical host:
Intel i3 (2 physical cores (+ HT -> 4 logical cores)), each core running at 4 GHz
4 GB RAM (DDR3 running in dual channel, 1300 MHz or something like that)
500 GB system disk (VMs reside here too)
500 GB + 500 GB in raid0, ext4, dedicated to mysql usage, around 200 MB/s sustained read/write measured, load at around 0.45 at this data rate
Centos 5.3, kernel 2.6.32.5, 1000HZ, bigmem, 32bit, vmware server 2

MySql running on the physical host, which is mostly idle otherwise (load around 0.00-0.05 if mysql is idle). Load is between 1-3 when a huge set generation runs for hours and hours, yet there is actually free CPU time, typically one of the 4 cores idle (mysql can't thread these jobs enough I guess (I only run one at a time or load goes completely crazy))

URD webserver is running in a VM, 2 virtual cores (4 GHz), 512 MB RAM, Debian 5, Apache. urdd running in same VM. Load on VM this is usually 0.03 since mysql does all the real work on the physical machine.

All networking (physical and virtual) running at 1 GBit (switched)


Some examples of times it takes to generate sets from groups with max rentention (close to 2 years):

1.7M headers: 2 min 17 sec, 12.400 headers/s

2.4M headers: 41 sec, 58.536 h/s slightly more headers, but less sets (bigger posts)

4.3M headers: 6 minutes 0 sec, 11.944 (fairly big posts)

20.2M headers: 2 hours 26 min 15 sec, 2301 h/s (very many posts, not so big ones)

54.0M headers: Still running happily after over 24 hours, I have no idea have far it has gotten. Ready tonight? Next week? (very many sets, not so big ones)

1.7B headers: yeah right... Maybe for my grandchildren Cheesy

Data can be a bit skewed because of caching and so on, but one can clearly see one thing:
The actual table size/number of headers is not the only important factor, the number of sets seems
to matter a lot. The time it takes seem to scale almost logarithmically rather than linearly with header/set count. Could this open opportunites for optimizing?


As I stated elsewhere, I have yet to investigate the inner workings of URD and I am by no means any DB wizard, but:

What if one would split the large tables into, say one per day? (configurable option of course) say we have tables 1, 2, 3, .. 31 for 31 days of retention. We have another config option that controls how many days a set may span over. We set this to 3 in this example.

To be able to create sets for one month, we would have to search like this (worst case, already joined articles should be marked as "used") :

1-3, 2-4, 3-5, .., 29-31

That is, 29 different searches to generate sets for one group, and we build the list of complete sets gradually and we don't have to check against every existing set for every new search. This is of course a bit of reduntant work, but remember most of the headers will be already marked as member of existing set however, when we go over them the second and third time. Caching will also help a lot. The problem is that MySQL completely chokes on the extremely big selects/joins/however it's done as we try to do many million headers and thousands of sets at a time. (I'm impressed by the stability so far though...)

Say we have 54M headers as in the data above. 3 days per search would mean 10% of the amount of headers so around 5M headers. This could probably be done in 5-10 minutes, maybe even a lot less (remember most articles already marked), so total time would be, at worst, 150-300 minutes. 6 hours tops for something that will take days now probably. Not to mention, we would never have to go through everything from the beginning again once it's done...

20.2M headers with same math: 10% of data = 2.2M headers, maybe 30 secs a piece, 30 secs x 30 days = 15 minutes (at most), instead of 2.5 hours.

In reality, the measurements above have a retention of almost 2 years so for those groups one could use a lot more than 1 day per table and still get these results.

A nice bonus would be that one could have a working progress meter with this scheme. As it is now, my 24h set generation is still reporting 0% Smiley


More advanced strategies could include moving all headers already part of COMPLETE sets to another, "long-term", table so re-searching for new complete sets would only go over incomplete sets of headers. That would make it possible to catch complete posts that actually stretch over more than, in our example, 3 days.


This is all comepletely theoretical and thought up in about 10 minutes, probably leaving out important facts I don't know about, various logic errors an so on, but could there be something to it? Has something remotely similar been tried?

I realize I may have to do this myself since it probably requires a complete rewrite of many vital parts of URD, and for most ppl this is not even interesting, I guess?

I have far less time than I would like, and I'm gonna concentrate on finishing the Swedish translation first. But I would very much like a discussion and/or comments on my reasoning, and if anyone wants to have a go at this (creating a branched version? Smiley please go ahead!


I have a few other ideas to make initial indexing of extremely large groups nicer but maybe one should take one idea at a time Tongue


edit: typos
« Last Edit: April 14, 2010, 13:52:13 by thorwak » Logged
spearhead
Administrator
*
Posts: 1041


View Profile WWW
« Reply #1 on: April 14, 2010, 13:57:41 »

The current bottleneck AFAIK really is the IO, it is simple generating a lot of data (writing tables and indices) while it is reading even more data. If you time it you see that URDD is doing a brief time downloading it and then there is a long time updating the tables. Hence using compressed headers has hardly any effect timewise.

So the most gain could be made when limiting what is actually written.

Splitting it in separate tables has a lot of drawbacks:
- everything gets more complex (reading updating etc)
- if timestamp is taken, which one do we take, the first article of the set or the last or...?

I never cared much for the performance as I have a low expire date, so I never much looked into it either. But the process as a whole can be improved a lot.
Logged
thorwak

Posts: 202


View Profile
« Reply #2 on: April 14, 2010, 14:03:55 »

The measurements doesn't include any downloading of data at all, I'm ONLY talking about actual set generation from existing already downloaded headers. And if you examine the figures I posted you'll see that disk IO cannot be the explanation (then it would scale more linearly).

I agree it would be a LOT more complex though, that's why I'd like to discuss it. Smiley

I also understand this is a non-issue for many people, but not for all. If my reasoning is anywhere close to the truth it would make things that are completely impossible today (complete retention), because of the time it takes, very reasonable.

edit: clarification
edit2: Documentation for future comparisons: The 54.0M header set generation finally finished, after 49 hours, 35 minutes, 43 seconds. 302 headers/sec. Terrible Wink
« Last Edit: April 15, 2010, 15:38:51 by thorwak » Logged
thorwak

Posts: 202


View Profile
« Reply #3 on: April 16, 2010, 11:52:34 »

I will get around to trying to implement what I suggest above, I hope.

However, a "light" version of the above: The complete showstopper isn't that it takes days or even weeks for very large groups to generate sets - it's that it takes this long EVERY time, after every update, so i't impossible to keep up even after the initial sync.

This could be avoided, as a temporary workaround, by having an "archive" function. This would freeze a group with it's sets and eveything and never update it again, and renaming it (a.b.teevee.archive or whatever). It would only be used when browsing/searching sets. Possibly deleting old sets eventully when they go old.

Then one could have a fresh start with a more sensible retention of a couple of weeks or whatever. One would have both speedy updates (seconds/minutes) and full retention.

One can already achieve that by having 2 URD installations obviously, but that isn't so handy of course.

Advanced version would be to automatically move sets and articles to the archived version of the group eventually, if at all possible, but I guess it gets more difficult again here.

Finally, it would be cool to have the possibility to create an archive, from the usenet server like this: Get all headers. Before set generation, throw away everything that is older than x but newer than y. That way one could have several archives for the huge groups and make them manageable (monthly, or even weekly).

I know, it's not needed for most. Mostly thinking out loud so far, but I want this and I'll implement something like this at some point, maybe as a branched version if it's frowned upon for the mainline Smiley
Logged
spearhead
Administrator
*
Posts: 1041


View Profile WWW
« Reply #4 on: April 16, 2010, 12:43:30 »

For set generation I see more in rewriting the function so that after downloading a set of new headers, existing parts and sets get updated and new ones get added in stead of clearing the table. This should scale really well towards the size of the existing data set and depends only on the size of the headers to add. I'll look into this for 1.0.5. It probably needs some slight alterations in the tables.

I think that would be easier than using multiple tables, and for really large groups (eg boneless) that wouldn't make much difference anyway. Multiple tables makes everything really complex. Browsing, autodownload, expiring etc.

Implementing the archive function would be real easy. Basically find all sets that conform to your requirements than match that against all parts and articles and do a extraction to a file, or other db or whatever.
Logged
thorwak

Posts: 202


View Profile
« Reply #5 on: April 16, 2010, 13:01:19 »

That sounds promising for sure. I'll follow it with interest and I'd be happy to test some pre-alpha test code if there is need (as well as experiment on my own of course, when time permits)

Just having the archive function would improve the project by about 1000% for me personally Wink


Part of the problem [what makes set generation take so long] I would say is that it tries to match headers against every single existing set in the group (but correct me if I'm wrong). That's why we see 300 headers/s rather than 10.000-50.000 headers/sec (30x-150x difference!) in the really large groups. This would be solved by archives and/or multiple tables, but it can of course be solved otherwise too (only match against existing sets that are close enough "timewise" - then again, this check also takes time).

However, OBVIOUSLY the solution with not clearing the tables and just carrying on will help A LOT, and one don't have to implement everything at once. There is also the "good enough" factor versus breaking something because of complexity - I'm totally with you there.

(As usual, I'm talking without having all the facts so please do correct any faulty assumptions on my part...)

edit: typo
« Last Edit: April 16, 2010, 13:05:33 by thorwak » Logged
thorwak

Posts: 202


View Profile
« Reply #6 on: April 19, 2010, 14:10:28 »

Continuing my babbling as I test stuff out...

It wouldn't be enough with a one huge archive per group, as it turns out. The 50M headers, that took over 48 hours to generate sets for, have another problem: Browsing it is (mostly) fine, it takes a few seconds to return search result. Pretty good. But when I select a download from that group it takes about an HOUR to just collect all the needed parts before it can even start downloading. I guess it has to go through everything again somehow.

One really would have to use several different tables or otherwise completely different ways on selecting data. Even on quite small tables (compared to what archives would be like) mysqltuner reports:

Temporary table size is already large - reduce result set size
Reduce your SELECT DISTINCT queries without LIMIT clauses

I use:
tmp_table_size = 256M
max_heap_table_size = 256M
since otherwise it keeps complaining about how I should increase these.. Of course this is not something one wants to run with, and it doesn't even help, since temp tables are even bigger:

[!!] Temporary tables created on disk: 47% (500 on disk / 1K total)
Note that this from set generating, not downloading new headers...

I have yet to investigate (and play with) the SQL involved to generate sets, but unless one can easily split this up to run many times, in smaller runs (and in a meaningful way that is actually faster in the end) on huge tables the only solution is more tables, from what I know so far. Then again, if it would be as simple as making the results/temp tables a lot smaller by running many, serial, queries the solution could be quite simple  (sets beginning with a, then b, etc etc (overly simplified))

As earlier, mostly reporting finds and thinking out loud. I'm sure this can be solved in an elegant way once one know exactly what is taking time.

It's 99.9% waiting for the DB, the urd/urdd VM is at load 0.00 while db server is at load 1+ while generating sets. It's also not disk I/O, but a RAM and possibly CPU issue, it would seem. It's great up to 5M headers, OK up to 10M, then it starts going straight to hell performance-wise.

edit: tried to use clearer words in a couple of places
« Last Edit: April 19, 2010, 14:37:27 by thorwak » Logged
spearhead
Administrator
*
Posts: 1041


View Profile WWW
« Reply #7 on: April 19, 2010, 18:26:43 »

creating a dl should be quite fast as this takes an insert with a select. The insert could be slowish, but that has nothitg to do with the size of the source tables. There may be a problem with indices tho. Both should have indices on the binaryID. And binarise should have a key on setid. So it should actually be really fast.

This quire in do_adddate() in urdd/do_functions.phh does that.
Code:
$sql = "INSERT INTO downloadarticles (\"downloadID\", \"groupID\", \"status\", \"partnumber\", \"name\", \"messageID\", \"binaryID\", \"size\") "
. "SELECT '$dlid', '$groupid', '$status', \"partnumber\", bin.\"subject\", \"messageID\", par.\"binaryID\", par.\"size\" FROM $parts as par "
                . "LEFT JOIN $binaries as bin ON (bin.\"binaryID\" = par.\"binaryID\") WHERE bin.\"setID\" = '$setid'";

perhaps you can run a describe on the select like this:
Code:
describe SELECT '136', '609', '0', "partnumber", bin."subject", "messageID", par."binaryID", par."size" FROM parts_609 as par LEFT JOIN binaries_609 as bin ON (bin."binaryID" = par."binaryID") WHERE bin."setID" = 'c4f7db6d357f1872c3a05690b5fad13d';

numbers may vary tho.

you should see sth like
Code:
+----+-------------+-------+------+-------------------+--------------+---------+--------------------+------+-------------+
| id | select_type | table | type | possible_keys     | key          | key_len | ref                | rows | Extra       |
+----+-------------+-------+------+-------------------+--------------+---------+--------------------+------+-------------+
|  1 | SIMPLE      | bin   | ref  | PRIMARY,setID_idx | setID_idx    | 152     | const              |    7 | Using where |
|  1 | SIMPLE      | par   | ref  | binaryID_idx      | binaryID_idx | 98      | urddb.bin.binaryID |    1 |             |
+----+-------------+-------+------+-------------------+--------------+---------+--------------------+------+-------------+
2 rows in set (0.01 sec)

Don't forget to run :
Code:
SET sql_mode='ANSI_QUOTES';
first btw


Logged
thorwak

Posts: 202


View Profile
« Reply #8 on: April 22, 2010, 11:33:02 »

I have missed your post for some reason, sorry, will check out what you suggest and get back on this. I have a an inactive installation I can revive to run tests on, the one that took an hour to gather DL data so I can still do this easily.


However...

I've studied the generated SQL a bit now in real time during heavy runs and it's now more clear to me that the spirit of urd is to not worry about the underlying storage/db system (it's a PHP app after all). I respect that, and I see the simplicity approach more clearly now. Smiley

Rather than change that whole approach, which would add much complexity and move much of the logic from the DB to the app (which could fix performance though) and maybe even require rewrite of urdd in C or something to really gain speed and use RAM more effeciently, I suggest going with the archive idea, for those ( Cool ) who really want long retention.

I would like to see the following  (and I'll experiment with this myself if I can get a good enough grip on the whole process, and find the time) :

1) A way to get headers for a group between two dates (or timestamps), historically. Say 1 feb 00:00 - 28 feb 23:59 last year for instance. I haven't looked at the NNTP specs for over a decade but I seem to remember this might actually be difficult, something about how nntp readers are expected to work (get the latest since X days ago?). This isn't absolutely necessary but would make it nicer to index really large groups since otherwise it'll take a very long time just to DL ALL the headers at once. This would be used together with 3) below.

2) Downloading headers with the option of not generating sets afterwards (one can always abort the genset task though so not 100% necessary neither)

3) Creating archive table(s) for a group, and moving headers from the volatile/live group table. This would be done manually so app wouldn't have to be "smart" at all about this, just trust the user. (In the far future this could be automated though)

Say I have a.b.x with 500M headers, reaching 2 years back in time, doesn't matter if set names are generated or not (would take forever so there wouldn't be any sets in reality). I create an "archive group" that I get to choose a fitting name for (a.b.x.2009-02). I then say "move all headers from group a.b.x to a.b.x.2009-02 that were posted between 20090201 and 20090228", for example. Headers are moved, and a.b.x get smaller, and eventually small enough to able to generate the latest sets quickly.

The archive groups are handled just as any other group, just that they are never updated from the usenet server but instead one has to "move" headers there, manually. There will be a "breaking point" of course which will result in broken/incomplete sets. This can be handled by the user by creating overlapping archives. I.e. the next archive group would span from 20090227 to 20090331 for instance. This will initially result in (some) duplicate storage of data, but sets that are broken in the 2009-02 archive will get deleted by the expiring of incomplete sets function. If the overlap is big enough there will of course be sets that are complete in both archives. This is a price one has to pay for having perfect retention and a tradeoff left to the user to decide about. (Far future: a DB job could fairly easily identify complete duplicates and delete one instance, I imagine)

The rest of the functionality is already there - since it's a "normal" group one can choose to search for set names in a particular archive, or in all of them at the same time. Set generating, eventual expiration (when usenet server drops the articles, or if the user thinks 6 months is enough), purging and so on - it would all work as far as I understand. I haven't actually tried yet though Wink

Some handling is needed of course about set names reported to urdland, if user uses that. Easiest to just skip archives (it's old stuff anyway) or report the original group name and not the made up archive name (of course).


The major selling point here is that genset doesn't have to be run before the headers are moved into smaller archive tables. And this would fix all of the performance problems. MySQL currently completely chokes on the huge result set when we are above 10-20M headers, depending both on number of different sets and of course the number of headers. tmp tables get waay to large to ever fit in ram etc. As noted in earlier post twice the amount of headers takes 10 times longer to generate sets for (and also collect DL data, it seems) so this quickly escalates to an unusable state.

With this suggestion, set generation would run in smaller chunks (on the archives) and it'll be fast, yet app simplicity is mostly retained(?), world peace will follow etc.  Cool



Thoughts?



EDIT: typo; mixed up headers and sets at a couple of places in my wording.
EDIT2: Small logic error: To be able to create overlapping archives, the moving has to be a 2-step process of course: Create archives, COPY data from big table to smaller archive tables within specified timestamps, finally delete articles between specified timestamps from big table (or just plain expire, but this is a bit too crude tool to have as only option)
« Last Edit: April 22, 2010, 12:31:23 by thorwak » Logged
spearhead
Administrator
*
Posts: 1041


View Profile WWW
« Reply #9 on: April 22, 2010, 19:15:21 »

First thing I'll do is re-write the generate sets code, to make it more efficient. Then I'll look into other things... archival may actually be more easy to implement running in some scripts from cron. Or a simple php script with 2 database connections
Logged
thorwak

Posts: 202


View Profile
« Reply #10 on: April 22, 2010, 20:59:25 »

Yep was thinking about trying to do my tests outside of the app actually, I just have to figure out exactly what tables to update though.

If it's gonna be more than a special hack I use ppl would probably want to have it as a part of the web UI though..
Logged
thorwak

Posts: 202


View Profile
« Reply #11 on: April 24, 2010, 21:27:01 »

So I decided to stop theorizing and try out something real. I took the group that had 54M headers that took 49 hours to generate sets for, spanning 7-8 months back in time, and broke it up into 7 archives, 2009-09, 2009-10 .. 2010-03.

Depending on traffic volume the archives ended up having between 5M and 10M headers each in them.

The point would be along the reasoning in previous posts:

- Having an archive with complete retention of the group (this one seems to be only 7-8 months old)

- Speeding up the creation of the archive to a managable timeframe

- Being able to update the main group often without having to re-index everything (even though this will be addressed in future versions of URD(?))

- Maybe the results can help in decision making for future development of URD

- Learn about the URD DB and possibly contribute better myself in the future to the project

- For the hell of it Smiley


Basically I create a new table ("archives") that can be used alongside the table "groups". I ended up copying the contents of it to table groups so I wouldn't have to change any code to test out my ideas though.

Then 7 "fake" newsgroups are created, and the contents of the real group is spread over these (a bit over 53M headers). They overlap each other by 24h in both directions to avoid having any unusuable sets.

Then post count is updated and names changed for the fake groups. Finally they are added to the groups list. Then all I had to do was generate sets on them from within the web UI and searching, DLing etc just works Smiley The original group can then be purged, a low retention set and the archive groups never has to be touched again. One can run expire on them every now and then and stuff no longer on the Usenet server will eventually get deleted.

Set generation is currently done for the first 3 archives, and the time consumed follows the previous pattern:
Time for a 5M group: 10 mins 30 secs
Time for a 10M group: 40 minutes


If one would have stuck with 5M per group, it would have been 11 groups instead and taken 110 minutes, just under 2 hours, about 2h 30ms if inclduding the data copying. A lot better than 49 hours Smiley Now I expect it all to finish in roughly 5 hours instead, but that doesn't really matter - the point is that it works and that one never has to index that stuff again (and 90% of the time is still saved). Also, it's now proven the bottleneck is the massive amount of data processed at the same time resulting in huge temporary tables on disk, not the disk IO itself (we even duplicate all of the data!)


Here's a complete dump of all the SQL I did to create the archives for group 302:
Code:
CREATE TABLE archives LIKE groups;

CREATE TABLE parts_3020001 LIKE parts_302;
CREATE TABLE parts_3020002 LIKE parts_302;
CREATE TABLE parts_3020003 LIKE parts_302;
CREATE TABLE parts_3020004 LIKE parts_302;
CREATE TABLE parts_3020005 LIKE parts_302;
CREATE TABLE parts_3020006 LIKE parts_302;
CREATE TABLE parts_3020007 LIKE parts_302;

CREATE TABLE binaries_3020001 LIKE binaries_302;
CREATE TABLE binaries_3020002 LIKE binaries_302;
CREATE TABLE binaries_3020003 LIKE binaries_302;
CREATE TABLE binaries_3020004 LIKE binaries_302;
CREATE TABLE binaries_3020005 LIKE binaries_302;
CREATE TABLE binaries_3020006 LIKE binaries_302;
CREATE TABLE binaries_3020007 LIKE binaries_302;

ALTER TABLE parts_3020001 DISABLE KEYS;
ALTER TABLE parts_3020002 DISABLE KEYS;
ALTER TABLE parts_3020003 DISABLE KEYS;
ALTER TABLE parts_3020004 DISABLE KEYS;
ALTER TABLE parts_3020005 DISABLE KEYS;
ALTER TABLE parts_3020006 DISABLE KEYS;
ALTER TABLE parts_3020007 DISABLE KEYS;

INSERT INTO parts_3020001 SELECT * FROM parts_302 WHERE date > (select UNIX_TIMESTAMP("2009-09-01 00:00:00")) AND date < (select UNIX_TIMESTAMP("2009-10-02 00:00:00"));
INSERT INTO parts_3020002 SELECT * FROM parts_302 WHERE date > (select UNIX_TIMESTAMP("2009-09-29 00:00:00")) AND date < (select UNIX_TIMESTAMP("2009-11-02 00:00:00"));
INSERT INTO parts_3020003 SELECT * FROM parts_302 WHERE date > (select UNIX_TIMESTAMP("2009-10-30 00:00:00")) AND date < (select UNIX_TIMESTAMP("2009-12-02 00:00:00"));
INSERT INTO parts_3020004 SELECT * FROM parts_302 WHERE date > (select UNIX_TIMESTAMP("2009-11-29 00:00:00")) AND date < (select UNIX_TIMESTAMP("2010-01-02 00:00:00"));
INSERT INTO parts_3020005 SELECT * FROM parts_302 WHERE date > (select UNIX_TIMESTAMP("2009-12-30 00:00:00")) AND date < (select UNIX_TIMESTAMP("2010-02-02 00:00:00"));
INSERT INTO parts_3020006 SELECT * FROM parts_302 WHERE date > (select UNIX_TIMESTAMP("2010-01-30 00:00:00")) AND date < (select UNIX_TIMESTAMP("2010-03-02 00:00:00"));
INSERT INTO parts_3020007 SELECT * FROM parts_302 WHERE date > (select UNIX_TIMESTAMP("2010-02-27 00:00:00")) AND date < (select UNIX_TIMESTAMP("2010-04-02 00:00:00"));

ALTER TABLE parts_3020001 ENABLE KEYS;
ALTER TABLE parts_3020002 ENABLE KEYS;
ALTER TABLE parts_3020003 ENABLE KEYS;
ALTER TABLE parts_3020004 ENABLE KEYS;
ALTER TABLE parts_3020005 ENABLE KEYS;
ALTER TABLE parts_3020006 ENABLE KEYS;
ALTER TABLE parts_3020007 ENABLE KEYS;


INSERT INTO archives SELECT * FROM groups WHERE ID = 302;
UPDATE archives SET ID=3020001 WHERE ID = 302;
UPDATE archives SET postcount=(SELECT COUNT(*) FROM parts_3020001) WHERE ID = 3020001;
UPDATE archives SET name=CONCAT((SELECT name FROM groups WHERE ID = 302), ".arch.2009-09") WHERE ID = 3020001;
UPDATE archives SET first_record=0, mid_record=0, last_record=0 where ID = 3020001;

INSERT INTO archives SELECT * FROM groups WHERE ID = 302;
UPDATE archives SET ID=3020002 WHERE ID = 302;
UPDATE archives SET postcount=(SELECT COUNT(*) FROM parts_3020002) WHERE ID = 3020002;
UPDATE archives SET name=CONCAT((SELECT name FROM groups WHERE ID = 302), ".arch.2009-10") WHERE ID = 3020002;
UPDATE archives SET first_record=0, mid_record=0, last_record=0 where ID = 3020002;

INSERT INTO archives SELECT * FROM groups WHERE ID = 302;
UPDATE archives SET ID=3020003 WHERE ID = 302;
UPDATE archives SET postcount=(SELECT COUNT(*) FROM parts_3020003) WHERE ID = 3020003;
UPDATE archives SET name=CONCAT((SELECT name FROM groups WHERE ID = 302), ".arch.2009-11") WHERE ID = 3020003;
UPDATE archives SET first_record=0, mid_record=0, last_record=0 where ID = 3020003;

INSERT INTO archives SELECT * FROM groups WHERE ID = 302;
UPDATE archives SET ID=3020004 WHERE ID = 302;
UPDATE archives SET postcount=(SELECT COUNT(*) FROM parts_3020004) WHERE ID = 3020004;
UPDATE archives SET name=CONCAT((SELECT name FROM groups WHERE ID = 302), ".arch.2009-12") WHERE ID = 3020004;
UPDATE archives SET first_record=0, mid_record=0, last_record=0 where ID = 3020004;

INSERT INTO archives SELECT * FROM groups WHERE ID = 302;
UPDATE archives SET ID=3020005 WHERE ID = 302;
UPDATE archives SET postcount=(SELECT COUNT(*) FROM parts_3020005) WHERE ID = 3020005;
UPDATE archives SET name=CONCAT((SELECT name FROM groups WHERE ID = 302), ".arch.2010-01") WHERE ID = 3020005;
UPDATE archives SET first_record=0, mid_record=0, last_record=0 where ID = 3020005;

INSERT INTO archives SELECT * FROM groups WHERE ID = 302;
UPDATE archives SET ID=3020006 WHERE ID = 302;
UPDATE archives SET postcount=(SELECT COUNT(*) FROM parts_3020006) WHERE ID = 3020006;
UPDATE archives SET name=CONCAT((SELECT name FROM groups WHERE ID = 302), ".arch.2010-02") WHERE ID = 3020006;
UPDATE archives SET first_record=0, mid_record=0, last_record=0 where ID = 3020006;

INSERT INTO archives SELECT * FROM groups WHERE ID = 302;
UPDATE archives SET ID=3020007 WHERE ID = 302;
UPDATE archives SET postcount=(SELECT COUNT(*) FROM parts_3020007) WHERE ID = 3020007;
UPDATE archives SET name=CONCAT((SELECT name FROM groups WHERE ID = 302), ".arch.2010-03") WHERE ID = 3020007;
UPDATE archives SET first_record=0, mid_record=0, last_record=0 where ID = 3020007;

INSERT INTO groups SELECT * FROM archives WHERE ID > 3020000;

Some studying of the steps that actually took some time to execute (the rest took milliseconds to perform)
Code:
mysql> select count(*) from parts_302;
+----------+
| count(*) |
+----------+
| 54070376 |
+----------+
1 row in set (0.00 sec)

mysql> INSERT INTO parts_3020001 SELECT * FROM parts_302 WHERE date > (select UNIX_TIMESTAMP("2009-09-01 00:00:00")) and date < (select UNIX_TIMESTAMP("2009-10-02 00:00:00"));
Query OK, 5243535 rows affected (2 min 56.08 sec)
Records: 5243535  Duplicates: 0  Warnings: 0


mysql> INSERT INTO parts_3020002 SELECT * FROM parts_302 WHERE date > (select UNIX_TIMESTAMP("2009-09-29 00:00:00")) and date < (select UNIX_TIMESTAMP("2009-11-02 00:00:00"));
Query OK, 10050913 rows affected (3 min 17.82 sec)
Records: 10050913  Duplicates: 0  Warnings: 0


mysql> INSERT INTO parts_3020003 SELECT * FROM parts_302 WHERE date > (select UNIX_TIMESTAMP("2009-10-30 00:00:00")) and date < (select UNIX_TIMESTAMP("2009-12-02 00:00:00"));
Query OK, 8948429 rows affected (3 min 2.94 sec)
Records: 8948429  Duplicates: 0  Warnings: 0


mysql> INSERT INTO parts_3020004 SELECT * FROM parts_302 WHERE date > (select UNIX_TIMESTAMP("2009-11-29 00:00:00")) and date < (select UNIX_TIMESTAMP("2010-01-02 00:00:00"));
Query OK, 8157749 rows affected (2 min 55.27 sec)
Records: 8157749  Duplicates: 0  Warnings: 0


mysql> INSERT INTO parts_3020005 SELECT * FROM parts_302 WHERE date > (select UNIX_TIMESTAMP("2009-12-30 00:00:00")) and date < (select UNIX_TIMESTAMP("2010-02-02 00:00:00"));
Query OK, 7636793 rows affected (3 min 33.79 sec)
Records: 7636793  Duplicates: 0  Warnings: 0


mysql> INSERT INTO parts_3020006 SELECT * FROM parts_302 WHERE date > (select UNIX_TIMESTAMP("2010-01-30 00:00:00")) and date < (select UNIX_TIMESTAMP("2010-03-02 00:00:00"));
Query OK, 5661715 rows affected (3 min 20.94 sec)
Records: 5661715  Duplicates: 0  Warnings: 0

mysql> INSERT INTO parts_3020007 SELECT * FROM parts_302 WHERE date > (select UNIX_TIMESTAMP("2010-02-27 00:00:00")) AND date < (select UNIX_TIMESTAMP("2010-04-02 00:00:00"));
Query OK, 7794130 rows affected (4 min 8.94 sec)
Records: 7794130  Duplicates: 0  Warnings: 0

mysql> select sum(postcount) from archives;
+----------------+
| sum(postcount) |
+----------------+
|       53493264 |
+----------------+
1 row in set (0.00 sec)
Around 23 minutes to copy all data, 53.4M headers.



Code:
mysql> ALTER TABLE parts_3020001 ENABLE KEYS;
Query OK, 0 rows affected (3 min 3.56 sec)

mysql> ALTER TABLE parts_3020002 ENABLE KEYS;
Query OK, 0 rows affected (2 min 3.60 sec)

mysql> ALTER TABLE parts_3020003 ENABLE KEYS;
Query OK, 0 rows affected (1 min 40.53 sec)

mysql> ALTER TABLE parts_3020004 ENABLE KEYS;
Query OK, 0 rows affected (1 min 31.87 sec)

mysql> ALTER TABLE parts_3020005 ENABLE KEYS;
Query OK, 0 rows affected (1 min 25.87 sec)

mysql> ALTER TABLE parts_3020006 ENABLE KEYS;
Query OK, 0 rows affected (1 min 2.90 sec)

mysql> ALTER TABLE parts_3020007 ENABLE KEYS;
Query OK, 0 rows affected (0.00 sec)
(oops, seems I forgot to disable keys on table 7...)
Time for building indexes: Around 11 minutes


Logged
spearhead
Administrator
*
Posts: 1041


View Profile WWW
« Reply #12 on: April 25, 2010, 00:11:45 »

I implemented the algorithm partially. IE the binaries table is updated, the setdata is still recreated. Pls see if this help.

Note1: use the install/update_db_1.0.4_to_1.0.5.php script to update the table first
Note2: it's very experimental code and hasn't been tested much really.
Logged
thorwak

Posts: 202


View Profile
« Reply #13 on: April 25, 2010, 11:00:11 »

Wee, new toys Smiley

I'll test it out. It may be a while before I have anything to report since the upgrade alone may take many hours.. the alter table has been running for half an so far on that 302 table, and the total DB is over 50 GB. Maybe I should have started with something smaller Cheesy

Will study the code meanwhile.. From what I understand from a quick glance the initial set creation won't be faster, but the ones following will? No need to go over old grounds?

Have you considered limiting (in time, like I did with the archives) how many headers are processed at once? Even if not splitting the table once could easily do (for instance) overlapping month by month, week by week or whatever. That should be plenty fast anyway..?

(Like I said, I haven't even looked at your update properly yet, just bored staring at my busy DB  Cool )

Thanks for sharing the work btw! Of course I understand everything may crash and burn - it's SVN code..
Logged
spearhead
Administrator
*
Posts: 1041


View Profile WWW
« Reply #14 on: April 25, 2010, 11:07:17 »

the idea is to mark all new parts when dling them (dirty flag). Then only recreate binaries that have dirty parts and mark all these dirty as well. Then also recreate only sets that are dirty.

Now the sets part is also nearly finished and I found a glitch in the binaries parts (remove dirty mark.)

I could fit in a time stamp, but I don't think it will be necessary as the dirty is effectively a timestamp... unless you don't do an update for a really long time.
Logged
thorwak

Posts: 202


View Profile
« Reply #15 on: April 25, 2010, 11:11:18 »

I was thinking along the lines of grabbing everything there is, backwards in time, from the Usenet server. It'll be like an update that hasn't been run for 600+ days in my case, and this is actually what I'm after  Cool

Another solution would be (like I suggested in another post) if one could choose to get headers like "everything older than 600 days", run gensets, then get headers like "now I want 600 days - 580 days" and so on.

Get my idea? Is this supported by the NNTP protocol?


edit: typos
« Last Edit: April 25, 2010, 11:14:57 by thorwak » Logged
spearhead
Administrator
*
Posts: 1041


View Profile WWW
« Reply #16 on: April 25, 2010, 11:20:02 »

Yea I got the idea. But even in your case - if that would work with the server the dirty flag will catch this. Since only  the headers you got are marked dirty and sets will be generated for those

RFC 3977 (NNTP) does support this basically by the newnews command. But I haven't seen a server yet that supports this. I think most are stuck on the old rfc anyway. RFC 977. Which is really old. The newer over command isn't supported either, most of the time. The standard xover command only knows of messageIDs (the ones you see in the NZB) and article numbers (which are server specific).
Logged
spearhead
Administrator
*
Posts: 1041


View Profile WWW
« Reply #17 on: April 25, 2010, 11:24:31 »

I committed the whole update now
Logged
thorwak

Posts: 202


View Profile
« Reply #18 on: April 25, 2010, 11:29:44 »

I think ppl would find this possibility useful... However - don't work on things you don't believe in yourself Smiley There is nothing stopping me from doing this manually, and there could even be "third party tools" for URD in the future.

For instance, one could have a "bulk header" URD installation that sits there for days until it finally has downloaded those billions of headers in a certain group. Then it could feed those into the production DB table in nice chunks of 5-10 M at a time and trigger a gensets Cool

I'll be interesting to see what other bottle necks there may be when one gets that far. I have already started seeing that loading the general sets list is getting more sluggish when one has MANY sets. Eventually, it may be needed to limit the amount of sets returned by  a search (user should learn to search more precise anyway Wink ) Maybe just the stuff from the last few days or so needs to be displayed before an active search has been performed.



---
And you replied while I was writing, so here is reply to that:

Wouldn't everything I get be dirty ( Cheesy sorry ) the first time around though, so my case is actually not helped? The problem is that if I set the retention to 10 days, I can't go raise to 20 and expect to get older stuff. I have to do the full 600+ at once, and hell will freeze over before that stuff get indexed.. Or am I being really stupid here?

The workaround I suggest above though would save me however.. I could even hack it up so it just fetches all the headers to a special bulk table and then moves stuff from there every now and then..
Logged
thorwak

Posts: 202


View Profile
« Reply #19 on: April 25, 2010, 11:32:06 »

I committed the whole update now

Cool! Will check it out. (DB hasn't changed again, right? Cheesy )
Logged
thorwak

Posts: 202


View Profile
« Reply #20 on: April 25, 2010, 11:40:19 »

Thinking ahead even longer.. Fetching to a bulk table, and then moving from there once 5M (for example) articles is gotten would actually make it possible to generate sets as headers are being downloaded, at he same time.. That would be awesome. Even if the the complete thing would take hours or days, once could start seeing complete sets and DL them within minutes even in the insanely large groups.

Sorry that I keep "moving the goal" all the time, it's just how my mind works.. Feel free to ignore me when I go too far Wink
Logged
spearhead
Administrator
*
Posts: 1041


View Profile WWW
« Reply #21 on: April 25, 2010, 11:44:13 »

No the db hasn't changed. (would be interesting to check whether adding a key to dirty would speed up or slow things down... I don't keep large databases, but perhaps you can fiddle with that, by setting a key on the parts_XX and binaries_XX tables manually.)

Another BTW you might want to change the type of the dirty flag to tinyint if you're using mysql it would save some space. Psql doesn't understand this, so in the update script I set it to small int, in the db code when subscribing it is tinyint.

And you're correct. On the initial go it would make no difference, or even be slower, as I think the new algorithm is less efficient when the dirty percentage is near 100. But then again I might be wrong since it lacks the "order by" part now. Only some intensive testing will tell really. Cheesy
You may also want to fiddle a bit with GENSETS_STEPSIZE in functions/defines.php the default is 4000.

The problem is, I'm not sure what you are trying to achieve. If you want to have a complete past set of group data, you would have to create sets to make it useful for URD.
And esp with the new algorithm I don't really see that feeding it all at once or all in bulk would make much difference, but then again I might be wrong. IO may be an issue, but now the sorting is gone I guess it won't use as many temporary tables anymore.


Logged
thorwak

Posts: 202


View Profile
« Reply #22 on: April 25, 2010, 11:58:48 »

Hey that is a great idea - marking everything clean, and then 5M or so at a time as dirty will do the same thing, if we assume the problem is mainly the large result sets (which is the idea I've been pushing). I'll do that, I bet that'll work.

Ok, sorry I'm not always so clear, even more difficult in English sometimes.


Here's what I'd like to see working:

Giganews has 600+ days retention. AFAIK, I can get headers for ALL of that, historically. I have already seen it work for well over half a year (those 54M headers were gotten in a few hours, with header compression)

Now, if were to generate sets for all of those at once, it takes 49 hours (I tried this, as you know, just to see how long it would actually take.)

However, when I did the exact same thing, same amount of headers in my experiment yesterday, from scratch, I got it down to less than 10% of that time, and it could have been even faster (5M headers seems to be the sweet spot on my rig).

So yes, I want to create complete history. Of course I have to generate sets. And that is fine - it's just that updating the binaries_* tables slows down to a crawl when I do too many headers at once, because of the way MySQL (any most SQL DBs I would guess) works.


The "bulk talk" was meant like this: Download ALL headers to a bulk table. Don't do ANYTHING with that table except store data in it. Then, when we have enough, move 5 M headers to the "real" group (parts_xxxx). Mark it dirtt. Run set generation on this. Mark it clean. It'll take 10 minutes. When 5 M more headers are ready in the "bulk table", move them there, marking the new ones dirty. gensets. mark clean. 10 minutes.

We could do 50M headers in 2 hours this way, rather than 48 hours. Sure, there woud be slight IO overhead, but that the user can address (cache, fast disks).


Am I making sense? Smiley It would be like normal updates as far as URD is concerned, only fast-forwarding through time Smiley
« Last Edit: April 25, 2010, 12:01:09 by thorwak » Logged
spearhead
Administrator
*
Posts: 1041


View Profile WWW
« Reply #23 on: April 25, 2010, 12:10:59 »

I would be interested if the new algorithm is faster first.

Basically now that I thought it over a bit and looked at the code, the old one does an order by. On a large table, that is really crawls, as it probably uses a huge temp table. Hence the high Temporary tables created on disk % for mysqltuner.

For the new code I ditched this order, and restored a good old limit by.

Anyway, the larger the db grows, the slower queries will run in general, but the way I want to fix stuff is by optimising the code, indexes, queries etc. I don't want to delve into directions that has complications in tons of other parts of the code. Hence I am quite reluctant to split up the tables. I did that once, when we moved from a general parts and binaries table to group specific tables. And that was a lot of tricky work. It means that many parts of urd have to updated too and that will destabilize the code (browse, download, update/expire/purge/gensets) .
Logged
thorwak

Posts: 202


View Profile
« Reply #24 on: April 25, 2010, 12:15:41 »

I think what I will do is this, for now:

Start a new URD installation + (hey I have 3 already so Wink )

I'll test this new version by moving those 54M headers in 5M chunks from the other DB, running genset in between. This way, I'll be testing what a "normal" user would see happening over time (tables growing in sensible steps), and I guess that is more useful now (to find any bugs in the the new code). The other experimenting is tempting too, but I guess one can't do everything Wink Besides, I've been home sick this week (some weird inflammation) so I've had more time lately than in a long time, that will change tomorrow.

As for my own historic DB, I can build it gradually for now by having a bulk installation of URD that never generates sets, just downloads headers and then feeds them to the production DB gradually Smiley


So, this will mean I will basically test the code "as is". I feel reluctant to run a full set generation on 54M headers again at once since there is no way of telling how it's actually doing timewise. 1 hour? A week? Cheesy But perhaps, if it seems a LOT faster in general. I'll test it on some 10M headers tables shortly.
Logged
Pages: [1] 2 3
  Print  
 
Jump to:  

Powered by SMF 1.1.11 | SMF © 2006-2009, Simple Machines LLC
Amigri by Fakdordes
spacer.png, 0 kB
spacer.png, 0 kB
spacer.png, 0 kB