csv-join-similarity
4 months ago: remove cr or lf from data to fix csv export [master]
Dobrica Pavlinusic [Fri, 8 Dec 2023 18:33:56 +0000 (19:33 +0100)]
remove cr or lf from data to fix csv export

4 months ago: create file using Text::CSV
Dobrica Pavlinusic [Fri, 8 Dec 2023 18:32:04 +0000 (19:32 +0100)]
create file using Text::CSV

4 months ago: check combined row 3
Dobrica Pavlinusic [Fri, 8 Dec 2023 17:53:22 +0000 (18:53 +0100)]
check combined row 3

4 months ago: check if old duplicate is longer and keep it
Dobrica Pavlinusic [Fri, 8 Dec 2023 10:42:39 +0000 (11:42 +0100)]
check if old duplicate is longer and keep it

not found in this dataset

4 months ago: "quote," if , is in data
Dobrica Pavlinusic [Fri, 8 Dec 2023 09:18:40 +0000 (10:18 +0100)]
"quote," if , is in data

4 months ago: unac_string column names
Dobrica Pavlinusic [Fri, 8 Dec 2023 09:12:50 +0000 (10:12 +0100)]
unac_string column names

4 months ago: added broj_valova as last column in merged.csv
Dobrica Pavlinusic [Fri, 8 Dec 2023 09:10:27 +0000 (10:10 +0100)]
added broj_valova as last column in merged.csv

4 months ago: remove .000000 from values
Dobrica Pavlinusic [Fri, 8 Dec 2023 08:58:49 +0000 (09:58 +0100)]
remove .000000 from values

5 months ago: duplicate-$val.csv
Dobrica Pavlinusic [Mon, 27 Nov 2023 10:17:38 +0000 (11:17 +0100)]
duplicate-$val.csv

5 months ago: duplicate.csv
Dobrica Pavlinusic [Mon, 27 Nov 2023 09:52:21 +0000 (10:52 +0100)]
duplicate.csv

5 months ago: dump more info about duplicate input rows
Dobrica Pavlinusic [Sun, 26 Nov 2023 17:35:34 +0000 (18:35 +0100)]
dump more info about duplicate input rows

5 months ago: add _val to column names to make them unique
Dobrica Pavlinusic [Thu, 23 Nov 2023 20:14:35 +0000 (21:14 +0100)]
add _val to column names to make them unique

5 months ago: cleanup ids correctly (only \w and \d allowed)
Dobrica Pavlinusic [Wed, 22 Nov 2023 13:50:24 +0000 (14:50 +0100)]
cleanup ids correctly (only \w and \d allowed)

5 months ago: cleanup after merge, produce valid output
Dobrica Pavlinusic [Wed, 22 Nov 2023 11:33:47 +0000 (12:33 +0100)]
cleanup after merge, produce valid output

5 months ago: dump merged.csv
Dobrica Pavlinusic [Wed, 22 Nov 2023 11:27:41 +0000 (12:27 +0100)]
dump merged.csv

5 months ago: cleanup output, maintain merged $data
Dobrica Pavlinusic [Wed, 22 Nov 2023 09:21:19 +0000 (10:21 +0100)]
cleanup output, maintain merged $data

5 months ago: val
Dobrica Pavlinusic [Wed, 22 Nov 2023 08:36:03 +0000 (09:36 +0100)]
val

5 months ago: collect A_ counts (original data stats) only on first loop
Dobrica Pavlinusic [Tue, 21 Nov 2023 16:14:38 +0000 (17:14 +0100)]
collect A_ counts (original data stats) only on first loop

5 months ago: corrupt razred/skola
Dobrica Pavlinusic [Tue, 21 Nov 2023 09:01:15 +0000 (10:01 +0100)]
corrupt razred/skola

5 months ago: 0.7 is ok, 0.6 is too random
Dobrica Pavlinusic [Thu, 16 Nov 2023 14:15:55 +0000 (15:15 +0100)]
0.7 is ok, 0.6 is too random

5 months ago: sort candidates by score, and if same prefer longer one
Dobrica Pavlinusic [Thu, 16 Nov 2023 11:55:54 +0000 (12:55 +0100)]
sort candidates by score, and if same prefer longer one

5 months ago: try all limits from 0.9 in descending order
Dobrica Pavlinusic [Thu, 16 Nov 2023 10:39:20 +0000 (11:39 +0100)]
try all limits from 0.9 in descending order

5 months ago: env LIMIT=0.9 is default
Dobrica Pavlinusic [Wed, 15 Nov 2023 09:03:17 +0000 (10:03 +0100)]
env LIMIT=0.9 is default

5 months ago: check duplicate before merge
Dobrica Pavlinusic [Tue, 14 Nov 2023 22:45:40 +0000 (23:45 +0100)]
check duplicate before merge

5 months ago: cleanup, merge only non-duplicate val for keys
Dobrica Pavlinusic [Tue, 14 Nov 2023 21:20:29 +0000 (22:20 +0100)]
cleanup, merge only non-duplicate val for keys

5 months ago: merge at end
Dobrica Pavlinusic [Tue, 14 Nov 2023 19:45:39 +0000 (20:45 +0100)]
merge at end

5 months ago: cleanup, collect unique_id
Dobrica Pavlinusic [Tue, 14 Nov 2023 11:04:51 +0000 (12:04 +0100)]
cleanup, collect unique_id

5 months ago: similarity 0.9, merge all suggestions
Dobrica Pavlinusic [Tue, 14 Nov 2023 09:43:07 +0000 (10:43 +0100)]
similarity 0.9, merge all suggestions

5 months ago: similarity on all keys
Dobrica Pavlinusic [Tue, 14 Nov 2023 09:31:36 +0000 (10:31 +0100)]
similarity on all keys

5 months ago: similarity, forward progress only
Dobrica Pavlinusic [Tue, 14 Nov 2023 09:00:04 +0000 (10:00 +0100)]
similarity, forward progress only

5 months ago: similarity
Dobrica Pavlinusic [Tue, 14 Nov 2023 06:55:19 +0000 (07:55 +0100)]
similarity

5 months ago: first cut
Dobrica Pavlinusic [Mon, 13 Nov 2023 21:48:21 +0000 (22:48 +0100)]
first cut