Find Duplicates and Delete

Postby Carloche on Wed Sep 14, 2005 9:09 pm

This macro lets you quickly remove duplicate entries from a file. Files such as email lists and dictionary files can be cleaned of all duplicate entries in seconds with a single run of the FindDup macro.
Attachments
FindDup.s
The aforementioned file, FindDup.s
(1.66 KiB) Downloaded 1116 times
Carl Hall

Requires pre-sort, no?

Postby EEAnderson on Fri Sep 16, 2005 3:40 am

Unless I am mistaken, this macro would require that the file be presorted so that all duplicate lines are grouped together.

Of course, once the file is sorted, the macro itself runs in roughly O(n). An unsorted, brute-force comparison of every line against every other line would be closer to O(n^2). :oops:

Perhaps an initial sort at the top of the macro?
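
To illustrate the "sort first, then one pass" idea, here is a minimal sketch. It is written in Python rather than CMac, and the function name sort_then_dedupe is made up for this example; it is not taken from FindDup.s. Note that the sort itself costs about O(n log n), so that step dominates, and the original line order is lost.

```python
from itertools import groupby

def sort_then_dedupe(lines):
    """Sort the lines, then keep one line from each run of equal neighbours."""
    ordered = sorted(lines)                          # groups equal lines together
    return [key for key, _run in groupby(ordered)]   # single pass over the runs

print(sort_then_dedupe(["bob@example.com", "amy@example.com", "bob@example.com"]))
# prints ['amy@example.com', 'bob@example.com']
```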

Just a thought,
EEA

Another version found on TWiki site

Postby EEAnderson on Fri Sep 16, 2005 3:54 am

Interestingly, I found the following macro on the TWiki site for ME

http://www.multieditsoftware.com/twiki/pub/Main/UserScriptLibrary/dups.s

It appears to have been written by Andy Colson.

Code:
void removedups()
{
   int start = c_line;

   while (true)
   {
      str s = get_line();
      down;
      while (Find_Text(s, 0, 0))
      {
         del_line;
      }
      start = start + 1;
      goto_line(start);
      if (at_eof)
      {
         break;
      }
   }
}
Regards,
EEA

Postby AndyColson on Fri Sep 16, 2005 4:54 pm

Ya know... Looking at that macro, I think it'd have to be sorted to work correctly.

Especially this:

Code:
      if ( Line1 == Line2 )
      {
         ++Count;
         Up;
         Del_Line;      //   More information can be found on Del_Line on page 220 of the CMac Users Guide
      }


If we find a match, we go up a line and Del_Line. That only deletes the right line if the two matching lines are adjacent, which is only guaranteed when the file is sorted.

I made myself a little test file like:

aa
bb
aa
cc

and it didn't seem to find any dups at all. When it's sorted, it finds one dup.

Also, I note that it doesn't do the entire file. It uses the line that the cursor is on and only finds dups of that line below the cursor.
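
The behaviour on that test file can be reproduced with a small neighbour-only comparison. This is a Python stand-in for the idea in the snippet above, not a translation of FindDup.s; the function name delete_adjacent_dups is invented for the example.

```python
def delete_adjacent_dups(lines):
    """Compare each line with the one kept before it; drop it if they are equal."""
    kept, count = [], 0
    for line in lines:
        if kept and kept[-1] == line:
            count += 1          # equal to the previous kept line: a duplicate
        else:
            kept.append(line)
    return kept, count

test_file = ["aa", "bb", "aa", "cc"]
print(delete_adjacent_dups(test_file))          # (['aa', 'bb', 'aa', 'cc'], 0)
print(delete_adjacent_dups(sorted(test_file)))  # (['aa', 'bb', 'cc'], 1)
```

Unsorted, the two "aa" lines are never neighbours, so nothing is removed; sorted, one duplicate is found.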

-Andy

Postby AndyColson on Fri Sep 16, 2005 4:59 pm

Hum... Looking at my old macro (removedups)... I don't think it's right either.

if you had a line like:

'aa'

then a line like

'this line contains aa as well'

they'd be seen as matching. My find_text call is a little too generic. It should probably do a regEx on ^s$ (to make sure the entire line matches from beginning of line to end of line).
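
The difference between a substring search and an anchored whole-line match can be sketched as follows. This is Python, not CMac, and both function names are hypothetical; the one extra point it shows is that the search text should be escaped before it is wrapped in anchors, otherwise characters like '.' or '*' in the line act as regex operators.

```python
import re

def substring_match(needle, line):
    """True if needle occurs anywhere in line (the overly generic search)."""
    return needle in line

def whole_line_match(needle, line):
    """True only if the entire line equals needle, via an anchored regex."""
    # re.escape keeps characters such as '.' or '*' in the needle literal;
    # re.fullmatch anchors the pattern at both ends, like ^...$
    return re.fullmatch(re.escape(needle), line) is not None

print(substring_match("aa", "this line contains aa as well"))   # True
print(whole_line_match("aa", "this line contains aa as well"))  # False
print(whole_line_match("aa", "aa"))                             # True
```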

-Andy

Postby CharlesG on Tue Dec 06, 2005 9:43 pm

This file will remove all duplicates. If you find it doesn't, please let me know.
Attachments
DelDups.s
(1016 Bytes) Downloaded 1048 times

Postby Ernie Zapata on Wed Dec 07, 2005 12:39 pm

This macro seems to remove all blank lines and does not appear to handle lines beginning with tab characters. Take for example the following data:
Code:
This is not a test.
This is not a test 0.

   This is a test.
   This is a test.

This is not a test 1.
This is not a test 2.
This is not a test 3.
This is not a test 4.

   This is a test.
This is not a test 5.
This is not a test 6.
This is not a test 7.
This is not a test 8.

   This is a test.


I have attached the data file as a zip file.

All blank lines in the example have no white-space characters, just blank lines. Those lines with the text "This is a test." begin with a tab character, not spaces.

Running the macro against this sample data results in all blank lines being removed and none of the duplicate "<tab>This is a test." lines being removed.
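
For comparison, here is a rough sketch of the behaviour I would have expected on this data: blank lines left alone, and lines compared exactly, leading tab included. It is Python rather than CMac, the name dedupe_keep_first is made up, and keeping blank lines is my assumption about the desired result, not something DelDups.s is documented to do.

```python
def dedupe_keep_first(lines):
    """Keep the first occurrence of each non-blank line; leave blank lines alone."""
    seen = set()
    result = []
    for line in lines:
        if line == "":
            result.append(line)   # assumption: blank lines are separators, keep them
            continue
        if line in seen:          # exact comparison, so a leading tab is significant
            continue
        seen.add(line)
        result.append(line)
    return result

sample = [
    "This is not a test.",
    "",
    "\tThis is a test.",
    "\tThis is a test.",
    "",
    "This is not a test 1.",
    "\tThis is a test.",
]
for line in dedupe_keep_first(sample):
    print(repr(line))
```

With this sample, only the first "<tab>This is a test." line survives and both blank lines stay in place.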
Attachments
data.zip
Sample data file
(298 Bytes) Downloaded 1163 times

Postby deleyd on Mon Apr 03, 2006 12:02 am

I just released my EDX 3.0 package which includes EDX NWS Sort. You can select
  • Mark Lines with Duplicate Keys
  • Weed Out Duplicate Keys
  • Delete ALL Lines with Duplicate Keys
  • Keep ONLY the Lines with Duplicate Keys
  • Summary Sort: Keep Keys and counts only
as well as do an ordinary sort with multiple keys mixed ascending/descending.
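
To make the difference between those options concrete, here is a small Python sketch of how I read four of the option names. It is only an interpretation: it is not EDX code, the function names are invented, it treats the whole line as the key, and the real EDX NWS Sort may behave differently (for one thing, it sorts, and these functions mostly keep input order).

```python
from collections import Counter

def weed_out(lines):
    """Keep one line per key (first occurrence wins)."""
    return list(dict.fromkeys(lines))

def delete_all_with_dups(lines):
    """Drop every line whose key occurs more than once."""
    counts = Counter(lines)
    return [line for line in lines if counts[line] == 1]

def keep_only_dups(lines):
    """Keep only the lines whose key occurs more than once."""
    counts = Counter(lines)
    return [line for line in lines if counts[line] > 1]

def summary(lines):
    """Return (key, count) pairs in sorted key order."""
    return sorted(Counter(lines).items())

data = ["b", "a", "b", "c"]
print(weed_out(data))              # ['b', 'a', 'c']
print(delete_all_with_dups(data))  # ['a', 'c']
print(keep_only_dups(data))        # ['b', 'b']
print(summary(data))               # [('a', 1), ('b', 2), ('c', 1)]
```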

The EDX 3.0 package is at http://www.multieditsoftware.com/forums/viewtopic.php?p=1877#1877

(EDX NWS Sort is a major overhaul of NWS Sort submitted by Bret Sutton)

Postby CharlesG on Wed Apr 12, 2006 5:13 pm

Hi there,

The attached deldups.s should remove all duplicate lines. Please let me know if it doesn't work for anyone...
Attachments
DelDups.s
(1.21 KiB) Downloaded 1128 times

Postby CharlesG on Mon May 15, 2006 5:54 am

The attached DelDups.s works but seems to hang when processing:

5/11/2006 00:04 69,129,209 tcmd32.out

It has 827676 lines. I can place it somewhere if people need it ....
Attachments
DelDups.s
(1.44 KiB) Downloaded 1156 times

Postby AndyColson on Mon May 15, 2006 1:55 pm

The problem could be that it's just really slow. How long have you let it run?

With 827,676 lines, and each line comparing to everything below it:

line 1 would compare 827,676 times.
line 2 would compare 827,675 times.
line 3 would compare 827,674 times.
... etc

So we sum those all up.
Given: 1 + 2 + 3 + 4 + . . . . + N = (1 + N)*(N/2)

(from http://mathforum.org/library/drmath/view/57919.html ... yes, I had to look it up :-) )

You would have a total of (1 + 827,676) * (827,676 / 2) = 827677 * 413838 = 342,524,194,326 comparisons.

That might take a while...
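
The arithmetic above can be checked with a few lines of Python (used here only as a calculator; the variable and function names are made up). Strictly speaking, a line has only the lines below it to compare against, so the exact count for that scheme would be N*(N-1)/2, which is smaller by N and still the same order of magnitude.

```python
def pairwise_comparisons(n):
    """Count comparisons when each line is checked against every line below it."""
    return sum(n - i for i in range(1, n + 1))   # line i has n - i lines below it

n = 827_676
estimate = (1 + n) * n // 2      # the 1 + 2 + ... + N figure used above
below_only = n * (n - 1) // 2    # each line against the lines below it only

print(f"{estimate:,}")    # 342,524,194,326
print(f"{below_only:,}")  # 342,523,366,650
print(pairwise_comparisons(4))  # 6, matching 4 * 3 // 2
```

Either way it is a few hundred billion comparisons for a file of this size.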

-Andy

Postby CharlesG on Sat May 20, 2006 12:15 am

Does this work with the v9.0? product stream?

deleyd wrote:I just released my EDX 3.0 package which includes EDX NWS Sort. You can select
  • Mark Lines with Duplicate Keys
  • Weed Out Duplicate Keys
  • Delete ALL Lines with Duplicate Keys
  • Keep ONLY the Lines with Duplicate Keys
  • Summary Sort: Keep Keys and counts only
as well as do an ordinary sort with multiple keys mixed ascending/descending.

The EDX 3.0 package is at http://www.multieditsoftware.com/forums/viewtopic.php?p=1877#1877

(EDX NWS Sort is a major overhaul of NWS Sort submitted by Bret Sutton)

Postby deleyd on Sat May 20, 2006 5:55 am

Yes, EDX is currently for Multi-Edit 9.0 and 9.10. There will also be a version for the new Multi-Edit version 10 (called ME2006, I think; beta testing started just yesterday).

Postby CharlesG on Tue Jul 28, 2009 3:08 am

Has anyone found a problem with my latest deldups.s macro - here?

http://www.multiedit.com/forums/viewtop ... =4866#4866