Most common repeated strings

Talk > Purely Programmers

Join LibraryThing to post.


This topic is marked as "dormant": the last message is more than 90 days old. You can revive it by posting a reply.

Jul 25, 2009, 10:19 am

I'm looking for an easy way to identify the most common repeated strings in some code. I've got a big hunk of HTML and JavaScript with many repetitions of very long class, id and function names (e.g., "LibraryThing.collections.toggleBookForCollectionFromMenu"). I want to simplify it by making them smaller.

I'd love a script, preferably in PHP, that showed the longest such strings, so I can approach the issue systematically. I'm guessing this is an easy problem, and that it's baked into algorithms like ZIP. I'm sure there's an elegant algorithm for it.

Anyone have a good solution?

Jul 25, 2009, 1:08 pm

I tried it over on StackOverflow. The solution is VERY slow. I've got my own, longer and more annoying one that's a little faster, mostly because it goes word by word.

Jul 25, 2009, 1:12 pm

Sounds like a variant on
Not very easy to implement in the most general case.

In your case, however, you can use something easier, as you probably want strings matching [a-zA-Z0-9.]+ and in that case some kind of brute-force regex+hash solution will suffice. Do you have some example input?
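The brute-force regex+hash approach can be sketched as follows. This is an illustrative sketch in Python rather than the PHP the OP asked for; the function name, parameters, and exact character class are my assumptions, not from the thread. The "hash" here is just a counter keyed by the matched string.

```python
import re
from collections import Counter

def repeated_identifiers(source, min_count=2):
    """List identifier-like strings occurring at least min_count times,
    longest first. Hypothetical sketch of the regex+hash idea."""
    # Match a letter followed by letters, digits, or dots, e.g.
    # "LibraryThing.collections.toggleBookForCollectionFromMenu".
    counts = Counter(re.findall(r'[a-zA-Z][a-zA-Z0-9.]*', source))
    return sorted((s for s, n in counts.items() if n >= min_count),
                  key=len, reverse=True)
```

Sorting by the product of length and count instead of plain length would rank strings by total bytes they occupy, which may be closer to what the OP wants.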

Jul 27, 2009, 7:39 pm

There is a solution to a similar problem in Jon Bentley's Programming Pearls, Column 15. He uses a data structure called a suffix array to create a very efficient solution for plain text. You might need to extract what you're interested in from the program text first, but Bentley's solution should work.

Column 15 of the book is available on the book's website:
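Bentley's actual program is in C on that site; as a rough illustration of the suffix-array idea (not his tuned implementation), here is a minimal Python sketch. It naively sorts all suffixes, which is O(n² log n) in the worst case, but it shows the core trick: after sorting, the longest repeated substring is the longest common prefix of some pair of adjacent suffixes.

```python
def longest_repeated_substring(s):
    """Longest substring of s that occurs at least twice, via a
    (naively built) suffix array."""
    suffixes = sorted(s[i:] for i in range(len(s)))
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        # Length of the common prefix of two adjacent sorted suffixes.
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        if n > len(best):
            best = a[:n]
    return best
```

For the "most common" variant the OP wants, one would count how many adjacent suffixes share each prefix rather than just track the longest, but the data structure is the same.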


Edited: Jul 28, 2009, 4:16 pm

Hmm. I suspect that the normal unix tools are up for this job, unless you have gigabytes of code to analyze. We don't need no stinkin' elegance!

This pipeline will find the longest identifiers for you:

grep -oh '[a-zA-Z][a-zA-Z0-9.]*' $FILES | sort | uniq | perl -ne 'print length($_)-1, " ", $_' | sort -nr | head -10

(I originally had to write the brackets as braces to keep the forum from turning them into touchstones. Is there any way to turn touchstones off?)

If you want to sort by the product of length and repetitions, in order to find the total bytes wasted on an identifier, you can use uniq -c and do a bit more magic in the perl script like this:

grep -oh '[a-zA-Z][a-zA-Z0-9.]*' $FILES | sort | uniq -c | perl -nae 'print $F[0] * length($F[1]), " $F[1]\n"' | sort -nr | head -10

If you have too many files to list them all in $FILES, then grep -r is very nice.

Edited: Jul 28, 2009, 4:57 pm

#5: oohhh, that's clever :)

I don't think there's any way to turn off touchstones, though this might work: [whatever]
[] -> [ ]
ETA: it works, though it's very awkward when editing.

Or just link to a pastie: Ruby solution

Jul 29, 2009, 6:06 am

Yes, #5 is clever. I'd use sed to split on blanks and maybe add a test so only strings that occur more than once are included, but that's just bells and whistles.

Jul 29, 2009, 6:29 am

uniq has an option for that too: add -d to only print lines that occur more than once. I really love uniq :) And especially the combination of sort | uniq -c | sort -n has come in handy many times.

Where would you use sed? I didn't need it because grep -o already prints each match on a separate line. I didn't know about grep -o before I looked it up for this task, so clue++ for me :)

Jul 29, 2009, 8:11 am

I'd use

sed -e 's/ /\n/g' $FILES | sort ...

rather than

grep -oh '[a-zA-Z][a-zA-Z0-9.]*' $FILES | sort ...

so I'd get some slightly different strings, but that's just details. Hmm, I didn't know about uniq -d, so I have a few scripts that use uniq -c | grep -v ' 1 ' instead.

clue++ for me too.

Aug 3, 2009, 9:55 am

I would personally use an IDE with refactoring support for this job. Eclipse seems to have a plugin for JavaScript refactoring: