Skip to content

Implement Suffix Automaton for difflib's Longest Common Substring #144132

@dg-pb

Description

@dg-pb

Feature or enhancement

Proposal:

O(n) solution is not that complex - 100 of Python lines.

I hacked in initial implementation and ran context_diff on the difflib.py - the change itself.

import difflib, time

with open('/Users/Edu/lib/src/cx_amb/src/cx/cli/snd/diff/difflib.py') as f:
    s1 = f.read().splitlines(keepends=True)

with open('/Users/Edu/lib/src/cx_amb/src/cx/cli/snd/diff/difflib_new.py') as f:
    s2 = f.read().splitlines(keepends=True)

list(difflib.context_diff(s1, s2, fromfile='before.py', tofile='after.py'))
# 2.4 ms (Current)
# 2.9 ms (New)

So 30% faster.
So 20% slower.
Also, this is a fairly rough plug-in - there is a chance that the gap can be bridged.

O(n) solution should avoid some of the user dissatisfaction:
#106865

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

Made a couple of posts in unrelated thread.
Performance tables are there showing that this is faster even for sizes of 10.
https://discuss.python.org/t/include-time-and-space-complexity-in-documentation/105683/32

Linked PRs

Metadata

Metadata

Assignees

No one assigned

    Labels

    performancePerformance or resource usagestdlibStandard Library Python modules in the Lib/ directorytype-featureA feature request or enhancement

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions