Regex-function to merge endnotes files in editor

EbookMakers · 10-21-2020, 03:40 AM

Do you have old epubs with one xhtml page per endnote? Since version 3 of calibre, Kovid has provided a checkbox (in the docx configuration) that prevents this separation of the endnotes during a docx -> epub conversion. An epub -> epub conversion cannot change it.

The calibre interface makes it easy to group the notes manually into a single page; the most time-consuming part is determining which files are affected...

I am trying to do this automatically with a regex-function. It runs without an error message.

I have two issues:

1) The editor interface is not updated at the end of the regex-function. If I save a copy of the epub and examine that file, it shows that the merge was successful. Without really knowing whether this was the right place to look, I tried using apply_container_update_to_gui, but without success. How can I update the interface?

2) The function is executed in "spine order". But how can we indicate that we want to start with the 1st file of the book, regardless of the current file, in order to group the notes in the 1st note file?

A test file is attached.

kovidgoyal · 10-21-2020, 05:27 AM

Just ctrl-click the files in the Files browser in the editor, then right click and choose merge.

EbookMakers · 10-21-2020, 07:24 AM

Thank you for your answer. I know, and this is why I started by writing: "The interface of calibre makes it easy to manually group the notes into a single page, the longest being to determine which files are affected...". And the regex without the function can help me find which files are affected.
Will it make the function do what I hope?

kovidgoyal · 10-21-2020, 08:08 AM

To refresh the UI, use the boss object:

from calibre.gui2.tweak_book.boss import get_boss
get_boss().apply_container_update_to_gui()

EbookMakers · 10-21-2020, 09:11 AM

Thanks a lot, Kovid. I'll try.

EbookMakers · 10-21-2020, 10:15 AM

After modifying the function as you indicated, the files that are not the "merge master" are deleted in the editor. But the "merge master" file still shows only one note.

However, the function is executed correctly: after a "commit as", we find all the desired modifications in the saved file.

Is there also a way to force the search to start with the 1st file and not with the current file?

EbookMakers · 11-23-2020, 09:05 PM

A test epub is attached to the lead post of this topic. We can think of two solutions. A solution for well-behaved people like you and even me, and a solution for rascals. They use the same regex:

Code:

<body[^\n]*\n\K\s*(<h[^>]*>[^<]*</h\d>)?\s*<dl[^>]*>\s*<dt[^>]*>\[<a\b(?:(?!</dl).)+</dl>\s*(?=</body>)

The \K switch resets the start of the selection. The expression placed before the switch is equivalent to a positive lookbehind assertion. I use it, for my own reasons, to maintain compatibility with the PCRE engine, which does not accept variable-length lookbehind assertions such as the one needed here.
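
To illustrate what \K does, here is a minimal sketch using only Python's standard `re` module, which does not support \K. It emulates the switch by capturing the prefix and putting it back in the replacement. The helper name `sub_keeping_prefix` and the sample markup are hypothetical, chosen only for this illustration; they are not part of calibre.

```python
import re


def sub_keeping_prefix(prefix, target, replacement, text, flags=0):
    """Emulate 'prefix\\Ktarget': the prefix must match, but only target is replaced."""
    # The prefix is captured in a named group so it can be re-emitted unchanged.
    pattern = re.compile('(?P<kept>' + prefix + ')(?:' + target + ')', flags)
    return pattern.sub(lambda m: m.group('kept') + replacement, text)


sample = '<body class="calibre">\n<dl>old</dl>\n</body>'
print(sub_keeping_prefix(r'<body[^\n]*\n', r'<dl>old</dl>', '<dl>new</dl>', sample))
```

The body tag line is required for the match but comes out untouched, which is the effect the \K switch gives directly in engines that support it.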

On an epub respecting the html syntax resulting from a docx -> epub conversion, the regex selects:

- the note in files containing one note and only one according to the syntax of the conversion, ensuring that the note is surrounded by the pair of body tags.
- in optional group 1, the title preceding the 1st note only (after the conversion).

The regex successively selects the solitary notes which respect the syntax of the conversion. It therefore also tells you the names of the xhtml files which contain them. Counting the matches of the regex tells you whether the epub is a candidate for the regex-function at all. Merging of notes should only be requested if there are at least two notes. If group 1 exists, the file contains the 1st note.
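
The selection rules above can be tried outside the editor with a small sketch. It makes several assumptions: it uses the standard `re` module, so the \K switch is dropped (the match then also includes the body tag line), the "dot matches newline" behaviour is enabled with `re.DOTALL`, and the function name `classify` and the sample markup are hypothetical.

```python
import re

# Same expression as above, minus \K, which the standard 're' module does not accept.
NOTE_RE = re.compile(
    r'<body[^\n]*\n\s*(<h[^>]*>[^<]*</h\d>)?\s*<dl[^>]*>\s*<dt[^>]*>\[<a\b'
    r'(?:(?!</dl).)+</dl>\s*(?=</body>)',
    re.DOTALL,
)


def classify(html):
    """Return 'first-note', 'note' or 'no-note' for the text of one xhtml file."""
    m = NOTE_RE.search(html)
    if m is None:
        return 'no-note'
    # Group 1 is the optional title, present only in the file holding the 1st note.
    return 'first-note' if m.group(1) else 'note'


first = ('<body class="c">\n<h2>Notes</h2>\n<dl>\n<dt>[<a href="t.xhtml#b1">1</a>]</dt>\n'
         '<dd><p>One.</p></dd>\n</dl>\n</body>')
other = '<body class="c">\n<dl>\n<dt>[<a href="t.xhtml#b2">2</a>]</dt>\n<dd><p>Two.</p></dd>\n</dl>\n</body>'
print(classify(first), classify(other))
```

A file that already holds two or more notes is not selected, because the expression requires the closing body tag right after the first closing dl tag.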

We cannot predict on which (active) file the regex will start. We can ask that it browse the files in the “spine” order with the parameter:
replace.file_order = 'spine'

We only know that the occurrence for which group 1 exists is the 1st note. Both solutions rely on this characteristic to obtain a file with the notes starting with the 1st note and then in the correct order. Otherwise, as stated in a previous message, the order of the notes in the result file would depend on the active file when launching the regex.

One argument to the replace function is “data”, which is a “dict” that persists for the whole execution of the function. Our two functions store their information in this dict.

It is possible to request that the function be executed a last time after the last occurrence:
replace.call_after_last_match = True

It is during this last call that the merge will be requested. Merge updates the note calls in the text and the opf file (since it deletes files). The display must then be updated in the editor, as written above by Kovid:
get_boss().apply_container_update_to_gui()
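
The interplay between the persistent dict and the extra final call can be sketched without calibre. In the sketch below, `simulate_replace_all` is a hypothetical stand-in for the editor's Replace All loop, written only for illustration; the real editor decides the calling details. Only the `replace` signature and the two attributes come from this thread.

```python
import re


def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    if match is None:
        # Extra call after the last match: summarise what was stored in 'data'.
        data['summary'] = '%d matches in %d files' % (
            data.get('count', 0), len(data.get('files', [])))
        return ''
    data['count'] = data.get('count', 0) + 1
    files = data.setdefault('files', [])
    if file_name not in files:
        files.append(file_name)
    return match.group()  # leave the text unchanged


replace.call_after_last_match = True
replace.file_order = 'spine'


def simulate_replace_all(func, pattern, files):
    """Hypothetical driver: one shared 'data' dict, then one final call with match=None."""
    data, counter, result = {}, [0], {}
    for name, text in files:
        def repl(m, name=name):
            counter[0] += 1
            return func(m, counter[0], name, None, None, data, None)
        result[name] = re.sub(pattern, repl, text)
    if getattr(func, 'call_after_last_match', False):
        func(None, counter[0], None, None, None, data, None)
    return result, data


book = [('ch1.xhtml', 'x1 x2'), ('ch2.xhtml', 'x3')]
result, data = simulate_replace_all(replace, r'x\d', book)
print(data['summary'])
```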

A major problem is that the result of the regex-function comes from the “return” of the “replace” function, even though the merge is executed after processing the last occurrence! One would have expected that the result of the regex-function would come from the "merge". The main difference between the two solutions is how to work around this problem.

Both functions are commented.

EbookMakers · 11-23-2020, 09:06 PM

The function for rascals

The function builds two lists of file names. Which list a file name goes into depends on whether the function has already encountered the file containing the title, and this in turn depends on the active file when the regex is launched. The file containing the title is the one containing the 1st note, the one into which the other files will be merged.

The two lists of file names are merged to get the complete list of files to be merged.
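
The two-list idea can be shown in isolation with a minimal sketch. It assumes the search visits the note files in spine order, starting from the active file and wrapping around; `order_note_files` and the file names are hypothetical.

```python
def order_note_files(visited):
    """visited: (file_name, has_title) pairs, in the order the search reached them.

    Returns (merge_master, ordered_file_names).
    """
    first_notes, last_notes = [], []
    merge_master = None
    seen_title = False
    for file_name, has_title in visited:
        if has_title:
            # The file with the title holds the 1st note: it becomes the merge master.
            seen_title = True
            merge_master = file_name
        # Files met before the title belong after those met from the title onwards.
        (first_notes if seen_title else last_notes).append(file_name)
    return merge_master, first_notes + last_notes


# The search started on the 3rd note file and wrapped around to the 1st.
visited = [('n3.xhtml', False), ('n4.xhtml', False), ('n1.xhtml', True), ('n2.xhtml', False)]
print(order_note_files(visited))
```

Whatever the starting file, the concatenation puts the file with the title first and keeps the others in spine order.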

We raise an exception with the 'raise' statement to stop the function after the 'merge' and before the 'return', which would cancel the result. This is the dirty side of the job.

At runtime, a warning message appears, which says: Merging: Files merged out.

Code:

from calibre.gui2.tweak_book import current_container
from calibre.gui2.tweak_book.boss import get_boss
from calibre.ebooks.oeb.polish.split import merge

class Merging(LookupError):
    # Warning: very dirty workaround
    # Custom exception class, raised to end the job without a return
    # If we use return, we lose the result of the merge
    pass

    
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    
    if match is None:  # this is the last passage (all matches found)
        ctnr = current_container()
        # Merge files whose name is in the list 'note_files_list'
        # (stored in persistent dict 'data', a parameter of replace())
        # into the file whose name is in 'merge_master' (also stored in data)
        # .get() avoids a KeyError if no match was ever recorded
        data['note_files_list'] = data.get('first_notes', []) + data.get('last_notes', [])
        if len(data['note_files_list']) > 1:
            merge(ctnr, 'text', data['note_files_list'], data['merge_master'])
            get_boss().apply_container_update_to_gui()

        # very dirty trick: get out without applying 'return':
        raise Merging("Files merged out")
                        
    else:

        if 'merge_master' not in data:
            # data is empty, therefore it's the 1st iteration
            # the list of files and the master of merge are initialized
            data['note_files_list'] = []
            data['merge_master'] = []
            
            if match.group(1):
                # If group 1 exists, the note contains the title and therefore the note is the 1st note
                # The master of merge becomes the current note file
                data['first'] = True
                data['merge_master'] = file_name
                data['first_notes'] = [file_name]
                data['last_notes'] = []
            else:
                data['first'] = False
                data['first_notes'] = []
                data['last_notes'] = [file_name]

            # Ask for a passage after the last find (match will be None)
            # Ask for processing the files in the order they appear in the book
            replace.call_after_last_match = True
            replace.file_order = 'spine'
           
        else:
            if match.group(1):
                data['first'] = True
                # The master of merge becomes the current note file
                data['merge_master'] = file_name
            # Increments the list of files by adding the name of the current file
            if data['first']:
                data['first_notes'].append(file_name)
            else:
                data['last_notes'].append(file_name)
           
        return match.group()

EbookMakers · 11-23-2020, 09:07 PM

The function for well-behaved people

This function is not interrupted; the 'return' is executed. But this 'return' replaces the selected note with the set of notes already encountered, ordered by a mechanism similar to that of the rascal function. Therefore, after merging and returning, we get a single file containing all of the notes in order.
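
The accumulation can be followed step by step with a small sketch that leaves calibre out entirely. `accumulate_notes` is a hypothetical helper, and the short strings stand in for the html of each note; the list it returns shows what each successive call would hand back as the replacement.

```python
def accumulate_notes(visited):
    """visited: (note_html, has_title) pairs, in the order the search reached them.

    Returns the replacement text produced at each successive match.
    """
    first_notes = ''
    last_notes = ''
    seen_title = False
    returned = []
    for note_html, has_title in visited:
        if has_title:
            seen_title = True
        if seen_title:
            first_notes += note_html   # the 1st note and everything met after it
        else:
            last_notes += note_html    # notes met before the 1st note was reached
        returned.append(first_notes + last_notes)
    return returned


steps = accumulate_notes([('<n3>', False), ('<n4>', False), ('<T><n1>', True), ('<n2>', False)])
print(steps[-1])
```

Only the value returned for the last match contains every note, which is why the last file processed is the one that ends up holding the complete, ordered set.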

Code:

# <body[^\n]*\n\K\s*(<h[^>]*>[^<]*</h\d>)?\s*<dl[^>]*>\s*<dt[^>]*>\[<a\b(?:(?!</dl).)+</dl>\s*(?=</body>)

from calibre.gui2.tweak_book import current_container
from calibre.gui2.tweak_book.boss import get_boss
from calibre.ebooks.oeb.polish.split import merge

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):

    if match is None:  # this is the last passage (all matches found)
        ctnr = current_container()
        # Merge at least 2 files whose name is in the list 'note_files_list'
        # (stored in persistent dict 'data', a parameter of replace())
        # into the file whose name is in 'merge_master' (also stored in data)
        if data and len(data['note_files_list']) > 1:
            merge(ctnr, 'text', data['note_files_list'], data['merge_master'])
            get_boss().apply_container_update_to_gui()

    else:

        if 'merge_master' not in data :  
            # data is empty, therefore it's the 1st iteration
            # the list of files is initialized with the current note file
            # The master of merge is the current note file
            data['note_files_list'] = [file_name]
            data['merge_master'] = file_name

            if match.group(1):
                # If group 1 exists, the note contains the title and therefore the note is the 1st note
                data['first'] = True
                data['first_notes'] = match.group()
                data['last_notes'] = ''
            else:
                data['first'] = False
                data['first_notes'] = ''
                data['last_notes'] = match.group()

            # Ask for a passage after the last find (match will be None)
            # Ask for processing the files in the order they appear in the book
            replace.call_after_last_match = True
            replace.file_order = 'spine'
           
        else:
            # Increments the list of files by adding the name of the current file
            # The master of merge becomes the current note file
            data['note_files_list'].append(file_name)
            data['merge_master'] = file_name
            if match.group(1):
                data['first'] = True
            if data['first']:
                # If first is true, the function has already processed the 1st note,
                # we concatenate in first_notes
                data['first_notes'] = data['first_notes'] + match.group()
            else:
                # Otherwise in last_notes
                data['last_notes'] = data['last_notes'] + match.group()

        data['all_notes'] = data['first_notes'] + data['last_notes']
        # print(data['note_files_list'], data['merge_master'])
        return data['all_notes']

EbookMakers · 11-24-2020, 09:39 AM

I modified the function in message #8 to add the 'pass' statement to the 'Merging' class, instead of relying on the comment to do nothing.

roger64 · 11-26-2020, 11:09 PM

Hi

Maybe I misunderstood something, but I can't see why it would be necessary to use a regex-function for this.

1. The calibre editor, as Kovid wrote, can merge all the notes placed in their own pages.
2. On the CSS side, I fail to see the usefulness of ordered list code for the footnotes. It's a separate issue of course.

Here is the end result after these two changes.

EbookMakers · 11-27-2020, 09:35 AM

The regex alone is useful, since its match count tells us immediately whether the epub is concerned or not. Unless we created them ourselves, we don't necessarily know a lot about our epubs.

It is correct that a regex-function is not necessary to merge note files; this was already stated in message #1.

This is just a small, unpretentious exercise that first shows a use of 'merge' and 'apply_container_update_to_gui'.

It uses 'replace.call_after_last_match = True' and shows that the content of the 'return' takes precedence over the changes made to the text by this last call, when one would expect the opposite. It gives 2 ways to work around this constraint.

It also shows some data manipulation in the persistent dict 'data'.



