Class: HTMLReader

Extracts the significant text from an arbitrary HTML document. The contents of any head, script, style, and xml tags are removed completely. For a[href] tags, the URL is extracted along with the tag's inner text. All other tags are removed, and their inner text is kept intact. HTML entities (e.g., &amp;) are not decoded.
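The rules above can be illustrated with a small, self-contained sketch. This is a simplified regex-based stand-in, not the code HTMLReader or string-strip-html actually runs; the function name extractSignificantText is hypothetical, and placing the URL after the link text is an assumption about the output format.

```typescript
// Simplified stand-in for the behaviour described above (hypothetical helper,
// not the library's implementation).
function extractSignificantText(html: string): string {
  return html
    // Drop head, script, style, and xml elements together with their contents.
    .replace(/<(head|script|style|xml)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
    // Keep each link's inner text and append its href (ordering is an assumption).
    .replace(/<a\b[^>]*?\bhref\s*=\s*"([^"]*)"[^>]*>([\s\S]*?)<\/a\s*>/gi, "$2 $1")
    // Remove every remaining tag but keep its inner text.
    .replace(/<[^>]+>/g, " ")
    // Collapse whitespace; entities such as &amp; are deliberately left undecoded.
    .replace(/\s+/g, " ")
    .trim();
}

const sampleHtml =
  '<html><head><title>T</title></head><body><p>Fish &amp; chips</p>' +
  '<script>x()</script><a href="https://example.com">menu</a></body></html>';
const sampleText = extractSignificantText(sampleHtml);
```

Here sampleText keeps the paragraph text with its entity intact, followed by the link text and URL, while the title and script contents are gone.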

Extends

FileReader<Document>

Constructors

new HTMLReader()

new HTMLReader(): HTMLReader

Returns

HTMLReader

Inherited from

FileReader.constructor

Methods

getOptions()

getOptions(): Partial<Opts>

Wrapper for the configuration options passed to the string-strip-html library.

Returns

Partial<Opts>

An object of options for the underlying library.

See

https://codsen.com/os/string-strip-html/examples

Defined in

packages/readers/html/dist/index.d.ts:32

loadData()

loadData(filePath): Promise<Document<Metadata>[]>

Parameters

• filePath: string

Returns

Promise<Document<Metadata>[]>

Inherited from

FileReader.loadData

Defined in

packages/core/schema/dist/index.d.ts:188

loadDataAsContent()

loadDataAsContent(fileContent): Promise<Document<Metadata>[]>

Public method for this reader, required by the BaseReader interface.

Parameters

• fileContent: Uint8Array

The content of the file.

Returns

Promise<Document<Metadata>[]>

A Promise that eventually yields zero or one Document parsed from the HTML content passed in.

Overrides

FileReader.loadDataAsContent

Defined in

packages/readers/html/dist/index.d.ts:18
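Because loadDataAsContent takes a Uint8Array rather than a path, an in-memory HTML string has to be encoded first. The following is a minimal sketch of that call pattern. ContentReader and DocumentLike are hypothetical structural types written only for this example, and the assumption that a Document exposes its content as a text property is not stated on this page.

```typescript
// Hypothetical structural types covering only what this sketch needs.
interface DocumentLike {
  text: string; // assumption: the parsed content lives on `text`
}
interface ContentReader {
  loadDataAsContent(fileContent: Uint8Array): Promise<DocumentLike[]>;
}

// Encode an HTML string to bytes, hand it to the reader, and return the text
// of the single resulting document, or undefined when none is produced.
async function htmlToText(
  reader: ContentReader,
  html: string,
): Promise<string | undefined> {
  const bytes = new TextEncoder().encode(html);
  const docs = await reader.loadDataAsContent(bytes);
  return docs[0]?.text;
}
```

Any object with a compatible loadDataAsContent method can be passed as reader; handling the empty-array case explicitly reflects the "zero or one Document" return described above.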

parseContent()

parseContent(html, options?): Promise<string>

Wrapper for string-strip-html usage.

Parameters

• html: string

Raw HTML content to be parsed.

• options?: Partial<Opts>

An object of options for the underlying library.

Returns

Promise<string>

The HTML content, stripped of unwanted tags and attributes.

See

getOptions

Defined in

packages/readers/html/dist/index.d.ts:26
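One way to customise a single parseContent call is to start from getOptions() and override individual fields on the caller's side. The sketch below assumes nothing about how the reader combines options internally. StripOptions is a hypothetical subset of the string-strip-html option type, HtmlParser is a hypothetical structural type, and stripKeepingTags is an illustrative helper name.

```typescript
// Hypothetical subset of the option type; the real Opts type is defined by
// string-strip-html and has more fields.
interface StripOptions {
  skipHtmlDecoding: boolean;
  stripTogetherWithTheirContents: string[];
  ignoreTags: string[];
}

// Hypothetical structural type mirroring the two documented signatures.
interface HtmlParser {
  getOptions(): Partial<StripOptions>;
  parseContent(html: string, options?: Partial<StripOptions>): Promise<string>;
}

// Start from the reader's own options and ask the stripper to leave the
// given tags in place for this one call.
async function stripKeepingTags(
  parser: HtmlParser,
  html: string,
  keep: string[],
): Promise<string> {
  const options: Partial<StripOptions> = {
    ...parser.getOptions(),
    ignoreTags: keep,
  };
  return parser.parseContent(html, options);
}
```

Spreading getOptions() first means the reader's defaults survive unless a later key replaces them, so only the overridden field changes.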

addMetaData()

static addMetaData(filePath): (doc, index) => void

Parameters

• filePath: string

Returns

Function

Parameters

• doc: BaseNode<Metadata>

• index: number

Returns

void

Inherited from

FileReader.addMetaData

Defined in

packages/core/schema/dist/index.d.ts:189
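The return type (doc, index) => void matches the callback shape of Array.prototype.forEach, so the result of addMetaData(filePath) can be applied to a whole array of documents in one pass. The sketch below shows that curried pattern with a hypothetical stand-in, addFileInfo. The metadata keys it writes are invented for illustration, since this page does not state which fields the real method sets.

```typescript
// Hypothetical minimal node shape for this sketch.
interface NodeLike {
  metadata: Record<string, unknown>;
}

// Stand-in with the same shape as addMetaData: it captures the file path and
// returns a per-document callback. The keys below are illustrative only.
function addFileInfo(filePath: string): (doc: NodeLike, index: number) => void {
  return (doc, index) => {
    doc.metadata["source_path"] = filePath; // hypothetical key
    doc.metadata["position"] = index; // hypothetical key
  };
}

const exampleDocs: NodeLike[] = [{ metadata: {} }, { metadata: {} }];
// forEach supplies (element, index), which is exactly what the callback expects.
exampleDocs.forEach(addFileInfo("pages/index.html"));
```

Capturing filePath in a closure keeps the per-document callback free of extra arguments, which is what lets it be passed directly to forEach.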
