
Notes on Machine Learning and Medical Research Papers

A collection of research paper summaries on machine learning and medicine (brain-computer interfaces and vision). The ML papers mainly address computer vision and sequence modeling problems, and the medical papers focus on vision.

Papers are sorted by topics and tags. Go to the Issues tab to browse, search and filter research papers.

Table of Contents

  • Machine Learning
    • Computer vision
    • Sequential
    • Sequential: Transformer
    • Representation learning
  • Medical
    • Brain computer interface
    • Vision


Machine Learning

Computer vision

Interpretable and Fine-Grained Visual Explanations for Convolutional Neural Networks

  • produces masks over the image to provide interpretability
  • finds the smallest image region that must be retained to preserve (or deleted to change) the model output
  • fine-grained visual explanations, without smoothing or regularisation

Stand-Alone Self-Attention in Vision Models

  • self-attention can be an effective stand-alone layer

On the relationship between self-attention and convolutional layers

  • attention layers can perform convolution; they learn to behave similarly to convolutional layers
  • a multi-head self-attention layer with a sufficient number of heads is at least as expressive as any convolutional layer

Dynamic Convolution: Attention over Convolution Kernels

  • increases model complexity without increasing the network depth or width
  • instead of a single convolution kernel per layer, dynamic convolution aggregates multiple parallel convolution kernels, weighted by input-dependent attention (a minimal sketch follows this list)
  • can be easily integrated into existing CNN architectures
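
A minimal sketch of the idea in PyTorch; the kernel count, attention branch, and layer sizes below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Aggregate K parallel conv kernels with input-dependent attention weights."""
    def __init__(self, in_ch, out_ch, kernel_size=3, num_kernels=4):
        super().__init__()
        self.num_kernels = num_kernels
        # K parallel kernels, combined per sample before the convolution
        self.weight = nn.Parameter(
            torch.randn(num_kernels, out_ch, in_ch, kernel_size, kernel_size) * 0.02)
        # tiny attention branch: global pooling -> softmax over the K kernels
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, num_kernels))

    def forward(self, x):
        b, c, h, w = x.shape
        pi = F.softmax(self.attn(x), dim=1)                  # (B, K) input-dependent weights
        # per-sample aggregated kernel: sum_k pi_k * W_k
        w_agg = torch.einsum('bk,koihw->boihw', pi, self.weight)
        # grouped-conv trick applies a different kernel to each sample in the batch
        x = x.reshape(1, b * c, h, w)
        out = F.conv2d(x, w_agg.reshape(-1, c, *w_agg.shape[-2:]),
                       padding=1, groups=b)
        return out.reshape(b, -1, h, w)

# x = torch.randn(2, 16, 32, 32); y = DynamicConv2d(16, 32)(x)  # -> (2, 32, 32, 32)
```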

Dynamic Group Convolution for Accelerating Convolutional Neural Networks

  • proposes dynamic group convolution (DGC), which adaptively selects, per sample and on the fly, which input channels are connected within each group
  • introduces a tiny auxiliary feature selector for each group that dynamically decides which input channels to connect, based on the activations of all input channels
  • multiple groups can adaptively capture abundant and complementary visual/semantic features for each input image
  • keeps computational efficiency similar to conventional group convolution

An image is worth 16x16 words: Transformers for image recognition at scale

  • global image attention over patches
  • learns to attend to patches far away even at the lower layers, which a convnet cannot (see the patch-embedding sketch below)
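
A minimal sketch of the patch-embedding step that turns an image into a sequence of tokens the transformer attends over; the sizes below (224x224 images, 16x16 patches, 768-dim tokens, 12 heads) are common ViT defaults used purely as an illustration.

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)                    # (B, C, H, W)
patch, dim = 16, 768

# non-overlapping 16x16 patches via a strided conv, then flatten to a sequence
to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
tokens = to_patches(img).flatten(2).transpose(1, 2)  # (1, 196, 768): 14*14 patch tokens

# each token can attend to any other token, even in the first transformer layer,
# so distant patches interact immediately (unlike a convnet's local receptive field)
attn = nn.MultiheadAttention(dim, num_heads=12, batch_first=True)
out, weights = attn(tokens, tokens, tokens)          # (1, 196, 768), (1, 196, 196)
```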

End-to-End Video Instance Segmentation with Transformers

  • end-to-end instance segmentation on video frames, tracking the object across frames
  • achieves the highest speed among all existing video instance segmentation models, and the best results

Deep learning-enabled medical computer vision

  • four key considerations when applying ML technologies in healthcare:
    1. assessment of data,
    2. planning for model limitations,
    3. community participation, and
    4. trust building

Bottleneck Transformers for Visual Recognition

  • incorporates self-attention into ResNet's bottleneck blocks, improving instance segmentation and object detection while reducing parameters
  • convolution plus self-attention can beat the ImageNet benchmark; pure-attention ViT models struggle in the small-data regime but shine in the large-data regime

Sequential

An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

  • results show that the proposed TCN outperforms RNN, LSTM, and GRU baselines across a broad range of sequence modeling tasks
  • TCNs exhibit substantially longer memory, and are thus more suitable for domains where a long history is required (see the dilated causal convolution sketch below)
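
A minimal sketch of the dilated causal convolution behind that long memory, in PyTorch; the channel sizes and dilation schedule are arbitrary assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1D convolution that only looks at past samples, with a dilation factor."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation       # left-pad so output depends only on the past
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                             # x: (B, C, T)
        x = nn.functional.pad(x, (self.pad, 0))       # pad on the left only (causal)
        return self.conv(x)

# stacking layers with dilations 1, 2, 4, ... roughly doubles the receptive field per layer:
# receptive_field = 1 + sum((kernel_size - 1) * d for d in dilations)
layers = nn.Sequential(*[CausalConv1d(8, 8, 3, dilation=2**i) for i in range(6)])
y = layers(torch.randn(1, 8, 100))                    # -> (1, 8, 100), receptive field = 127
```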

Machine translation of cortical activity to text with an encoder–decoder framework

  • an RNN encoder-decoder sequence-to-sequence network that acts like a machine translation system, going from ECoG to words; it builds a representation to map between the two different sources

Speech synthesis from neural decoding of spoken sentences

  • translates neural activity into speech

Wavenet: A generative model for raw audio

  • a deep generative model of audio data that operates directly at the waveform level. WaveNets are autoregressive and combine causal filters with dilated convolutions to allow their receptive fields to grow exponentially with depth, which is important to model the long-range temporal dependencies in audio signals.
  • promising results when applied to music audio modeling and speech recognition

Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation

  • a fully convolutional time-domain audio separation network consisting of three processing stages: encoder, separation, and decoder

Convolutional Sequence to Sequence Learning

  • fully convolutional model for sequence to sequence learning
  • use of gated linear units eases gradient propagation
  • separate attention mechanism for each decoder layer
  • outperforms strong recurrent models on very large benchmark datasets

Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions

  • a fully convolutional encoder and a simple decoder can give superior results to a strong RNN baseline while being an order of magnitude more efficient; key to the success of the convolutional encoder is a time-depth separable block structure, which allows the model to retain a large receptive field

Parallel wavenet: Fast high-fidelity speech synthesis

  • high-fidelity speech synthesis based on WaveNet using Probability Density Distillation

Tacotron: Towards End-to-End Speech Synthesis

  • end-to-end generative text-to-speech model that synthesizes speech directly from characters
  • train from <text, audio> pairs, model takes characters as input and outputs raw spectrogram

Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis

  • text encoding is passed to a block-autoregressive decoder using attention, producing conditioning features
  • use location-sensitive attention, which was more stable than the non-content-based GMM attention

Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis

  • using simple location-relative attention mechanisms to do away with content-based query/key comparisons, to handle out-of-domain text
  • introduce a new location-relative attention mechanism to the additive energy-based family, called Dynamic Convolution Attention (DCA)

Pay Less Attention with Lightweight and Dynamic Convolutions

  • introduce dynamic convolutions which are simpler and more efficient than self-attention
  • very lightweight convolution can perform competitively to the best reported self-attention results
  • number of operations required by this approach scales linearly in the input length, whereas self-attention is quadratic

Learning representations from EEG with deep recurrent-convolutional neural networks

  • designed to preserve the spatial, spectral, and temporal structure of EEG which leads to finding features that are less sensitive to variations and distortions within each dimension
  • robust to inter- and intra-subject differences, as well as to measurement-related noise

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

  • a framework for self-supervised learning of representations from raw audio data, wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations

Improved Noisy Student Training for Automatic Speech Recognition

  • adapts and improves noisy student training for automatic speech recognition (noisy student training is an iterative self-training method that leverages augmentation to improve network performance)

Visual to Sound: Generating Natural Sound for Videos in the Wild

  • from video frames to audio

SampleRNN: An unconditional end-to-end neural audio generation model

  • able to capture underlying sources of variations in the temporal sequences over very long time spans
  • uses a hierarchy of modules, each operating at a different temporal resolution; the lowest module processes individual samples, and each higher module operates on an increasingly longer timescale and a lower temporal resolution

Generating Visually Aligned Sound from Videos

  • RegNet - video sound generation, visually aligned sound, audio forwarding regularizer
  • using GAN, learn a correct mapping between video frames and visually relevant sound

WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

  • text-to-speech synthesis, synthesizes the waveform directly without using hand-designed intermediate features (e.g., spectrograms)

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

  • text to speech synthesis, sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms

Sequential: Transformer

Transformer-xl: Attentive language models beyond a fixed-length context

  • enables learning dependency beyond a fixed length without disrupting temporal coherence
  • resolves the context fragmentation problem

Compressive transformers for long-range sequence modelling

  • compress memory mechanism to compress past memories for long-range sequence learning

Reformer: The efficient transformer

  • more memory efficient and faster on long sequences

Music transformer: Generating music with long-term structure

  • relative attention is very well-suited for generative modeling of symbolic music
  • extends relative attention to much longer sequences, such as long texts or even audio waveforms

Conformer: Convolution-augmented Transformer for Speech Recognition

  • combine convolutions with self-attention in ASR models
  • self-attention learns the global interaction whilst the convolutions efficiently capture the relative-offset-based local correlations

Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss

  • use the attention in Transformer-XL and apply to speech recognition
  • end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system

Rethinking Attention with Performers

  • proposes a set of techniques called Fast Attention Via positive Orthogonal Random features (FAVOR+) to approximate softmax self attention in Transformers and achieve better space and time complexity when the sequence length is much higher than feature dimensions
  • Performers are provably and practically accurate in estimating regular full-rank attention without relying on priors such as sparsity or low-rankness; they can also efficiently model other kernelizable attention mechanisms beyond softmax, and with this strong representation power achieve better empirical results than regular Transformers on some datasets
  • tested on a rich set of tasks including pixel-prediction, language modeling and protein sequence modeling, and demonstrated competitive results with other examined efficient sparse and dense attention models

Linformer: Self-Attention with Linear Complexity

  • self-attention mechanism can be approximated by a low-rank matrix, reduces the overall self-attention complexity from O(n^2) to O(n) in both time and space.

Transformers are rnns: Fast autoregressive transformers with linear attention

  • reformulates the attention mechanism in terms of kernel functions and obtains a linear formulation, which reduces these requirements; surprisingly, this formulation also surfaces an interesting connection between autoregressive transformers and RNNs (a minimal sketch follows)
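
A minimal sketch of that linear formulation in NumPy: replace the softmax with a feature map phi(.) (elu + 1, as in the paper), so attention becomes phi(Q) (phi(K)^T V) and the n x n attention matrix is never materialized. Shapes and data are arbitrary.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1, a positive feature map for phi(.)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """softmax(QK^T)V approximated as phi(Q) (phi(K)^T V), computed in O(n)."""
    Qp, Kp = feature_map(Q), feature_map(K)            # (n, d)
    KV = Kp.T @ V                                      # (d, d) -- no n x n matrix is formed
    Z = Qp @ Kp.sum(axis=0, keepdims=True).T           # (n, 1) normalizer
    return (Qp @ KV) / Z

n, d = 1000, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = linear_attention(Q, K, V)                        # (1000, 64), linear in sequence length
```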

An image is worth 16x16 words: Transformers for image recognition at scale

  • global image attention over patches
  • learns to attend to patches far away even at the lower layers, which a convnet cannot

Big bird: Transformers for longer sequences

  • scales transformers to long sequences by replacing the full quadratic attention mechanism with a combination of random attention, window attention, and global attention (a mask-construction sketch follows this list)
  • allows the processing of longer sequences, translating to state-of-the-art experimental results
  • theoretical guarantees of universal approximation and Turing completeness
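
A minimal sketch of how those three patterns combine into one boolean attention mask (NumPy); the window width, number of random links, and global token count are arbitrary illustrative choices.

```python
import numpy as np

def bigbird_mask(n, window=3, n_random=2, n_global=2, seed=0):
    """Boolean (n, n) mask: True where a query may attend to a key."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True                                         # sliding-window attention
        mask[i, rng.choice(n, size=n_random, replace=False)] = True   # random attention
    mask[:n_global, :] = True                                         # global tokens attend everywhere
    mask[:, :n_global] = True                                         # and everything attends to them
    return mask

m = bigbird_mask(16)
print(m.sum(), "of", m.size, "entries kept")   # far fewer than the full 16 * 16
```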

Long Range Arena : A Benchmark for Efficient Transformers

  • a benchmark for transformer tasks
  • compare performance and speed across xformer models

Earthquake transformer—an attentive deep-learning model for simultaneous earthquake detection and phase picking

  • attention in each task branch focuses on the parts of the waveform useful for that task

O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers

  • this paper proves that sparse transformers can approximate any sequence-to-sequence function as well as their dense counterparts

Are Transformers universal approximators of sequence-to-sequence functions?

  • multi-head self-attention layers can indeed compute contextual mappings of the input sequences
  • Transformers can represent any sequence-to-sequence functions, Transformers are universal approximators of continuous and permutation equivariant sequence-to-sequence functions with compact support

Fast Transformers with Clustered Attention

  • improves the computational complexity of self-attention by clustering the queries into non-overlapping clusters

Transformers with convolutional context for ASR

  • replacing the sinusoidal positional embedding for transformers with convolutionally learned input representations
  • fixed learning rate of 1.0 and no warmup steps

Exploring Transformers for Large-Scale Speech Recognition

  • performs ASR with a streaming approach based on the Transformer-XL network
  • compare BLSTM to Transformer and Transformer-XL

Transformers without Tears: Improving the Normalization of Self-Attention

  • ScaleNorm: normalization with a single scale parameter for faster training and better performance

Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

  • reduce space complexity: query sparsity measurement
  • reduce time complexity: ProbSparse
  • predict sequence in one batch: generative style decoder (decoder generates long sequences with 1 forward pass)

Representation learning

Deep Canonical Correlation Analysis

  • learn complex nonlinear transformations of two views of data such that the resulting representations are highly linearly correlated
  • significantly higher correlation than those learned by CCA and KCCA
  • introduce a novel non-saturating sigmoid function based on the cube root

Medical

Brain computer interface

Learning across multi-stimulus enhances target recognition methods in SSVEP-based BCIs

  • covers a variety of CCAs
  • estimate reliable spatial filters and SSVEP templates given small calibration data

Deep Learning-based Classification for Brain-Computer Interfaces

  • comparing CNN and LSTM for 5-class SSVEP classification

Learning representations from EEG with deep recurrent-convolutional neural networks

  • designed to preserve the spatial, spectral, and temporal structure of EEG which leads to finding features that are less sensitive to variations and distortions within each dimension
  • robust to inter- and intra-subject differences, as well as to measurement-related noise

Retinotopic and topographic analyses with gaze restriction for steady-state visual evoked potentials

  • findings provide a basis for determining stimulus parameters for neural engineering studies
  • proposed experimental paradigm could also provide a precise framework for future SSVEP-related studies

Steady-state visually evoked potentials: Focus on essential paradigms and future perspectives

  • Provide details on SSVEP, essential to understand and build SSVEP based experiments

Filter bank canonical correlation analysis for implementing a high-speed SSVEP-based brain–computer interface

  • introduces filter bank canonical correlation analysis (FBCCA) for high-speed SSVEP detection

Methods of EEG Signal Features Extraction Using Linear Analysis in Frequency and Time-Frequency Domains

  • different ways to extract features from EEG

MI-EEGNET: A novel Convolutional Neural Network for motor imagery classification

  • outperformed existing methods on BCI Competition IV dataset IIa

A Radial Zoom Motion-Based Paradigm for Steady State Motion Visual Evoked Potentials

  • the radial zoom motion-based SSMVEP paradigm achieves slightly lower accuracy than flicker, but its comfort and fatigue scores are much better than those of the flicker stimulus

Selective attention to stimulus location modulates the steady-state visual evoked potential

  • a 1996 paper; presented users with 2 stimuli, showing that with selective attention, SSVEP extraction is plausible

Four Novel Motion Paradigms Based on Steady-state Motion Visual Evoked Potential

  • four stimulus paradigms based on basic motion modes: swing, rotation, spiral, and radial contraction-expansion

Highly Interactive Brain–Computer Interface Based on Flicker-Free Steady-State Motion Visual Evoked Potential

  • the motion checkerboard stimulation method keeps uniform brightness across all local areas, delivering pure motion stimuli; motion blur is further reduced with a high-refresh-rate display, eliciting SSMVEPs at a single frequency

Comparison of Modern Highly Interactive Flicker-Free Steady State Motion Visual Evoked Potentials for Practical Brain–Computer Interfaces

  • tested 5 motion-based stimuli; motion evoked potentials are a more comfortable alternative to flickering visual stimulation
  • performance comparable to flickering visual stimulation

A new dual-frequency stimulation method to increase the number of visual stimuli for multi-class SSVEP-based brain–computer interface

  • as SSVEP frequencies are generally limited by the monitor, this method allows us to increase the number of visual stimuli when necessary
  • ITR 33.26 bits/min, accuracy of 87.23%

Electrophysiological correlates of gist perception: a steady-state visually evoked potentials study

  • multi-stimulus paradigms are suitable for measuring brain activity related specifically to each stimulus separately
  • two neighboring stimuli were flickered at different frequencies, SSVEPs enabled us to separate the responses to the two distinct stimuli by extracting oscillatory brain responses
  • succeeded in eliciting oscillatory brain responses at the driving stimuli’s frequencies, their harmonics, and the intermodulation frequency, that is, f1 + f2 = 20.57 Hz
  • brain’s response at a linear combination of two frequencies
  • demonstrates that SSVEPs are an excellent method to unravel mechanisms underlying the processing within multi-stimulus displays in the context of gist perception
  • multiple stimulus displays combined with the analysis of intermodulation frequencies make this an ideal approach to investigate gist perception in multi-stimulus processing (a small illustration of intermodulation components follows this list)
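
A quick, self-contained illustration of how intermodulation components arise: pass the sum of two flicker responses through a static nonlinearity and the spectrum gains peaks at f1 + f2 and f2 - f1 in addition to the driving frequencies and their harmonics. The frequencies and nonlinearity below are hypothetical, not the study's actual parameters.

```python
import numpy as np

fs, duration = 500, 10.0                  # assumed sampling rate and window length
t = np.arange(0, duration, 1 / fs)
f1, f2 = 7.0, 12.0                        # hypothetical flicker frequencies

linear = np.sin(2 * np.pi * f1 * t) + np.sin(2 * np.pi * f2 * t)
nonlinear = linear + 0.3 * linear**2      # a simple static nonlinearity mixes the two inputs

freqs = np.fft.rfftfreq(len(t), 1 / fs)
spectrum = np.abs(np.fft.rfft(nonlinear)) / len(t)

for target in (f1, f2, f1 + f2, f2 - f1): # intermodulation shows up at f1 +/- f2
    idx = np.argmin(np.abs(freqs - target))
    print(f"{target:5.1f} Hz amplitude: {spectrum[idx]:.3f}")
```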

Perception of illusory contours forms intermodulation responses of steady state visual evoked potentials as a neural signature of spatial integration

  • spectral decomposition of the measured EEG can show additional peaks at frequencies that are linear combinations of the driving frequencies
  • show that the perception of an illusory rectangle resulted in a significant increase of amplitudes in two intermodulation frequencies

From intermodulation components to visual perception and cognition-a review

  • explore different uses of intermodulation, and review a range of recent studies exploiting intermodulation in visual perception research

Frequency recognition based on canonical correlation analysis for SSVEP-based BCIs

  • an important and popular paper on CCA for SSVEP detection (a minimal detection sketch follows)
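
The standard recipe is compact enough to sketch: build sine/cosine reference signals at each candidate stimulus frequency (and its harmonics), compute the canonical correlation between the multichannel EEG segment and each reference set, and pick the frequency with the highest correlation. The sampling rate, channel count, and frequency set below are assumptions, and sklearn's CCA stands in for a dedicated implementation.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

fs, n_channels, duration = 250, 8, 2.0           # assumed recording parameters
t = np.arange(0, duration, 1 / fs)
candidate_freqs = [8.0, 10.0, 12.0, 15.0]        # assumed stimulus frequencies

def reference_signals(f, n_harmonics=2):
    """Sin/cos references at the stimulus frequency and its harmonics, shape (T, 2*H)."""
    refs = [fn(2 * np.pi * h * f * t) for h in range(1, n_harmonics + 1)
            for fn in (np.sin, np.cos)]
    return np.stack(refs, axis=1)

def detect_frequency(eeg):                       # eeg: (T, n_channels)
    scores = []
    for f in candidate_freqs:
        cca = CCA(n_components=1)
        x_c, y_c = cca.fit_transform(eeg, reference_signals(f))
        scores.append(abs(np.corrcoef(x_c[:, 0], y_c[:, 0])[0, 1]))
    return candidate_freqs[int(np.argmax(scores))]

# synthetic check: EEG-like noise plus a 12 Hz component on every channel
eeg = 0.5 * np.random.randn(len(t), n_channels) + np.sin(2 * np.pi * 12 * t)[:, None]
print(detect_frequency(eeg))                     # expected: 12.0
```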

Computational modeling and application of steady-state visual evoked potentials in brain-computer interfaces

  • shows how SSVEP works
  • notes on SSVEP

Spatial Filtering in SSVEP-Based BCIs: Unified Framework and New Improvements

  • describes popular CCA techniques (itCCA, eCCA, Transfer Template CCA, Filter Bank CCA, Task-Related Component Analysis)
  • a unified framework under which the spatial filtering algorithms can be formulated as generalized eigenvalue problems (GEPs) with four different elements: data, temporal filter, orthogonal projection, and spatial filter
  • designs new spatial filtering algorithms

Spatial Filtering Based on Canonical Correlation Analysis for Classification of Evoked or Event-Related Potentials in EEG Data

  • to increase classification accuracy, spatial filters are used to improve the signal-to-noise ratio of the brain signals and thereby facilitate the detection and classification of SSVEP and other VEPs

SSVEP enhancement based on Canonical Correlation Analysis to improve BCI performances

  • investigates CCA as a signal enhancement method rather than a feature extraction method
  • makes use of CCA's ability to handle multichannel EEG and find the space in which EEG samples correlate the most with the stimuli
  • CCA yields effective weights (spatial filters) with relatively small training sets

Multiway Canonical Correlation Analysis of Brain Signals

  • CCA does not address the issue of comparing or merging responses across more than two subjects
  • Multiway CCA can be applied effectively to multi-subject datasets of EEG, to denoise the data prior to further analyses, and to summarize the data and reveal traits common across the population of subjects
  • MCCA-based denoising yields significantly better scores in an auditory stimulus-response classification task, and MCCA-based joint analysis of fMRI data reveals detailed subject-specific activation topographies

Spatial smoothing of canonical correlation analysis for steady state visual evoked potential based brain computer interfaces

  • CCA spatial filter becomes spatially smooth to give robustness in short signal length condition

Learning across multi-stimulus enhances target recognition methods in SSVEP-based BCIs

  • utilizes training data from not only the target stimulus but also the neighboring stimuli, which consequently improves learning performance

An amplitude-modulated visual stimulation for reducing eye fatigue in SSVEP-based brain-computer interfaces

  • reduction of eye fatigue for SSVEP
  • Four targets were used in combinations of three different modulating frequencies and two different carrier frequencies in the offline experiment, and two additional targets were added with one additional modulating and one carrier frequency in online experiments
  • results: caused lower eye fatigue and less sensing of flickering than a low-frequency stimulus, in a manner similar to a high-frequency stimulus

Visual evoked potential and psychophysical contrast thresholds in glaucoma

  • VEP and psychophysical thresholds can be quite uncorrelated in glaucoma; the two types of thresholds contain independent information about glaucoma that could usefully be combined

Contrast sensitivity and visual disability in chronic simple glaucoma

  • battery of vision tests was used to quantify visual defect
  • static contrast sensitivity function appears to be the most sensitive method of measuring visual defect in glaucoma patients
  • vertical sinewave gratings

Insights for mfVEPs from perimetry using large spatial frequency-doubling and near frequency-doubling stimuli in glaucoma

  • lower and higher temporal frequency tests probed the same neural mechanism
  • no advantage of spatial frequency-doubling stimuli for mfVEPs

Multifocal frequency-doubling pattern visual evoked responses to dichoptic stimulation

  • results indicated that dichoptic evoked potentials using multifocal frequency-doubling illusion stimuli are practical. The use of crossed orientation, or differing spatial frequencies, in the two eyes reduced binocular interactions.

Vision

A comparison of covert and overt attention as a control option in a steady-state visual evoked potential-based brain computer interface

  • The average accuracy is found to be reduced by ~20% in the switch from overt to covert attention
  • SSVEPs resulting from stimuli located in foveal vision are known to be of large amplitude and very robust
  • stimuli located in peripheral vision generate SSVEPs of much smaller amplitude

Neural Differences between Covert and Overt Attention Studied using EEG with Simultaneous Remote Eye Tracking

  • EEG analysis of the period preceding the saccade latency showed similar occipital response amplitudes for overt and covert shifts, although response latencies differed.
  • combined EEG and eye tracking can be successfully used to study natural overt shifts of attention
  • There were no striking differences in early response components between overt and covert shifts in fronto-central areas
  • Most studies of covert vs. overt attention involve instructing the participant to attend to a particular region of the field via a centrally presented cue, and so can be considered as an endogenous direction of attention. In contrast, our experiment provided an exogenous trigger for attention, by the appearance of a target in a peripheral field location. Thus it is possible that a different pattern of activation would be seen in the covert direction of attention by an endogenous cue

Visual field testing for glaucoma – a practical guide

  • gives a good idea of what glaucoma patients see, useful when developing test cases

Walking enhances peripheral visual processing in humans

  • walking leads to an increased processing of peripheral input
  • increased contrast sensitivity for peripheral compared to central stimuli when subjects were walking

The steady-state visual evoked potential in vision research: A review

  • provided details about SSVEP
  • applications of SSVEP

Multifocal Visual Evoked Potential (mfVEP) and Pattern-Reversal Visual Evoked Potential Changes in Patients with Visual Pathway Disorders: A Case Series

  • mfVEP may provide a more accurate assessment of visual defects when compared with PVEP
  • demonstrates that mfVEP, as an objective test for visual fields, is potentially more sensitive than PVEP in detecting focal visual pathway pathology

Study for Analysis of the Multifocal Visual Evoked Potential

  • To introduce the clinical utility of the absolute value of the reconstructed waveform method in the analysis of multifocal visual evoked potential (mfVEP).

Multifocal visual evoked potentials for quantifying optic nerve dysfunction in patients with optic disc drusen

  • To explore the applicability of multifocal visual evoked potentials (mfVEPs) for research and clinical diagnosis in patients with optic disc drusen (ODD). This is the first assessment of mfVEP amplitude in patients with ODD.

Steady-state multifocal visual evoked potential (ssmfVEP) using dartboard stimulation as a possible tool for objective visual field assessment

  • To investigate whether a conventional, monitor-based multifocal visual evoked potential (mfVEP) system can be used to record steady-state mfVEP (ssmfVEP) in healthy subjects and to study the effects of temporal frequency, electrode configuration and alpha waves.

A Review of Deep Learning for Screening, Diagnosis, and Detection of Glaucoma Progression

  • review on deep learning and ophthalmology

Objective visual field determination in forensic ophthalmology with an optimized 4-channel multifocal VEP perimetry system: a case report of a patient with retinitis pigmentosa

  • an objective technique for evaluating cortical activity; mfVEP was able to prove the concentric reduction of the visual field in this patient with late-stage retinitis pigmentosa

An oblique effect in parafoveal motion perception

  • discriminates the angular direction of moving gratings at different frequencies in the parafovea

Choice of Grating Orientation for Evaluation of Peripheral Vision

  • evaluate peripheral resolution and detection for different orientations in different visual field meridians

Motion Perception in the Peripheral Visual Field

  • peripheral perception for motion stimulus

Development of Grating Acuity and Contrast Sensitivity in the Central and Peripheral Visual Field of the Human Infant

  • analyzes EEG responses to stimuli in the periphery (8 to 16 degrees) to determine separate responses for the central and peripheral fields
  • no infant in this study had higher acuity in the peripheral field than in the central field
  • however, the peripheral field is relatively more mature at birth

Motion perception in the peripheral visual field

  • performance in the temporal hemifield was slightly superior to that in the nasal hemifield and depended on the orientation as well as on the direction of the motion.
  • perception of horizontal motion was better than that of vertical motion.
  • in spite of large variations, centrifugal motion was significantly more readily perceived than centripetal motion.

Speed of visual processing increases with eccentricity

  • the fovea has the resolution required to process fine spatial information, but the periphery is more sensitive to temporal properties
  • speed of information processing varies with eccentricity: processing was faster when same-size stimuli appeared at 9° than 4° eccentricity
  • at the same eccentricity, larger stimuli are processed more slowly

Stimulus dependencies of an illusory motion: Investigations of the Motion Bridging Effect

  • using ring of points, check retinal eccentricity with various configurations

Ehud Kaplan on Receptive fields

  • about M-cells and the P-cells